✨ catalogd graphql shift to file-based cache by grokspawn · Pull Request #2732 · operator-framework/operator-controller

grokspawn · 2026-05-29T20:07:19Z

Description

This PR proposes a shift in the graphql service endpoint from in-memory to on-disk caching which leverages common architectural conventions like fan-out catalog index generation (graphql-schema.json) which is then accessed to fulfill queries to eliminate RSS increases from the feature initial implementation.

Included is a structured graphql validation package as a focus for future limits enforcement and adds several cold-cache/cache-miss concurrency tests.

Referenced against a diverse set of existing catalogs, this change reduces RSS 39X on average:

┌───────────────┬────────┬─────────┬──────────┬──────────┬───────────┐
│    Catalog    │  Raw   │ Objects │   OLD    │   NEW    │ Reduction │
│               │  Size  │         │  Memory  │  Memory  │           │
├───────────────┼────────┼─────────┼──────────┼──────────┼───────────┤
│ r-o-i/4.23    │ 31MB   │ 1,244   │ ~126MB   │ ~2MB     │ 63x       │
├───────────────┼────────┼─────────┼──────────┼──────────┼───────────┤
│ ce-o-i/4.22   │ 8.6MB  │ 954     │ ~35MB    │ ~2MB     │ 17x       │
├───────────────┼────────┼─────────┼──────────┼──────────┼───────────┤
│ co-o-i/4.22   │ 20MB   │ 4,958   │ ~79MB    │ ~2MB     │ 39x       │
└───────────────┴────────┴─────────┴──────────┴──────────┴───────────┘

The new file-system storage is approximately 200KB/catalog for the new file offsets index data.

Query latency shifts from microsecond access to low millisecond for the same access approach, with a transient per-query allocation between ~1-10MB depending on the size of the object(s) returned in the query.

NB: I ran some benchmarks after investigating repeated failures of the st2ex e2e test, and determined that unbounded FBC evaluation during schema detection phase was causing excessive delays in catalogd service readiness. Instituted a bound of ten (10) samples of a given Meta to determine the appropriate GraphQL schema type, which allows the functionality to complete operation on example catalogs above with a worst-case of 250ms on test hardware. These changes are in a separate commit to help them stand out.

Reviewer Checklist

API Go Documentation
Tests: Unit Tests (and E2E Tests, if appropriate)
Comprehensive Commit Messages
Links to related GitHub Issue(s)

netlify · 2026-05-29T20:07:25Z

✅ Deploy Preview for olmv1 ready!

Name	Link
🔨 Latest commit	`77e8ecb`
🔍 Latest deploy log	https://app.netlify.com/projects/olmv1/deploys/6a3aeb589b394e00088c58d5
😎 Deploy Preview	https://deploy-preview-2732--olmv1.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.
🤖 Make changes	Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

Copilot

Pull request overview

This PR shifts catalogd’s GraphQL implementation from building/caching schemas off in-memory catalog metas to a file-based approach: schema metadata is discovered during Store() and persisted to disk, while query execution loads requested objects from catalog.jsonl on-demand using byte offsets from index.json. It also introduces a query complexity validation helper and updates handlers/tests to use the new GraphQL service interface.

Changes:

Persist GraphQL schema metadata to graphql-schema.json during catalog storage and load it from disk for schema building.
Switch query execution to disk-backed object loading using index-based byte offsets (reducing RSS growth from caching parsed objects).
Add a GraphQL query complexity validation helper and expand concurrency/singleflight tests around cache misses/builds.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
internal/catalogd/storage/localdir.go	Writes `graphql-schema.json` on store, adds disk-backed object loader, and adjusts GraphQL pre-warm behavior.
internal/catalogd/storage/index.go	Exposes schema-section byte ranges for disk-backed GraphQL pagination.
internal/catalogd/service/graphql_service.go	Refactors service to use a storage-provided data provider and adds query validation/timeout.
internal/catalogd/service/graphql_service_test.go	Updates tests to use the new provider-based GraphQL service and adds singleflight coverage via provider counting.
internal/catalogd/server/handlers.go	Updates GraphQL handler to execute queries without needing a catalog FS.
internal/catalogd/server/handlers_test.go	Updates mocks/tests for the new GraphQL service interface and error behavior.
internal/catalogd/graphql/validation.go	Adds AST-based query complexity validation (depth/aliases/fields).
internal/catalogd/graphql/graphql.go	Adds schema serialization/deserialization and switches to loader-based query-time object retrieval.
hack/demo/graphql-demo-server/main.go	Updates demo server to build schema once and serve GraphQL directly from the generated schema.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

codecov · 2026-05-29T20:26:51Z

Codecov Report

❌ Patch coverage is 76.32242% with 94 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.39%. Comparing base (e4879a3) to head (77e8ecb).
⚠️ Report is 3 commits behind head on main.

Files with missing lines	Patch %	Lines
internal/catalogd/storage/localdir.go	67.02%	21 Missing and 10 partials ⚠️
hack/demo/graphql-demo-server/main.go	0.00%	27 Missing ⚠️
internal/catalogd/graphql/graphql.go	88.11%	15 Missing and 9 partials ⚠️
internal/catalogd/service/graphql_service.go	61.90%	7 Missing and 1 partial ⚠️
internal/catalogd/graphql/validation.go	90.90%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2732      +/-   ##
==========================================
+ Coverage   70.54%   71.39%   +0.85%     
==========================================
  Files         143      144       +1     
  Lines       10617    10888     +271     
==========================================
+ Hits         7490     7774     +284     
+ Misses       2568     2535      -33     
- Partials      559      579      +20

Flag	Coverage Δ
e2e	`34.16% <1.35%> (-0.99%)`	⬇️
experimental-e2e	`52.20% <44.05%> (-0.39%)`	⬇️
unit	`61.07% <75.81%> (+1.57%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Signed-off-by: grokspawn <jordan@nimblewidget.com>

openshift-ci · 2026-06-22T16:21:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign pedjak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: grokspawn <jordan@nimblewidget.com>

Copilot

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.

Files not reviewed (1)

internal/testutil/mock/catalogdservice/mock_graphqlservice.go: Generated file

Signed-off-by: grokspawn <jordan@nimblewidget.com>

Copilot

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.

Files not reviewed (1)

internal/testutil/mock/catalogdservice/mock_graphqlservice.go: Generated file

perdasilva

Code Review — 10 findings (3 correctness, 1 performance, 6 cleanup/structural)

The most impactful fix would be keeping the write lock held through the entire Store() operation (or using per-catalog locks), which would resolve findings 1–3 in one change.

Finding not placeable inline:

Performance — O(N×regex) field resolution on every query (graphql.go:678)

createFieldResolver iterates all source keys calling remapFieldName (3 regex replacements each) to find the matching key. For a query returning 100 objects with 20 fields, requesting 5 fields: 5 × 100 × 20 × 3 = 30,000 regex operations per query.

FieldInfo.OriginalName is already stored during schema discovery — the resolver could capture it and do a direct O(1) map lookup:

func createFieldResolver(fieldName, originalName string) graphql.FieldResolveFn {
    return func(p graphql.ResolveParams) (interface{}, error) {
        if source, ok := p.Source.(map[string]interface{}); ok {
            if value, ok := source[originalName]; ok {
                return marshalComplexValue(value), nil
            }
        }
        return nil, nil
    }
}

Same issue in createNestedFieldResolver (line 698) and createNestedObjectType resolver (line 653).

🤖 Generated with Claude Code

perdasilva · 2026-06-26T09:52:13Z

+		if _, err := s.graphqlSvc.GetSchema(ctx, catalog); err != nil {
+			s.graphqlSvc.InvalidateCache(catalog)
+			s.m.Lock()
+			removeErr := os.RemoveAll(catalogDir)


Bug — Store() rollback race

After the write lock is released at line 90, a concurrent Store("same-catalog", fsB) can complete its own storeAtomicSwap, replacing the catalog directory with valid fsB data. If this goroutine's GetSchema then fails, os.RemoveAll(catalogDir) here deletes the other goroutine's valid catalog directory — catalogDir is a deterministic path (rootDir/catalogName), not the one this goroutine wrote.

The result: both Stores fail, and the catalog is left in a missing state despite fsB being valid.

Consider keeping the write lock held through the entire Store() operation, or using per-catalog locks.

perdasilva · 2026-06-26T09:52:14Z

+		s.m.Unlock()
+		return err
+	}
+	s.m.Unlock()


Bug — stale cached schema serves queries against replaced catalog file

Between s.m.Unlock() here and InvalidateCache at line 93, a concurrent GraphQL query can hit the old cached schema. That schema's ObjectLoader holds byte offsets from the previous index file. Since storeAtomicSwap already replaced catalog.jsonl with different content, the ObjectLoader reads the new file at old offsets → corrupted/wrong JSON objects.

This is the same root cause as the rollback race: the lock is released too early. Holding the lock through InvalidateCache (or through the entire pre-warm) would close this window.

perdasilva · 2026-06-26T09:52:14Z


 	storeMetaFuncs := []storeMetasFunc{storeCatalogData}
-	if s.EnableMetasHandler {
+	if s.EnableMetasHandler || s.EnableGraphQLQueries == GraphQLQueriesEnabled {


Bug — ContentExists doesn't verify GraphQL-required files

storeIndexData is now enabled when GraphQL is enabled (even with MetasHandlerDisabled), and discoverAndStoreSchema writes graphql-schema.json. But ContentExists (line 230) only checks index.json when EnableMetasHandler is set — it never checks graphql-schema.json.

If a crash leaves catalog.jsonl intact but graphql-schema.json or index.json missing, ContentExists returns true while GraphQL queries fail with file-not-found from LoadCatalogSchema/loadIndex.

if s.EnableGraphQLQueries == GraphQLQueriesEnabled { // check index.json and graphql-schema.json too }

perdasilva · 2026-06-26T09:52:14Z

+		}
+
+		info := catalogSchema.Schemas[meta.Schema]
+		info.TotalObjects++


Bug — TotalObjects counts blobs that fail json.Unmarshal

TotalObjects is incremented here before the json.Unmarshal at line 597. If a blob is malformed, TotalObjects includes it but the ObjectLoader (localdir.go:365) silently skips parse failures.

A client using summary.totalObjects for pagination will request beyond the actual data. Move the increment after the successful unmarshal, or count failures separately.

perdasilva · 2026-06-26T09:52:14Z

 // FieldInfo represents discovered field information
 type FieldInfo struct {
 	Name         string
+	OriginalName string // The original JSON key before remapping


Cleanup — SampleValues (line 34) and SampleObject (line 42) are dead state

OriginalName is a great addition. However, SampleValues is populated during schema discovery (lines 432, 444–448, 477, 490–491) via appendUnique — which JSON-marshals all existing elements on every call — but is never read by any consumer: not by serialization (serializableFieldInfo omits it), not by BuildDynamicGraphQLSchema, not by resolvers, not by PrintCatalogSummary.

Same for SampleObject on SchemaInfo (line 42) — written at lines 562–563 and 603–604, never read.

Removing both fields and appendUnique would eliminate ~40 lines and the per-meta JSON marshaling overhead during every Store operation.

perdasilva · 2026-06-26T09:52:14Z

+// DiscoverSchemaFromChannel performs streaming schema discovery, processing
+// one meta at a time through a channel. Each meta's blob is parsed, analyzed,
+// and then goes out of scope — avoiding accumulation of all blobs in memory.
+func DiscoverSchemaFromChannel(metasChan <-chan *declcfg.Meta) (*CatalogSchema, error) {


Cleanup — duplicates DiscoverSchemaFromMetas body

This function's body (lines 576–611) is nearly identical to DiscoverSchemaFromMetas (lines 531–571): init CatalogSchema, skip empty schema, ensure map entry, increment TotalObjects, unmarshal blob, set SampleObject, call analyzeJSONObject. Only the iteration source differs (channel vs slice).

A shared processMeta(catalogSchema, meta) helper would reduce both to a ~5-line loop, eliminating ~35 duplicated lines and the risk of divergence (e.g., if the TotalObjects counting bug above is fixed in one but not the other).

perdasilva · 2026-06-26T09:52:14Z

+
+// loadIndex loads the index from disk using singleflight for efficiency.
+func (s *LocalDirV1) loadIndex(catalog string) (*index, error) {
+	s.m.RLock()


Efficiency — RLock held across singleflight disk I/O

loadIndex acquires RLock here and holds it through the entire singleflight.Do (line 379), which opens and JSON-decodes the index file. All concurrent callers sharing the singleflight also hold RLocks while waiting for the leader to finish.

During this time, Store() cannot acquire the write lock. For a large index or slow disk, this creates unbounded reader starvation of writers. Consider acquiring the RLock only for the path computation, or using a separate lock for the singleflight.

perdasilva · 2026-06-26T09:52:14Z

+
+	c := &queryComplexity{
+		fragments: make(map[string]*ast.FragmentDefinition),
+		visited:   make(map[string]bool),


Minor bug — fragment visited map shared across operations

The visited map is created once per ValidateQueryComplexity call and shared across all OperationDefinitions in the document (line 40–46). If two operations both spread the same fragment (query A { ...F } query B { ...F }), validating op A marks fragment F as visited (line 83). Validating op B finds F already visited (line 80) and skips counting its fields.

This under-counts complexity for multi-operation documents with shared fragments. Low impact since clients typically send one operation per request, but the fix is simple: reset visited per operation.

perdasilva · 2026-06-26T09:52:14Z

+// serializableFieldInfo is a JSON-friendly representation of FieldInfo
+type serializableFieldInfo struct {
+	Name            string                            `json:"name"`
+	OriginalName    string                            `json:"originalName"`


Structural — serializable shadow hierarchy adds maintenance burden

serializableFieldInfo/serializableSchemaInfo/serializableCatalogSchema plus 6 converter functions (~130 lines) exist because FieldInfo stores non-JSON-serializable reflect.Kind and graphql.Type.

But both are derivable: reflect.Kind round-trips through a string, and graphql.Type is already reconstructed from JSONType + IsArray via jsonTypeToGraphQL. If FieldInfo stored a string enum for JSONType and dropped GraphQLType (reconstructing it lazily in BuildDynamicGraphQLSchema), this entire shadow hierarchy collapses into plain json.Marshal/Unmarshal with struct tags.

Copilot AI review requested due to automatic review settings May 29, 2026 20:07

openshift-ci Bot requested review from dtfranz and perdasilva May 29, 2026 20:07

Copilot started reviewing on behalf of grokspawn May 29, 2026 20:07 View session

Copilot AI reviewed May 29, 2026

View reviewed changes

Comment thread internal/catalogd/storage/localdir.go

Comment thread internal/catalogd/storage/localdir.go

Comment thread internal/catalogd/storage/localdir.go Outdated

Comment thread internal/catalogd/graphql/graphql.go

Comment thread internal/catalogd/graphql/validation.go

grokspawn mentioned this pull request May 29, 2026

OPRUN-4042: olmv1 catalogd graphql service openshift/enhancements#2025

Open

grokspawn force-pushed the graphql-file-cache branch from ec01fe5 to 374e69e Compare June 5, 2026 16:26

Copilot AI review requested due to automatic review settings June 5, 2026 17:05

Copilot started reviewing on behalf of grokspawn June 5, 2026 17:05 View session

Copilot AI reviewed Jun 5, 2026

grokspawn force-pushed the graphql-file-cache branch from f704ad1 to 1abb2b6 Compare June 5, 2026 19:38

grokspawn mentioned this pull request Jun 8, 2026

[epic] Add graphql catalogd service endpoint #2727

Open

Copilot AI review requested due to automatic review settings June 8, 2026 19:26

grokspawn force-pushed the graphql-file-cache branch from 1abb2b6 to 04a2bc0 Compare June 8, 2026 19:26

Copilot started reviewing on behalf of grokspawn June 8, 2026 19:26 View session

Copilot AI reviewed Jun 8, 2026

View reviewed changes

Comment thread internal/catalogd/storage/localdir.go

Comment thread internal/catalogd/storage/localdir.go

Comment thread internal/catalogd/graphql/validation.go

grokspawn added 2 commits June 22, 2026 11:20

shift to file-based cache, leveraging existing metas indexer

2ff059a

review comments

e048b50

Signed-off-by: grokspawn <jordan@nimblewidget.com>

grokspawn force-pushed the graphql-file-cache branch from 04a2bc0 to e048b50 Compare June 22, 2026 16:21

bounds-protect type evaluation depth

b9fca8b

Signed-off-by: grokspawn <jordan@nimblewidget.com>

Copilot AI review requested due to automatic review settings June 22, 2026 18:22

Copilot started reviewing on behalf of grokspawn June 22, 2026 18:23 View session

Copilot AI reviewed Jun 22, 2026

View reviewed changes

Comment thread internal/catalogd/storage/localdir.go

Comment thread internal/catalogd/storage/localdir.go