Skip to content

✨ catalogd graphql shift to file-based cache#2732

Open
grokspawn wants to merge 4 commits into
operator-framework:mainfrom
grokspawn:graphql-file-cache
Open

✨ catalogd graphql shift to file-based cache#2732
grokspawn wants to merge 4 commits into
operator-framework:mainfrom
grokspawn:graphql-file-cache

Conversation

@grokspawn

@grokspawn grokspawn commented May 29, 2026

Copy link
Copy Markdown
Contributor

Description

This PR proposes a shift in the graphql service endpoint from in-memory to on-disk caching which leverages common architectural conventions like fan-out catalog index generation (graphql-schema.json) which is then accessed to fulfill queries to eliminate RSS increases from the feature initial implementation.

Included is a structured graphql validation package as a focus for future limits enforcement and adds several cold-cache/cache-miss concurrency tests.

Referenced against a diverse set of existing catalogs, this change reduces RSS 39X on average:

┌───────────────┬────────┬─────────┬──────────┬──────────┬───────────┐
│    Catalog    │  Raw   │ Objects │   OLD    │   NEW    │ Reduction │
│               │  Size  │         │  Memory  │  Memory  │           │
├───────────────┼────────┼─────────┼──────────┼──────────┼───────────┤
│ r-o-i/4.23    │ 31MB   │ 1,244   │ ~126MB   │ ~2MB     │ 63x       │
├───────────────┼────────┼─────────┼──────────┼──────────┼───────────┤
│ ce-o-i/4.22   │ 8.6MB  │ 954     │ ~35MB    │ ~2MB     │ 17x       │
├───────────────┼────────┼─────────┼──────────┼──────────┼───────────┤
│ co-o-i/4.22   │ 20MB   │ 4,958   │ ~79MB    │ ~2MB     │ 39x       │
└───────────────┴────────┴─────────┴──────────┴──────────┴───────────┘

The new file-system storage is approximately 200KB/catalog for the new file offsets index data.

Query latency shifts from microsecond access to low millisecond for the same access approach, with a transient per-query allocation between ~1-10MB depending on the size of the object(s) returned in the query.

NB: I ran some benchmarks after investigating repeated failures of the st2ex e2e test, and determined that unbounded FBC evaluation during schema detection phase was causing excessive delays in catalogd service readiness. Instituted a bound of ten (10) samples of a given Meta to determine the appropriate GraphQL schema type, which allows the functionality to complete operation on example catalogs above with a worst-case of 250ms on test hardware. These changes are in a separate commit to help them stand out.

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

Copilot AI review requested due to automatic review settings May 29, 2026 20:07
@openshift-ci openshift-ci Bot requested review from dtfranz and perdasilva May 29, 2026 20:07
@netlify

netlify Bot commented May 29, 2026

Copy link
Copy Markdown

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 77e8ecb
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/6a3aeb589b394e00088c58d5
😎 Deploy Preview https://deploy-preview-2732--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
🤖 Make changes Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR shifts catalogd’s GraphQL implementation from building/caching schemas off in-memory catalog metas to a file-based approach: schema metadata is discovered during Store() and persisted to disk, while query execution loads requested objects from catalog.jsonl on-demand using byte offsets from index.json. It also introduces a query complexity validation helper and updates handlers/tests to use the new GraphQL service interface.

Changes:

  • Persist GraphQL schema metadata to graphql-schema.json during catalog storage and load it from disk for schema building.
  • Switch query execution to disk-backed object loading using index-based byte offsets (reducing RSS growth from caching parsed objects).
  • Add a GraphQL query complexity validation helper and expand concurrency/singleflight tests around cache misses/builds.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
internal/catalogd/storage/localdir.go Writes graphql-schema.json on store, adds disk-backed object loader, and adjusts GraphQL pre-warm behavior.
internal/catalogd/storage/index.go Exposes schema-section byte ranges for disk-backed GraphQL pagination.
internal/catalogd/service/graphql_service.go Refactors service to use a storage-provided data provider and adds query validation/timeout.
internal/catalogd/service/graphql_service_test.go Updates tests to use the new provider-based GraphQL service and adds singleflight coverage via provider counting.
internal/catalogd/server/handlers.go Updates GraphQL handler to execute queries without needing a catalog FS.
internal/catalogd/server/handlers_test.go Updates mocks/tests for the new GraphQL service interface and error behavior.
internal/catalogd/graphql/validation.go Adds AST-based query complexity validation (depth/aliases/fields).
internal/catalogd/graphql/graphql.go Adds schema serialization/deserialization and switches to loader-based query-time object retrieval.
hack/demo/graphql-demo-server/main.go Updates demo server to build schema once and serve GraphQL directly from the generated schema.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/catalogd/storage/localdir.go
Comment thread internal/catalogd/storage/localdir.go
Comment thread internal/catalogd/storage/localdir.go Outdated
Comment thread internal/catalogd/graphql/graphql.go
Comment thread internal/catalogd/graphql/validation.go
@codecov

codecov Bot commented May 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 76.32242% with 94 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.39%. Comparing base (e4879a3) to head (77e8ecb).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
internal/catalogd/storage/localdir.go 67.02% 21 Missing and 10 partials ⚠️
hack/demo/graphql-demo-server/main.go 0.00% 27 Missing ⚠️
internal/catalogd/graphql/graphql.go 88.11% 15 Missing and 9 partials ⚠️
internal/catalogd/service/graphql_service.go 61.90% 7 Missing and 1 partial ⚠️
internal/catalogd/graphql/validation.go 90.90% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2732      +/-   ##
==========================================
+ Coverage   70.54%   71.39%   +0.85%     
==========================================
  Files         143      144       +1     
  Lines       10617    10888     +271     
==========================================
+ Hits         7490     7774     +284     
+ Misses       2568     2535      -33     
- Partials      559      579      +20     
Flag Coverage Δ
e2e 34.16% <1.35%> (-0.99%) ⬇️
experimental-e2e 52.20% <44.05%> (-0.39%) ⬇️
unit 61.07% <75.81%> (+1.57%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@grokspawn grokspawn force-pushed the graphql-file-cache branch from ec01fe5 to 374e69e Compare June 5, 2026 16:26
Copilot AI review requested due to automatic review settings June 5, 2026 17:05

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@grokspawn grokspawn force-pushed the graphql-file-cache branch from f704ad1 to 1abb2b6 Compare June 5, 2026 19:38
Copilot AI review requested due to automatic review settings June 8, 2026 19:26
@grokspawn grokspawn force-pushed the graphql-file-cache branch from 1abb2b6 to 04a2bc0 Compare June 8, 2026 19:26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment thread internal/catalogd/storage/localdir.go
Comment thread internal/catalogd/storage/localdir.go
Comment thread internal/catalogd/graphql/validation.go
@grokspawn grokspawn force-pushed the graphql-file-cache branch from 04a2bc0 to e048b50 Compare June 22, 2026 16:21
@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign pedjak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: grokspawn <jordan@nimblewidget.com>
Copilot AI review requested due to automatic review settings June 22, 2026 18:22

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • internal/testutil/mock/catalogdservice/mock_graphqlservice.go: Generated file

Comment thread internal/catalogd/storage/localdir.go
Comment thread internal/catalogd/storage/localdir.go
Comment thread internal/catalogd/graphql/graphql.go
Comment thread internal/catalogd/graphql/validation.go
Comment thread internal/catalogd/storage/localdir.go
Comment thread internal/catalogd/storage/localdir.go
Comment thread internal/catalogd/storage/localdir.go
Signed-off-by: grokspawn <jordan@nimblewidget.com>
Copilot AI review requested due to automatic review settings June 23, 2026 20:23
@grokspawn grokspawn force-pushed the graphql-file-cache branch from ca13bf8 to 77e8ecb Compare June 23, 2026 20:23

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • internal/testutil/mock/catalogdservice/mock_graphqlservice.go: Generated file

Comment thread internal/catalogd/storage/localdir.go
Comment thread internal/catalogd/graphql/validation.go

@perdasilva perdasilva left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — 10 findings (3 correctness, 1 performance, 6 cleanup/structural)

The most impactful fix would be keeping the write lock held through the entire Store() operation (or using per-catalog locks), which would resolve findings 1–3 in one change.

Finding not placeable inline:

Performance — O(N×regex) field resolution on every query (graphql.go:678)

createFieldResolver iterates all source keys calling remapFieldName (3 regex replacements each) to find the matching key. For a query returning 100 objects with 20 fields, requesting 5 fields: 5 × 100 × 20 × 3 = 30,000 regex operations per query.

FieldInfo.OriginalName is already stored during schema discovery — the resolver could capture it and do a direct O(1) map lookup:

func createFieldResolver(fieldName, originalName string) graphql.FieldResolveFn {
    return func(p graphql.ResolveParams) (interface{}, error) {
        if source, ok := p.Source.(map[string]interface{}); ok {
            if value, ok := source[originalName]; ok {
                return marshalComplexValue(value), nil
            }
        }
        return nil, nil
    }
}

Same issue in createNestedFieldResolver (line 698) and createNestedObjectType resolver (line 653).

🤖 Generated with Claude Code

if _, err := s.graphqlSvc.GetSchema(ctx, catalog); err != nil {
s.graphqlSvc.InvalidateCache(catalog)
s.m.Lock()
removeErr := os.RemoveAll(catalogDir)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug — Store() rollback race

After the write lock is released at line 90, a concurrent Store("same-catalog", fsB) can complete its own storeAtomicSwap, replacing the catalog directory with valid fsB data. If this goroutine's GetSchema then fails, os.RemoveAll(catalogDir) here deletes the other goroutine's valid catalog directory — catalogDir is a deterministic path (rootDir/catalogName), not the one this goroutine wrote.

The result: both Stores fail, and the catalog is left in a missing state despite fsB being valid.

Consider keeping the write lock held through the entire Store() operation, or using per-catalog locks.

s.m.Unlock()
return err
}
s.m.Unlock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug — stale cached schema serves queries against replaced catalog file

Between s.m.Unlock() here and InvalidateCache at line 93, a concurrent GraphQL query can hit the old cached schema. That schema's ObjectLoader holds byte offsets from the previous index file. Since storeAtomicSwap already replaced catalog.jsonl with different content, the ObjectLoader reads the new file at old offsets → corrupted/wrong JSON objects.

This is the same root cause as the rollback race: the lock is released too early. Holding the lock through InvalidateCache (or through the entire pre-warm) would close this window.


storeMetaFuncs := []storeMetasFunc{storeCatalogData}
if s.EnableMetasHandler {
if s.EnableMetasHandler || s.EnableGraphQLQueries == GraphQLQueriesEnabled {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug — ContentExists doesn't verify GraphQL-required files

storeIndexData is now enabled when GraphQL is enabled (even with MetasHandlerDisabled), and discoverAndStoreSchema writes graphql-schema.json. But ContentExists (line 230) only checks index.json when EnableMetasHandler is set — it never checks graphql-schema.json.

If a crash leaves catalog.jsonl intact but graphql-schema.json or index.json missing, ContentExists returns true while GraphQL queries fail with file-not-found from LoadCatalogSchema/loadIndex.

if s.EnableGraphQLQueries == GraphQLQueriesEnabled {
    // check index.json and graphql-schema.json too
}

}

info := catalogSchema.Schemas[meta.Schema]
info.TotalObjects++

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug — TotalObjects counts blobs that fail json.Unmarshal

TotalObjects is incremented here before the json.Unmarshal at line 597. If a blob is malformed, TotalObjects includes it but the ObjectLoader (localdir.go:365) silently skips parse failures.

A client using summary.totalObjects for pagination will request beyond the actual data. Move the increment after the successful unmarshal, or count failures separately.

// FieldInfo represents discovered field information
type FieldInfo struct {
Name string
OriginalName string // The original JSON key before remapping

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup — SampleValues (line 34) and SampleObject (line 42) are dead state

OriginalName is a great addition. However, SampleValues is populated during schema discovery (lines 432, 444–448, 477, 490–491) via appendUnique — which JSON-marshals all existing elements on every call — but is never read by any consumer: not by serialization (serializableFieldInfo omits it), not by BuildDynamicGraphQLSchema, not by resolvers, not by PrintCatalogSummary.

Same for SampleObject on SchemaInfo (line 42) — written at lines 562–563 and 603–604, never read.

Removing both fields and appendUnique would eliminate ~40 lines and the per-meta JSON marshaling overhead during every Store operation.

// DiscoverSchemaFromChannel performs streaming schema discovery, processing
// one meta at a time through a channel. Each meta's blob is parsed, analyzed,
// and then goes out of scope — avoiding accumulation of all blobs in memory.
func DiscoverSchemaFromChannel(metasChan <-chan *declcfg.Meta) (*CatalogSchema, error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup — duplicates DiscoverSchemaFromMetas body

This function's body (lines 576–611) is nearly identical to DiscoverSchemaFromMetas (lines 531–571): init CatalogSchema, skip empty schema, ensure map entry, increment TotalObjects, unmarshal blob, set SampleObject, call analyzeJSONObject. Only the iteration source differs (channel vs slice).

A shared processMeta(catalogSchema, meta) helper would reduce both to a ~5-line loop, eliminating ~35 duplicated lines and the risk of divergence (e.g., if the TotalObjects counting bug above is fixed in one but not the other).


// loadIndex loads the index from disk using singleflight for efficiency.
func (s *LocalDirV1) loadIndex(catalog string) (*index, error) {
s.m.RLock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Efficiency — RLock held across singleflight disk I/O

loadIndex acquires RLock here and holds it through the entire singleflight.Do (line 379), which opens and JSON-decodes the index file. All concurrent callers sharing the singleflight also hold RLocks while waiting for the leader to finish.

During this time, Store() cannot acquire the write lock. For a large index or slow disk, this creates unbounded reader starvation of writers. Consider acquiring the RLock only for the path computation, or using a separate lock for the singleflight.


c := &queryComplexity{
fragments: make(map[string]*ast.FragmentDefinition),
visited: make(map[string]bool),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor bug — fragment visited map shared across operations

The visited map is created once per ValidateQueryComplexity call and shared across all OperationDefinitions in the document (line 40–46). If two operations both spread the same fragment (query A { ...F } query B { ...F }), validating op A marks fragment F as visited (line 83). Validating op B finds F already visited (line 80) and skips counting its fields.

This under-counts complexity for multi-operation documents with shared fragments. Low impact since clients typically send one operation per request, but the fix is simple: reset visited per operation.

// serializableFieldInfo is a JSON-friendly representation of FieldInfo
type serializableFieldInfo struct {
Name string `json:"name"`
OriginalName string `json:"originalName"`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structural — serializable shadow hierarchy adds maintenance burden

serializableFieldInfo/serializableSchemaInfo/serializableCatalogSchema plus 6 converter functions (~130 lines) exist because FieldInfo stores non-JSON-serializable reflect.Kind and graphql.Type.

But both are derivable: reflect.Kind round-trips through a string, and graphql.Type is already reconstructed from JSONType + IsArray via jsonTypeToGraphQL. If FieldInfo stored a string enum for JSONType and dropped GraphQLType (reconstructing it lazily in BuildDynamicGraphQLSchema), this entire shadow hierarchy collapses into plain json.Marshal/Unmarshal with struct tags.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants