✨ catalogd graphql shift to file-based cache#2732
Conversation
✅ Deploy Preview for olmv1 ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Pull request overview
This PR shifts catalogd’s GraphQL implementation from building/caching schemas off in-memory catalog metas to a file-based approach: schema metadata is discovered during Store() and persisted to disk, while query execution loads requested objects from catalog.jsonl on-demand using byte offsets from index.json. It also introduces a query complexity validation helper and updates handlers/tests to use the new GraphQL service interface.
Changes:
- Persist GraphQL schema metadata to
graphql-schema.jsonduring catalog storage and load it from disk for schema building. - Switch query execution to disk-backed object loading using index-based byte offsets (reducing RSS growth from caching parsed objects).
- Add a GraphQL query complexity validation helper and expand concurrency/singleflight tests around cache misses/builds.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/catalogd/storage/localdir.go | Writes graphql-schema.json on store, adds disk-backed object loader, and adjusts GraphQL pre-warm behavior. |
| internal/catalogd/storage/index.go | Exposes schema-section byte ranges for disk-backed GraphQL pagination. |
| internal/catalogd/service/graphql_service.go | Refactors service to use a storage-provided data provider and adds query validation/timeout. |
| internal/catalogd/service/graphql_service_test.go | Updates tests to use the new provider-based GraphQL service and adds singleflight coverage via provider counting. |
| internal/catalogd/server/handlers.go | Updates GraphQL handler to execute queries without needing a catalog FS. |
| internal/catalogd/server/handlers_test.go | Updates mocks/tests for the new GraphQL service interface and error behavior. |
| internal/catalogd/graphql/validation.go | Adds AST-based query complexity validation (depth/aliases/fields). |
| internal/catalogd/graphql/graphql.go | Adds schema serialization/deserialization and switches to loader-based query-time object retrieval. |
| hack/demo/graphql-demo-server/main.go | Updates demo server to build schema once and serve GraphQL directly from the generated schema. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2732 +/- ##
==========================================
+ Coverage 70.54% 71.39% +0.85%
==========================================
Files 143 144 +1
Lines 10617 10888 +271
==========================================
+ Hits 7490 7774 +284
+ Misses 2568 2535 -33
- Partials 559 579 +20
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
ec01fe5 to
374e69e
Compare
f704ad1 to
1abb2b6
Compare
1abb2b6 to
04a2bc0
Compare
Signed-off-by: grokspawn <jordan@nimblewidget.com>
04a2bc0 to
e048b50
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Signed-off-by: grokspawn <jordan@nimblewidget.com>
Signed-off-by: grokspawn <jordan@nimblewidget.com>
ca13bf8 to
77e8ecb
Compare
perdasilva
left a comment
There was a problem hiding this comment.
Code Review — 10 findings (3 correctness, 1 performance, 6 cleanup/structural)
The most impactful fix would be keeping the write lock held through the entire Store() operation (or using per-catalog locks), which would resolve findings 1–3 in one change.
Finding not placeable inline:
Performance — O(N×regex) field resolution on every query (graphql.go:678)
createFieldResolver iterates all source keys calling remapFieldName (3 regex replacements each) to find the matching key. For a query returning 100 objects with 20 fields, requesting 5 fields: 5 × 100 × 20 × 3 = 30,000 regex operations per query.
FieldInfo.OriginalName is already stored during schema discovery — the resolver could capture it and do a direct O(1) map lookup:
func createFieldResolver(fieldName, originalName string) graphql.FieldResolveFn {
return func(p graphql.ResolveParams) (interface{}, error) {
if source, ok := p.Source.(map[string]interface{}); ok {
if value, ok := source[originalName]; ok {
return marshalComplexValue(value), nil
}
}
return nil, nil
}
}Same issue in createNestedFieldResolver (line 698) and createNestedObjectType resolver (line 653).
🤖 Generated with Claude Code
| if _, err := s.graphqlSvc.GetSchema(ctx, catalog); err != nil { | ||
| s.graphqlSvc.InvalidateCache(catalog) | ||
| s.m.Lock() | ||
| removeErr := os.RemoveAll(catalogDir) |
There was a problem hiding this comment.
Bug — Store() rollback race
After the write lock is released at line 90, a concurrent Store("same-catalog", fsB) can complete its own storeAtomicSwap, replacing the catalog directory with valid fsB data. If this goroutine's GetSchema then fails, os.RemoveAll(catalogDir) here deletes the other goroutine's valid catalog directory — catalogDir is a deterministic path (rootDir/catalogName), not the one this goroutine wrote.
The result: both Stores fail, and the catalog is left in a missing state despite fsB being valid.
Consider keeping the write lock held through the entire Store() operation, or using per-catalog locks.
| s.m.Unlock() | ||
| return err | ||
| } | ||
| s.m.Unlock() |
There was a problem hiding this comment.
Bug — stale cached schema serves queries against replaced catalog file
Between s.m.Unlock() here and InvalidateCache at line 93, a concurrent GraphQL query can hit the old cached schema. That schema's ObjectLoader holds byte offsets from the previous index file. Since storeAtomicSwap already replaced catalog.jsonl with different content, the ObjectLoader reads the new file at old offsets → corrupted/wrong JSON objects.
This is the same root cause as the rollback race: the lock is released too early. Holding the lock through InvalidateCache (or through the entire pre-warm) would close this window.
|
|
||
| storeMetaFuncs := []storeMetasFunc{storeCatalogData} | ||
| if s.EnableMetasHandler { | ||
| if s.EnableMetasHandler || s.EnableGraphQLQueries == GraphQLQueriesEnabled { |
There was a problem hiding this comment.
Bug — ContentExists doesn't verify GraphQL-required files
storeIndexData is now enabled when GraphQL is enabled (even with MetasHandlerDisabled), and discoverAndStoreSchema writes graphql-schema.json. But ContentExists (line 230) only checks index.json when EnableMetasHandler is set — it never checks graphql-schema.json.
If a crash leaves catalog.jsonl intact but graphql-schema.json or index.json missing, ContentExists returns true while GraphQL queries fail with file-not-found from LoadCatalogSchema/loadIndex.
if s.EnableGraphQLQueries == GraphQLQueriesEnabled {
// check index.json and graphql-schema.json too
}| } | ||
|
|
||
| info := catalogSchema.Schemas[meta.Schema] | ||
| info.TotalObjects++ |
There was a problem hiding this comment.
Bug — TotalObjects counts blobs that fail json.Unmarshal
TotalObjects is incremented here before the json.Unmarshal at line 597. If a blob is malformed, TotalObjects includes it but the ObjectLoader (localdir.go:365) silently skips parse failures.
A client using summary.totalObjects for pagination will request beyond the actual data. Move the increment after the successful unmarshal, or count failures separately.
| // FieldInfo represents discovered field information | ||
| type FieldInfo struct { | ||
| Name string | ||
| OriginalName string // The original JSON key before remapping |
There was a problem hiding this comment.
Cleanup — SampleValues (line 34) and SampleObject (line 42) are dead state
OriginalName is a great addition. However, SampleValues is populated during schema discovery (lines 432, 444–448, 477, 490–491) via appendUnique — which JSON-marshals all existing elements on every call — but is never read by any consumer: not by serialization (serializableFieldInfo omits it), not by BuildDynamicGraphQLSchema, not by resolvers, not by PrintCatalogSummary.
Same for SampleObject on SchemaInfo (line 42) — written at lines 562–563 and 603–604, never read.
Removing both fields and appendUnique would eliminate ~40 lines and the per-meta JSON marshaling overhead during every Store operation.
| // DiscoverSchemaFromChannel performs streaming schema discovery, processing | ||
| // one meta at a time through a channel. Each meta's blob is parsed, analyzed, | ||
| // and then goes out of scope — avoiding accumulation of all blobs in memory. | ||
| func DiscoverSchemaFromChannel(metasChan <-chan *declcfg.Meta) (*CatalogSchema, error) { |
There was a problem hiding this comment.
Cleanup — duplicates DiscoverSchemaFromMetas body
This function's body (lines 576–611) is nearly identical to DiscoverSchemaFromMetas (lines 531–571): init CatalogSchema, skip empty schema, ensure map entry, increment TotalObjects, unmarshal blob, set SampleObject, call analyzeJSONObject. Only the iteration source differs (channel vs slice).
A shared processMeta(catalogSchema, meta) helper would reduce both to a ~5-line loop, eliminating ~35 duplicated lines and the risk of divergence (e.g., if the TotalObjects counting bug above is fixed in one but not the other).
|
|
||
| // loadIndex loads the index from disk using singleflight for efficiency. | ||
| func (s *LocalDirV1) loadIndex(catalog string) (*index, error) { | ||
| s.m.RLock() |
There was a problem hiding this comment.
Efficiency — RLock held across singleflight disk I/O
loadIndex acquires RLock here and holds it through the entire singleflight.Do (line 379), which opens and JSON-decodes the index file. All concurrent callers sharing the singleflight also hold RLocks while waiting for the leader to finish.
During this time, Store() cannot acquire the write lock. For a large index or slow disk, this creates unbounded reader starvation of writers. Consider acquiring the RLock only for the path computation, or using a separate lock for the singleflight.
|
|
||
| c := &queryComplexity{ | ||
| fragments: make(map[string]*ast.FragmentDefinition), | ||
| visited: make(map[string]bool), |
There was a problem hiding this comment.
Minor bug — fragment visited map shared across operations
The visited map is created once per ValidateQueryComplexity call and shared across all OperationDefinitions in the document (line 40–46). If two operations both spread the same fragment (query A { ...F } query B { ...F }), validating op A marks fragment F as visited (line 83). Validating op B finds F already visited (line 80) and skips counting its fields.
This under-counts complexity for multi-operation documents with shared fragments. Low impact since clients typically send one operation per request, but the fix is simple: reset visited per operation.
| // serializableFieldInfo is a JSON-friendly representation of FieldInfo | ||
| type serializableFieldInfo struct { | ||
| Name string `json:"name"` | ||
| OriginalName string `json:"originalName"` |
There was a problem hiding this comment.
Structural — serializable shadow hierarchy adds maintenance burden
serializableFieldInfo/serializableSchemaInfo/serializableCatalogSchema plus 6 converter functions (~130 lines) exist because FieldInfo stores non-JSON-serializable reflect.Kind and graphql.Type.
But both are derivable: reflect.Kind round-trips through a string, and graphql.Type is already reconstructed from JSONType + IsArray via jsonTypeToGraphQL. If FieldInfo stored a string enum for JSONType and dropped GraphQLType (reconstructing it lazily in BuildDynamicGraphQLSchema), this entire shadow hierarchy collapses into plain json.Marshal/Unmarshal with struct tags.
Description
This PR proposes a shift in the graphql service endpoint from in-memory to on-disk caching which leverages common architectural conventions like fan-out catalog index generation (graphql-schema.json) which is then accessed to fulfill queries to eliminate RSS increases from the feature initial implementation.
Included is a structured graphql validation package as a focus for future limits enforcement and adds several cold-cache/cache-miss concurrency tests.
Referenced against a diverse set of existing catalogs, this change reduces RSS 39X on average:
The new file-system storage is approximately 200KB/catalog for the new file offsets index data.
Query latency shifts from microsecond access to low millisecond for the same access approach, with a transient per-query allocation between ~1-10MB depending on the size of the object(s) returned in the query.
NB: I ran some benchmarks after investigating repeated failures of the st2ex e2e test, and determined that unbounded FBC evaluation during schema detection phase was causing excessive delays in catalogd service readiness. Instituted a bound of ten (10) samples of a given Meta to determine the appropriate GraphQL schema type, which allows the functionality to complete operation on example catalogs above with a worst-case of 250ms on test hardware. These changes are in a separate commit to help them stand out.
Reviewer Checklist