
fix(firo): make Spark anon-set sync resumable and crash-safe #1298

Open
reubenyap wants to merge 2 commits into cypherstack:staging from reubenyap:fix/spark-anon-set-middle-path

Conversation

@reubenyap

Problem

runFetchAndUpdateSparkAnonSetCacheForGroupId accumulates every sector of a group's anonymity set in an in-memory Dart List and only writes to SQLite after the full delta has been fetched. A network drop, app kill, or power loss mid-download discards the in-memory buffer and leaves no persisted progress, so the next sync attempt recomputes the same meta.size - prevSize delta and re-downloads everything. For a large first sync this re-downloads the whole set repeatedly.

Additionally, a related design — persisting partial sectors directly to SparkSet with a complete flag but not filtering reader queries on it — has a second failure mode: a spend initiated while a sync is running can observe a half-populated anonymity set and hand libspark a setHash-mismatched set, failing the membership proof.

Fix

Per-sector commits against an in-progress SparkSet row that stays invisible to readers (complete = 0) until a strict integrity check passes, at which point a single atomic UPDATE flips it to complete = 1. Partial data is never observable by readers, and the next sync resumes from the count of already-linked coins.
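The gating scheme can be sketched by replaying its SQL against an in-memory sqlite3 database, as the test plan below does. Column names follow this PR's schema; the table is trimmed to the columns relevant here and the helper is illustrative, not the actual Dart code:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE SparkSet (
    id INTEGER PRIMARY KEY,
    blockHash TEXT NOT NULL,
    setHash TEXT NOT NULL,
    groupId INTEGER NOT NULL,
    size INTEGER NOT NULL,
    complete INTEGER NOT NULL DEFAULT 0
);
""")

# Writer creates an in-progress row with complete = 0; readers filter on
# complete = 1, so partial data stays invisible until the flip below.
con.execute(
    "INSERT INTO SparkSet (blockHash, setHash, groupId, size, complete) "
    "VALUES ('aa', 'hh', 1, 100, 0)"
)
con.commit()

def visible_sets(group_id):
    return con.execute(
        "SELECT id FROM SparkSet WHERE groupId = ? AND complete = 1",
        (group_id,),
    ).fetchall()

assert visible_sets(1) == []  # in-progress row is hidden from readers

# Single atomic UPDATE finalizes the set once the integrity check passes.
con.execute("UPDATE SparkSet SET complete = 1 WHERE groupId = 1")
con.commit()
assert visible_sets(1) == [(1,)]  # now observable
```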

Design

  • Schema: adds SparkSet.complete (integrity gate) and SparkSetCoins.orderKey (server-side delta index, for ordering preservation), plus a UNIQUE INDEX on SparkSetCoins(setId, coinId) so per-sector INSERT OR IGNORE dedupes correctly under crash-recovery replay.
  • Migration: idempotent, runs on every open, uses explicit PRAGMA table_info / sqlite_master presence checks rather than try/catch around ALTER (so unrelated SQLite errors don't get silently swallowed). Defensive dedup of any legacy duplicate (setId, coinId) rows before creating the UNIQUE index. Existing SparkSet rows default to complete = 1 (the pre-fix writer was all-or-nothing, so they represent finalized syncs); existing SparkSetCoins.orderKey defaults to 0 and the reader's ssc.id ASC tiebreaker reproduces the pre-fix layout byte-for-byte.
  • Writer: three focused functions — per-sector append, mark-complete (gated on COUNT(SparkSetCoins) == expectedDelta), and delete-incomplete (for blockHash shift / corruption reset). Each runs in its own SQLite transaction.
  • Reader: all anon-set queries gain WHERE ss.complete = 1. Coin ordering reconstructed via ORDER BY ss.id ASC, ssc.orderKey DESC, ssc.id ASC — matches pre-fix end-to-end behavior for both pre-migration rows (orderKey tied at 0, PK tiebreaker) and post-migration rows (orderKey meaningful).
  • Coordinator: resumable loop reads any in-progress row, verifies its blockHash and setHash match the current server meta, resumes at COUNT(SparkSetCoins) if they do, discards and restarts if they don't. Sector-size cross-check refuses to persist a partial/over-full server response. Reorg-shrink and blockHash-advanced-without-new-coins cases are handled explicitly to avoid corrupting finalized state.
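The per-sector append and resume cursor from the bullets above can be sketched the same way. Table and column names come from this PR's schema; the index name and helper functions are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE SparkSetCoins (
    id INTEGER PRIMARY KEY,
    setId INTEGER NOT NULL,
    coinId INTEGER NOT NULL,
    orderKey INTEGER NOT NULL DEFAULT 0
);
CREATE UNIQUE INDEX idx_set_coin ON SparkSetCoins (setId, coinId);
""")

def append_sector(set_id, coins, cursor):
    # One SQLite transaction per sector; INSERT OR IGNORE plus the UNIQUE
    # index makes crash-recovery replay of a committed sector a no-op.
    with con:
        con.executemany(
            "INSERT OR IGNORE INTO SparkSetCoins (setId, coinId, orderKey) "
            "VALUES (?, ?, ?)",
            [(set_id, c, cursor + i) for i, c in enumerate(coins)],
        )

def resume_cursor(set_id):
    # The next sync resumes at the count of already-linked coins.
    return con.execute(
        "SELECT COUNT(*) FROM SparkSetCoins WHERE setId = ?", (set_id,)
    ).fetchone()[0]

append_sector(1, [10, 11, 12], cursor=0)
append_sector(1, [10, 11, 12], cursor=0)  # simulated replay after a crash
assert resume_cursor(1) == 3              # deduped; cursor unchanged
```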

Integrity guarantees

| Scenario | Behavior |
| --- | --- |
| Crash mid-sector | SQLite transaction atomicity keeps the cache consistent with the last committed sector. Resume cursor derived from COUNT(SparkSetCoins). |
| Crash mid-finalize | The complete=1 flip either lands or it doesn't; there is no intermediate state. Either way, the next sync resumes correctly. |
| BlockHash shift between attempts | In-progress row discarded; finalized rows untouched. Partial work re-downloaded at the new blockHash. |
| Server returns wrong sector size | Abort before persisting. Cursor unchanged. Retry on next sync. |
| Server reports smaller size than cached (reorg) | Skip sync, preserve cached state. |
| Server reports same blockHash with different size (anomaly) | Skip sync, refuse to corrupt the already-finalized row. |
| Finalize integrity check mismatches | Roll back; row stays complete=0. Next sync observes the over-linked row and resets. |
| Spend initiated during sync | Reader's complete=1 filter keeps in-progress data invisible. Spend sees the last finalized state. |

Upgrade path

Fully automatic — no manual cache clear, no wallet rescan, no user action. The migration runs on first open of the new version:

  1. ALTER TABLE SparkSet ADD COLUMN complete INTEGER NOT NULL DEFAULT 1 (legacy rows are finalized).
  2. ALTER TABLE SparkSetCoins ADD COLUMN orderKey INTEGER NOT NULL DEFAULT 0.
  3. Defensive dedup, then CREATE UNIQUE INDEX.
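The migration steps above can be sketched with explicit presence checks instead of try/except around ALTER, as the design section describes. Table trimmed to the migrated columns; the index name is an assumption for illustration:

```python
import sqlite3

def migrate(con):
    # Presence checks via PRAGMA table_info / sqlite_master, so unrelated
    # SQLite errors are not silently swallowed. Safe to run on every open.
    cols = {r[1] for r in con.execute("PRAGMA table_info(SparkSet)")}
    if "complete" not in cols:
        # Legacy rows default to finalized (the pre-fix writer was
        # all-or-nothing).
        con.execute(
            "ALTER TABLE SparkSet "
            "ADD COLUMN complete INTEGER NOT NULL DEFAULT 1"
        )
    cols = {r[1] for r in con.execute("PRAGMA table_info(SparkSetCoins)")}
    if "orderKey" not in cols:
        con.execute(
            "ALTER TABLE SparkSetCoins "
            "ADD COLUMN orderKey INTEGER NOT NULL DEFAULT 0"
        )
    indexes = {
        r[0] for r in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'index'"
        )
    }
    if "idx_spark_set_coins_unique" not in indexes:
        # Defensive dedup of legacy duplicates: keep MIN(id) per group.
        con.execute(
            "DELETE FROM SparkSetCoins WHERE id NOT IN "
            "(SELECT MIN(id) FROM SparkSetCoins GROUP BY setId, coinId)"
        )
        con.execute(
            "CREATE UNIQUE INDEX idx_spark_set_coins_unique "
            "ON SparkSetCoins (setId, coinId)"
        )
    con.commit()

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE SparkSet (id INTEGER PRIMARY KEY);
CREATE TABLE SparkSetCoins (id INTEGER PRIMARY KEY,
                            setId INTEGER, coinId INTEGER);
""")
migrate(con)
migrate(con)  # idempotent: second run is a no-op
assert "complete" in {r[1] for r in con.execute("PRAGMA table_info(SparkSet)")}
```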

Verified against a realistic pre-migration DB (two finalized syncs, five link rows): reader output is byte-for-byte identical before and after migration; a subsequent post-migration incremental sync produces the expected hybrid layout.

Review strategy

The change is split into two commits with clear separation of concerns, so a reviewer can verify them independently:

1. 1d543d8 — schema + migration (+126 −5, 2 files)

Purely additive. Existing reader/writer/coordinator are untouched, so behavior for anyone running only this commit is unchanged. Verify the schema design and migration correctness in isolation before looking at any logic.

2. 90cd308 — behavioral fix (+424 −77, 4 files)

Writer, worker, reader filter/ordering, and coordinator rewrite. Public FiroCacheCoordinator signatures are preserved, so no callers in firo_wallet.dart, spark_interface.dart, or UI views need changes.

Test plan

The development environment didn't have the Flutter/Dart toolchain available, so I verified the change by replaying the writer's exact SQL against an in-memory sqlite3 CLI instance across 18 scenarios:

  • First sync from scratch, single sector
  • Incremental sync (finalized + delta)
  • Crash mid-download, resume from committed cursor
  • BlockHash shift between resume attempts → discard in-progress, restart
  • Integrity gate rejects under-count finalize → row stays hidden
  • Idempotent per-sector replay → no duplicates
  • Cross-set coin reuse → reader filters correctly, no duplicates
  • End-to-end coin ordering matches pre-fix byte-for-byte
  • Pre-migration rows (orderKey=0 tied) read correctly via PK tiebreaker
  • Migration is idempotent (runs safely on every open)
  • Duplicate (setId, coinId) dedup keeps MIN(id) per group before CREATE UNIQUE INDEX
  • Same-blockHash/different-size anomaly skipped
  • Writer refuses to append to already-finalized row
  • Empty-delta (blockHash advanced without new coins) skipped without writing same-size row
  • Reorg-style shrink skipped
  • External callers (firo_wallet.dart, spark_interface.dart, UI) use unchanged public signatures
  • Needs reviewer to run: flutter analyze / dart analyze pass
  • Needs reviewer to run: existing test suites still pass

🤖 Generated with Claude Code

@reubenyap
Author

Self-review — findings after cold adversarial read

Ran a second pass with a skeptical Dart/SQLite reviewer agent to catch things I missed. Below are all findings from both passes, triaged by actionability. TL;DR: no bugs found that this PR needs to fix before merge. Details:


False positives (analyzed and dismissed)

Claim: INSERT OR IGNORE on SparkSetCoins silently drops duplicates, wedging the sync.

The scenario requires the server to return the same coin at two different indices for a given blockHash. The Firo daemon's GetCoinsForRecovery in src/spark/state.cpp:1617-1674 walks blocks backward from the target blockHash and iterates block->sparkMintedCoins[id] (a vector of coins minted in that block). A counter variable is incremented once per coin, and a coin can only appear in one block's sparkMintedCoins (a mint has exactly one confirmation block). So within any single sector request, and across sectors of the same sync (which all use the same pinned blockHash), each coin has exactly one counter value — no duplicates.

For cross-sync consistency: the blockHash argument pins the server's index mapping (the walk starts by scrolling back from coinGroup.lastBlock to blockHash, then indexes from there). A reorg that invalidates our blockHash causes the server to throw "Incorrect blockHash provided", which our coordinator surfaces as a sync failure; on next sync the blockHash mismatch is detected and the in-progress row is discarded. No silent data mixing.

The INSERT OR IGNORE on SparkSetCoins is therefore belt-and-suspenders; switching to plain INSERT would fail loudly in the impossible case, but would also add risk (new failure modes from server hiccups that currently dedupe silently). Not changing.

Claim: Redundant coinGroupID assignment.

spark_interface.dart:608 and :614 both set setData["coinGroupID"] = i. Pre-existing, not touched by this PR. Worth cleaning up separately.

Claim: Reader ordering is only newest-first within a set, not across sets.

Traced end-to-end with the delta-storage model. For sets 1 (size 100 at H1) and 2 (delta 50 at H2=150):

  • Set 1 orderKeys 0..99 = H1-indices 0..99 = at H2 these map to indices 50..149.
  • Set 2 orderKeys 0..49 = H2-indices 0..49.
  • Reader emits [set1 orderKey 99..0, set2 orderKey 49..0] → H2-indices [149..50, 49..0].
  • Dart .reversed → H2-indices [0..49, 50..149] = server's newest-first at H2.

End-to-end order is correct across sets. The comment in the reader accurately describes the behavior.
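The trace above can be reproduced mechanically at a smaller scale (4 + 2 coins instead of 100 + 50; schema trimmed to the columns the ordering touches):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE SparkSet (id INTEGER PRIMARY KEY);
CREATE TABLE SparkSetCoins (
    id INTEGER PRIMARY KEY,
    setId INTEGER NOT NULL,
    coinId INTEGER NOT NULL,
    orderKey INTEGER NOT NULL
);
""")
con.execute("INSERT INTO SparkSet (id) VALUES (1), (2)")
# Set 1: 4 coins at H1 (orderKeys 0..3 = H1-indices 0..3).
# Set 2: delta of 2 coins at H2 (orderKeys 0..1 = H2-indices 0..1).
con.executemany(
    "INSERT INTO SparkSetCoins (setId, coinId, orderKey) VALUES (?, ?, ?)",
    [(1, c, k) for k, c in enumerate([100, 101, 102, 103])]
    + [(2, c, k) for k, c in enumerate([200, 201])],
)

rows = con.execute("""
    SELECT ssc.coinId FROM SparkSetCoins ssc
    JOIN SparkSet ss ON ss.id = ssc.setId
    ORDER BY ss.id ASC, ssc.orderKey DESC, ssc.id ASC
""").fetchall()
ordered = [r[0] for r in rows]
# Reader emits [set1 orderKey 3..0, set2 orderKey 1..0]:
assert ordered == [103, 102, 101, 100, 201, 200]
# Dart-side .reversed yields the server's newest-first layout at H2:
assert list(reversed(ordered)) == [200, 201, 100, 101, 102, 103]
```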


Pre-existing issues not introduced by this PR

Concurrency: main-isolate reads vs worker-isolate writes on the same SQLite file.

The sqlite3 package opens in rollback-journal mode by default; two handles on the same file can contend on the database-level lock. In my sync flow, main-isolate reads happen before the first worker task is dispatched, so the sync itself doesn't contend. But UI-initiated reads (getSparkCacheSize, getSetCoinsForGroupId from a spend) can race with a worker write and hit SQLITE_BUSY. This is pre-existing and my per-sector writes increase the exposure window (more, shorter transactions rather than one long one). A follow-up that enables PRAGMA journal_mode=WAL on open would resolve this broadly — worth considering in a separate PR.
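The WAL follow-up can be sketched with two connections on one file; in WAL mode the reader proceeds against the last committed snapshot while the writer holds a write transaction, which is the access pattern where the default rollback journal can surface SQLITE_BUSY:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "cache.sqlite")

writer = sqlite3.connect(path, isolation_level=None)  # manual transactions
# journal_mode is a persistent, per-database setting; the PRAGMA returns
# the mode now in effect.
assert writer.execute("PRAGMA journal_mode=WAL").fetchone()[0] == "wal"
writer.execute("CREATE TABLE SparkSet (id INTEGER PRIMARY KEY)")

writer.execute("BEGIN IMMEDIATE")  # hold the write lock
writer.execute("INSERT INTO SparkSet (id) VALUES (1)")

reader = sqlite3.connect(path)
# The reader does not block on the in-flight write and sees the last
# committed snapshot (no rows yet).
assert reader.execute("SELECT COUNT(*) FROM SparkSet").fetchone()[0] == 0

writer.execute("COMMIT")
assert reader.execute("SELECT COUNT(*) FROM SparkSet").fetchone()[0] == 1
```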

TOCTOU between getSetCoinsForGroupId and getLatestSetInfoForGroupId in spark_interface.dart.

spark_interface.dart:589 then :597 reads coins, then reads meta. Between the two awaits a concurrent sync could finalize a new SparkSet row. The read coins would be from the older set but info.setHash would be from the newer set, producing a hash mismatch in libspark. Pre-existing race. Worth fixing in a follow-up (either guard with the set lock or re-read consistency).

PRAGMA foreign_keys is not explicitly enabled.

The FK declarations on SparkSetCoins are decorative in rollback-journal mode with default pragmas. Pre-existing. _deleteIncompleteSparkSetsForGroup and _deleteAllCache both delete in the correct order (children before parents), so this doesn't bite today, but a future writer that deletes SparkSet without first clearing SparkSetCoins would leave dangling references silently.
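A small demonstration of why the declarations are decorative without the pragma (SQLite leaves foreign-key enforcement off by default, per connection):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
con.executescript("""
CREATE TABLE SparkSet (id INTEGER PRIMARY KEY);
CREATE TABLE SparkSetCoins (
    id INTEGER PRIMARY KEY,
    setId INTEGER NOT NULL REFERENCES SparkSet (id)
);
""")
con.execute("INSERT INTO SparkSet (id) VALUES (1)")
con.execute("INSERT INTO SparkSetCoins (setId) VALUES (1)")

try:
    # Deleting the parent while a child still references it now fails
    # loudly instead of leaving a dangling reference.
    con.execute("DELETE FROM SparkSet WHERE id = 1")
    enforced = False
except sqlite3.IntegrityError:
    enforced = True
assert enforced
```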


Notes worth documenting but not changing

Progress-bar callback semantic inconsistency.

Initial call passes (prevSize, meta.size) — absolute ratio. Loop calls pass (cursor, numberOfCoinsToFetch) — delta-relative ratio. For a resumed sync with prevSize=0 and in-progress linked=1200 of 1500, the UI will display 0% initially and then jump to 100% after the first remaining sector. Pre-existing inconsistency in the callback shape; my PR preserves the shape to avoid changing caller expectations. A cleaner fix is (prevSize + linkedSoFar, meta.size) on the initial call, but that changes semantics.

DEFAULT 1 asymmetry between migration and fresh-DB CREATE TABLE.

Migrated DBs get complete INTEGER NOT NULL DEFAULT 1 (legacy rows are finalized). Fresh DBs get complete INTEGER NOT NULL DEFAULT 0. Any future writer that omits the complete column on INSERT will behave differently between the two paths. My writer always specifies complete = 0 explicitly, so this isn't a live hazard — but a footgun for future code.

Orphan SparkCoin rows on blockHash shift.

_deleteIncompleteSparkSetsForGroup cleans SparkSet and SparkSetCoins but leaves the underlying SparkCoin rows. On a subsequent sync at a new blockHash those coins are either re-linked (if still in the delta window, which they always are for normal sync patterns) or remain orphaned. Bounded disk-space impact. Could be swept by a GC pass.

_getIncompleteSetForGroupId LIMIT 1 masks multiple-incomplete-rows.

In pathological scenarios (multiple aborted syncs at different blockHashes) several incomplete rows could coexist. My query returns ORDER BY id DESC LIMIT 1 (newest) deterministically. The coordinator discards all mismatching incomplete rows via _deleteIncompleteSparkSetsForGroup when it sees a mismatch, so orphans are cleaned up on the next mismatched sync. Not an indefinite leak.


Testing asks for the reviewer

Three things this environment couldn't run:

  1. flutter analyze / dart analyze — to catch any Dart-level issues I missed.
  2. Existing test suite in test/ — none touch firo_cache_*, but the wallet-level tests may exercise the sync path.
  3. End-to-end regression test on a real testnet: sync, kill mid-download, restart, confirm resume. I simulated this in sqlite3 but a live run would exercise the isolate + electrumx + libspark integration.

Happy to address any of the above findings (e.g., enabling WAL, tightening the spark_interface race) in this PR or as follow-ups — let me know the preference.

🤖 Self-review generated with Claude Code

@reubenyap force-pushed the fix/spark-anon-set-middle-path branch from 748e6a3 to 016fb4b on April 27, 2026 at 16:15
@reubenyap force-pushed the fix/spark-anon-set-middle-path branch from 016fb4b to bb300cd on April 27, 2026 at 16:30