Skip to content

[core] Support snapshot-based sequence ordering for primary-key tables#7832

Open
JunRuiLee wants to merge 9 commits into
apache:masterfrom
JunRuiLee:snapshot-ordering-v2
Open

[core] Support snapshot-based sequence ordering for primary-key tables#7832
JunRuiLee wants to merge 9 commits into
apache:masterfrom
JunRuiLee:snapshot-ordering-v2

Conversation

@JunRuiLee
Copy link
Copy Markdown
Contributor

Purpose

close #7806

Tests

  • SchemaValidationTest#testSnapshotSequenceOrderingHappyPath
  • SchemaValidationTest#testSnapshotSequenceOrderingRejectsSequenceField
  • SchemaValidationTest#testSnapshotSequenceOrderingRejectsNonPkTable
  • KeyValueWithLevelNoReusingSerializerSnapshotIdTest#testRoundTripWithSnapshotId
  • KeyValueWithLevelNoReusingSerializerSnapshotIdTest#testRoundTripWithoutSnapshotId
  • SortMergeSnapshotOrderingTest#testLaterSnapshotWinsOverHigherSequence
  • SortMergeSnapshotOrderingTest#testFallsBackToSequenceWhenSnapshotMissing
  • SortMergeSnapshotOrderingTest#testSameSnapshotFallsBackToSequence
  • SortMergeSnapshotOrderingTest#testStampedAlwaysBeatsUnstamped
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrdering
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingFallsBackToSequenceWithinSnapshot
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingCompactionPreservesInputSnapshotId
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingWithChangelogInput
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingWithChangelogLookup
  • PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingDeleteFromLaterSnapshot

@JunRuiLee JunRuiLee force-pushed the snapshot-ordering-v2 branch from 36b0eaf to 2c737da Compare May 13, 2026 02:35
@JunRuiLee
Copy link
Copy Markdown
Contributor Author

Hi @JingsongLi, could you help take a look? Many thanks.

Comment thread paimon-core/src/main/java/org/apache/paimon/io/KeyValueDataFileRecordReader.java Outdated
Comment thread paimon-core/src/main/java/org/apache/paimon/operation/FileStoreCommitImpl.java Outdated
Comment thread paimon-core/src/main/java/org/apache/paimon/operation/FileStoreCommitImpl.java Outdated
@JunRuiLee
Copy link
Copy Markdown
Contributor Author

Thanks @leaves12138 for the review! Fixed the compaction ordering issue by persisting per-record snapshotId through _SEQUENCE_NUMBER column. Added tests for the scenario you described. Old constructor removed.

PTAL, Thanks!

@JunRuiLee JunRuiLee force-pushed the snapshot-ordering-v2 branch from 09ba5c9 to 37cc344 Compare May 14, 2026 07:35
Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I took another careful pass over the snapshot-ordering implementation. I think there are still a few correctness issues to address before this can be safely merged.

Comment thread paimon-api/src/main/java/org/apache/paimon/CoreOptions.java
@JunRuiLee
Copy link
Copy Markdown
Contributor Author

Thanks @leaves12138 for the careful review.

I fixed the first two correctness issues:

  1. PartialUpdateMergeFunction and AggregateMergeFunction now preserve the winning input record’s snapshotId when returning a newly built KeyValue, so compaction no longer stamps UNKNOWN_SNAPSHOT_ID into
    _SEQUENCE_NUMBER.
  2. KeyValueBuffer now preserves snapshotId when snapshot ordering is enabled, so lookup compaction buffer spill does not lose it during binary serialization/deserialization.

I also added regression coverage for merge-function snapshotId preservation, partial-update compaction, aggregate compaction, and lookup buffer spill.

For the ALTER TABLE concern: this option is annotated as immutable, so enabling it via ALTER on a table with existing snapshots is rejected; empty-table ALTER remains allowed.

Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One remaining edge case around partial-update snapshot ordering.

@JunRuiLee JunRuiLee force-pushed the snapshot-ordering-v2 branch from 973e270 to 1b503a7 Compare May 18, 2026 11:46
@leaves12138
Copy link
Copy Markdown
Contributor

leaves12138 commented May 19, 2026

This feature is more complex than it is expected. I can't figure out more corner case, but there must exist some cases we have not catch. This looks good to me, but it need more experts to review this feature. @JingsongLi

@JunRuiLee JunRuiLee force-pushed the snapshot-ordering-v2 branch from 18c93e8 to 9e16681 Compare May 19, 2026 04:59
@JunRuiLee
Copy link
Copy Markdown
Contributor Author

@leaves12138 Thanks for the review! I've reorganized the commits to make the progression clearer.

@JunRuiLee JunRuiLee force-pushed the snapshot-ordering-v2 branch 3 times, most recently from b38713d to 04192f4 Compare May 22, 2026 14:24
@JunRuiLee
Copy link
Copy Markdown
Contributor Author

Hi @JingsongLi, could you please take a look? Thanks so much.

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: snapshot-based sequence ordering for primary-key tables

Well-structured feature with comprehensive test coverage. A few observations and potential issues:

Design

  1. Repurposing minSequenceNumber/maxSequenceNumber in DataFileMeta -- Clever hack that avoids a schema evolution in the manifest format, but it makes these fields semantically overloaded. Any existing or future code that reads minSequenceNumber() with its original semantics (e.g., snapshot expiration, incremental reads) could silently misinterpret the value when snapshot-ordering is enabled. The Javadoc addition helps, but consider adding a helper like DataFileMeta.commitSnapshotId(CoreOptions) to make call-sites self-documenting and less error-prone.

  2. Threading boolean snapshotSequenceOrdering through ~10 constructors -- This increases coupling and makes signature changes expensive. A lighter alternative would be to encapsulate the ordering strategy in a Comparator<KeyValue> factory (or reuse/extend UserDefinedSeqComparator) created once in MergeTreeCompactManagerFactory and passed downstream, rather than a bare boolean. Not blocking, but worth considering for maintainability.

Correctness

  1. Loss of within-snapshot secondary ordering after compaction -- In MergeTreeCompactRewriter.stampSequenceWithSnapshotId, you overwrite per-record sequenceNumber with snapshotId. After a compacted file is read back (recoverSnapshotIdFromSequence = true), sequenceNumber still holds the snapshotId value (line ~289 of KeyValueDataFileRecordReader), so the original within-snapshot sequence is lost. This is safe today because compaction already resolved same-key conflicts, but if a future merge path re-merges records from the same snapshot across compacted and un-compacted files with different keys sharing a snapshot, the tiebreaker is effectively random (all records show the same "sequence"). Worth documenting this invariant explicitly.

  2. PartialUpdateMergeFunction snapshot propagation with field-level sequence groups -- latestSnapshotId always tracks the globally-latest input's snapshotId, but when fieldSeqComparators is non-empty, individual fields may be retained from older inputs. The result's snapshotId could overstate recency for some columns. This is analogous to the existing latestSequenceNumber behavior so it is consistent, but worth a comment acknowledging the limitation.

  3. FileSource absent case -- In KeyValueFileReaderFactory.createRecordReader:

    boolean recoverSnapshotIdFromSequence =
        snapshotSequenceOrdering
            && file.fileSource().isPresent()
            && file.fileSource().get() == FileSource.COMPACT;

    If fileSource() is empty (e.g., files from older table versions before file-source tracking was added), the code falls through to the uniform file-level stamp path. If those legacy files were written before snapshot-ordering was enabled, minSequenceNumber still carries the original sequence range, not a snapshot ID. The reader would then interpret that as a snapshotId, which could cause incorrect merge ordering. Is there a guard against enabling this option on a table that already has existing data files without file-source metadata?

Nits

  1. In SortMergeReaderWithMinHeap, the snapshot comparison is placed inside the else branch of if (userDefinedSeqComparator != null). This means snapshotSequenceOrdering is silently ignored when a user-defined seq comparator is present. The schema validation rejects sequence.field + snapshot-ordering together, so it should be unreachable, but an assert or early-fail would make this safer against future relaxation of validation.

  2. The SchemaValidation overload validateTableSchema(schema, dynamicOptionKeys) is public with no Javadoc. Since it gates security-sensitive behavior (allowing write-only bypass for compact jobs), documenting its contract would help future maintainers.

Testing

The test matrix is thorough -- covering deduplicate, partial-update, aggregate, changelog-input, changelog-lookup, multi-round compaction, and ordering reversal scenarios. The MergeFunctionSnapshotIdTest is a great regression safety net. One gap: there is no test with SortEngine.LOSER_TREE specifically exercising the snapshot ordering path in an integration test (though SortMergeSnapshotOrderingTest covers both engines at the unit level via @EnumSource).

Overall: solid implementation for the targeted use case. The main risk area is the semantic overloading of minSequenceNumber/maxSequenceNumber and potential interaction with legacy files.

@JunRuiLee
Copy link
Copy Markdown
Contributor Author

Thanks @JingsongLi for the detailed review.

I addressed the actionable safety/documentation/test items in this update:

  1. Added documentation for snapshot-ordering compaction: compacted records intentionally store snapshot id in the sequence field, so within-snapshot secondary ordering is not preserved after compaction. This is
    safe under the current invariant because compaction has already resolved same-key conflicts.
  2. Added a comment in PartialUpdateMergeFunction for field-level sequence groups: the output row has one row-level snapshot id, which may overstate recency for fields retained from older inputs. This matches
    the existing latestSequenceNumber behavior.
  3. Added a fail-fast guard when snapshot-ordering reads a file without fileSource metadata. sequence.snapshot-ordering is immutable and should only be enabled for newly-created/empty tables, so legacy files
    should not reach this path.
  4. Added defensive checks that sequence.snapshot-ordering is not used together with a user-defined sequence comparator near the sort/lookup comparator creation paths.
  5. Added Javadoc for validateTableSchema(schema, dynamicOptionKeys) to document the dynamic write-only override contract for dedicated compaction jobs.
  6. The existing integration tests use the default LOSER_TREE; I added a MIN_HEAP variant so both sort engines are covered.

I did not include the broader comparator-strategy refactor in this PR because it would touch several merge paths and make the correctness-focused review larger. I added fail-fast checks and documentation around the current paths instead, and I think the strategy refactor is better handled as a follow-up cleanup.

@JunRuiLee JunRuiLee force-pushed the snapshot-ordering-v2 branch from 3c49861 to 9f30b95 Compare May 24, 2026 02:25
JunRuiLee added 8 commits May 24, 2026 10:46
Snapshot ordering relies on snapshot id to determine record order,
but inline compaction happens before snapshot creation — files have
not been stamped with the correct snapshot id yet. Compaction would
propagate incorrect values to KV records.

Enforce write-only=true at schema validation time so that writers
use NoopCompactManager. Dedicated compact jobs override write-only
at runtime via table.copy(), which passes dynamicOptionKeys to skip
this specific check.
…tation for snapshot ordering

Add fail-fast guard for legacy files without fileSource metadata,
defensive assertions for mutually exclusive config options, Javadoc
for within-snapshot ordering loss and SchemaValidation overload,
and MIN_HEAP sort engine integration test coverage.
@JunRuiLee JunRuiLee force-pushed the snapshot-ordering-v2 branch from 9f30b95 to c88356e Compare May 24, 2026 02:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Support snapshot-based sequence ordering for primary-key tables

3 participants