[core] Support snapshot-based sequence ordering for primary-key tables by JunRuiLee · Pull Request #7832 · apache/paimon

JunRuiLee · 2026-05-12T15:06:20Z

Purpose

Tests

SchemaValidationTest#testSnapshotSequenceOrderingHappyPath
SchemaValidationTest#testSnapshotSequenceOrderingRejectsSequenceField
SchemaValidationTest#testSnapshotSequenceOrderingRejectsNonPkTable
KeyValueWithLevelNoReusingSerializerSnapshotIdTest#testRoundTripWithSnapshotId
KeyValueWithLevelNoReusingSerializerSnapshotIdTest#testRoundTripWithoutSnapshotId
SortMergeSnapshotOrderingTest#testLaterSnapshotWinsOverHigherSequence
SortMergeSnapshotOrderingTest#testFallsBackToSequenceWhenSnapshotMissing
SortMergeSnapshotOrderingTest#testSameSnapshotFallsBackToSequence
SortMergeSnapshotOrderingTest#testStampedAlwaysBeatsUnstamped
PrimaryKeySimpleTableTest#testSnapshotSequenceOrdering
PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingFallsBackToSequenceWithinSnapshot
PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingCompactionPreservesInputSnapshotId
PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingWithChangelogInput
PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingWithChangelogLookup
PrimaryKeySimpleTableTest#testSnapshotSequenceOrderingDeleteFromLaterSnapshot

JunRuiLee · 2026-05-13T03:46:53Z

Hi @JingsongLi, could you help take a look? Many thanks.

JunRuiLee · 2026-05-14T07:07:58Z

Thanks @leaves12138 for the review! Fixed the compaction ordering issue by persisting per-record snapshotId through _SEQUENCE_NUMBER column. Added tests for the scenario you described. Old constructor removed.

PTAL, Thanks!

leaves12138

Thanks for the update. I took another careful pass over the snapshot-ordering implementation. I think there are still a few correctness issues to address before this can be safely merged.

JunRuiLee · 2026-05-18T07:13:52Z

Thanks @leaves12138 for the careful review.

I fixed the first two correctness issues:

PartialUpdateMergeFunction and AggregateMergeFunction now preserve the winning input record’s snapshotId when returning a newly built KeyValue, so compaction no longer stamps UNKNOWN_SNAPSHOT_ID into
_SEQUENCE_NUMBER.
KeyValueBuffer now preserves snapshotId when snapshot ordering is enabled, so lookup compaction buffer spill does not lose it during binary serialization/deserialization.

I also added regression coverage for merge-function snapshotId preservation, partial-update compaction, aggregate compaction, and lookup buffer spill.

For the ALTER TABLE concern: this option is annotated as immutable, so enabling it via ALTER on a table with existing snapshots is rejected; empty-table ALTER remains allowed.

leaves12138

One remaining edge case around partial-update snapshot ordering.

leaves12138 · 2026-05-19T02:18:09Z

This feature is more complex than it is expected. I can't figure out more corner case, but there must exist some cases we have not catch. This looks good to me, but it need more experts to review this feature. @JingsongLi

JunRuiLee · 2026-05-19T05:00:43Z

@leaves12138 Thanks for the review! I've reorganized the commits to make the progression clearer.

JunRuiLee · 2026-05-22T14:39:45Z

Hi @JingsongLi, could you please take a look? Thanks so much.

JingsongLi

Review: snapshot-based sequence ordering for primary-key tables

Well-structured feature with comprehensive test coverage. A few observations and potential issues:

Design

Repurposing minSequenceNumber/maxSequenceNumber in DataFileMeta -- Clever hack that avoids a schema evolution in the manifest format, but it makes these fields semantically overloaded. Any existing or future code that reads minSequenceNumber() with its original semantics (e.g., snapshot expiration, incremental reads) could silently misinterpret the value when snapshot-ordering is enabled. The Javadoc addition helps, but consider adding a helper like DataFileMeta.commitSnapshotId(CoreOptions) to make call-sites self-documenting and less error-prone.
Threading boolean snapshotSequenceOrdering through ~10 constructors -- This increases coupling and makes signature changes expensive. A lighter alternative would be to encapsulate the ordering strategy in a Comparator<KeyValue> factory (or reuse/extend UserDefinedSeqComparator) created once in MergeTreeCompactManagerFactory and passed downstream, rather than a bare boolean. Not blocking, but worth considering for maintainability.

Correctness

Loss of within-snapshot secondary ordering after compaction -- In MergeTreeCompactRewriter.stampSequenceWithSnapshotId, you overwrite per-record sequenceNumber with snapshotId. After a compacted file is read back (recoverSnapshotIdFromSequence = true), sequenceNumber still holds the snapshotId value (line ~289 of KeyValueDataFileRecordReader), so the original within-snapshot sequence is lost. This is safe today because compaction already resolved same-key conflicts, but if a future merge path re-merges records from the same snapshot across compacted and un-compacted files with different keys sharing a snapshot, the tiebreaker is effectively random (all records show the same "sequence"). Worth documenting this invariant explicitly.
PartialUpdateMergeFunction snapshot propagation with field-level sequence groups -- latestSnapshotId always tracks the globally-latest input's snapshotId, but when fieldSeqComparators is non-empty, individual fields may be retained from older inputs. The result's snapshotId could overstate recency for some columns. This is analogous to the existing latestSequenceNumber behavior so it is consistent, but worth a comment acknowledging the limitation.
FileSource absent case -- In KeyValueFileReaderFactory.createRecordReader:
```
boolean recoverSnapshotIdFromSequence =
    snapshotSequenceOrdering
        && file.fileSource().isPresent()
        && file.fileSource().get() == FileSource.COMPACT;
```
If fileSource() is empty (e.g., files from older table versions before file-source tracking was added), the code falls through to the uniform file-level stamp path. If those legacy files were written before snapshot-ordering was enabled, minSequenceNumber still carries the original sequence range, not a snapshot ID. The reader would then interpret that as a snapshotId, which could cause incorrect merge ordering. Is there a guard against enabling this option on a table that already has existing data files without file-source metadata?

Nits

In SortMergeReaderWithMinHeap, the snapshot comparison is placed inside the else branch of if (userDefinedSeqComparator != null). This means snapshotSequenceOrdering is silently ignored when a user-defined seq comparator is present. The schema validation rejects sequence.field + snapshot-ordering together, so it should be unreachable, but an assert or early-fail would make this safer against future relaxation of validation.
The SchemaValidation overload validateTableSchema(schema, dynamicOptionKeys) is public with no Javadoc. Since it gates security-sensitive behavior (allowing write-only bypass for compact jobs), documenting its contract would help future maintainers.

Testing

The test matrix is thorough -- covering deduplicate, partial-update, aggregate, changelog-input, changelog-lookup, multi-round compaction, and ordering reversal scenarios. The MergeFunctionSnapshotIdTest is a great regression safety net. One gap: there is no test with SortEngine.LOSER_TREE specifically exercising the snapshot ordering path in an integration test (though SortMergeSnapshotOrderingTest covers both engines at the unit level via @EnumSource).

Overall: solid implementation for the targeted use case. The main risk area is the semantic overloading of minSequenceNumber/maxSequenceNumber and potential interaction with legacy files.

JunRuiLee · 2026-05-23T18:13:55Z

Thanks @JingsongLi for the detailed review.

I addressed the actionable safety/documentation/test items in this update:

Added documentation for snapshot-ordering compaction: compacted records intentionally store snapshot id in the sequence field, so within-snapshot secondary ordering is not preserved after compaction. This is
safe under the current invariant because compaction has already resolved same-key conflicts.
Added a comment in PartialUpdateMergeFunction for field-level sequence groups: the output row has one row-level snapshot id, which may overstate recency for fields retained from older inputs. This matches
the existing latestSequenceNumber behavior.
Added a fail-fast guard when snapshot-ordering reads a file without fileSource metadata. sequence.snapshot-ordering is immutable and should only be enabled for newly-created/empty tables, so legacy files
should not reach this path.
Added defensive checks that sequence.snapshot-ordering is not used together with a user-defined sequence comparator near the sort/lookup comparator creation paths.
Added Javadoc for validateTableSchema(schema, dynamicOptionKeys) to document the dynamic write-only override contract for dedicated compaction jobs.
The existing integration tests use the default LOSER_TREE; I added a MIN_HEAP variant so both sort engines are covered.

I did not include the broader comparator-strategy refactor in this PR because it would touch several merge paths and make the correctness-focused review larger. I added fail-fast checks and documentation around the current paths instead, and I think the strategy refactor is better handled as a follow-up cleanup.

…lidation

Snapshot ordering relies on snapshot id to determine record order, but inline compaction happens before snapshot creation — files have not been stamped with the correct snapshot id yet. Compaction would propagate incorrect values to KV records. Enforce write-only=true at schema validation time so that writers use NoopCompactManager. Dedicated compact jobs override write-only at runtime via table.copy(), which passes dynamicOptionKeys to skip this specific check.

…ders

…tation for snapshot ordering Add fail-fast guard for legacy files without fileSource metadata, defensive assertions for mutually exclusive config options, Javadoc for within-snapshot ordering loss and SchemaValidation overload, and MIN_HEAP sort engine integration test coverage.

JunRuiLee force-pushed the snapshot-ordering-v2 branch from 36b0eaf to 2c737da Compare May 13, 2026 02:35

leaves12138 reviewed May 14, 2026

View reviewed changes

Comment thread paimon-core/src/main/java/org/apache/paimon/io/KeyValueDataFileRecordReader.java Outdated

leaves12138 reviewed May 14, 2026

View reviewed changes

Comment thread paimon-core/src/main/java/org/apache/paimon/operation/FileStoreCommitImpl.java Outdated

leaves12138 reviewed May 14, 2026

View reviewed changes

Comment thread paimon-core/src/main/java/org/apache/paimon/operation/FileStoreCommitImpl.java Outdated

JunRuiLee force-pushed the snapshot-ordering-v2 branch from 09ba5c9 to 37cc344 Compare May 14, 2026 07:35

leaves12138 requested changes May 18, 2026

View reviewed changes

Comment thread paimon-core/src/main/java/org/apache/paimon/schema/SchemaValidation.java

Comment thread ...n-core/src/main/java/org/apache/paimon/mergetree/compact/LookupMergeTreeCompactRewriter.java

Comment thread paimon-api/src/main/java/org/apache/paimon/CoreOptions.java

leaves12138 reviewed May 18, 2026

View reviewed changes

Comment thread paimon-core/src/main/java/org/apache/paimon/mergetree/compact/PartialUpdateMergeFunction.java

JunRuiLee force-pushed the snapshot-ordering-v2 branch from 973e270 to 1b503a7 Compare May 18, 2026 11:46

leaves12138 reviewed May 18, 2026

View reviewed changes

Comment thread paimon-core/src/main/java/org/apache/paimon/operation/FileStoreCommitImpl.java

JunRuiLee force-pushed the snapshot-ordering-v2 branch from 18c93e8 to 9e16681 Compare May 19, 2026 04:59

JunRuiLee force-pushed the snapshot-ordering-v2 branch 3 times, most recently from b38713d to 04192f4 Compare May 22, 2026 14:24

JingsongLi reviewed May 23, 2026

View reviewed changes

JunRuiLee force-pushed the snapshot-ordering-v2 branch from 3c49861 to 9f30b95 Compare May 24, 2026 02:25

JunRuiLee added 8 commits May 24, 2026 10:46

[core] Add sequence.snapshot-ordering option with mutual-exclusion va…

ef32b1e

…lidation

[core] Add snapshotId on KeyValue for snapshot-based merge ordering

7ed0b2c

[core] Use commit snapshot id as primary tiebreaker in sort-merge rea…

869a94d

…ders

[core] Stamp snapshotId on read and assign at commit time

2e62c22

[core] Preserve snapshotId through merge functions, spill and compaction

7c0ce23

[core] End-to-end tests for snapshot-based sequence ordering

a97f034

[docs] Add snapshot ordering documentation.

fb5ac2d

JunRuiLee force-pushed the snapshot-ordering-v2 branch from 9f30b95 to c88356e Compare May 24, 2026 02:50

Conversation

JunRuiLee commented May 12, 2026

Purpose

Tests

Uh oh!

JunRuiLee commented May 13, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JunRuiLee commented May 14, 2026

Uh oh!

leaves12138 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JunRuiLee commented May 18, 2026

Uh oh!

leaves12138 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

leaves12138 commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JunRuiLee commented May 19, 2026

Uh oh!

JunRuiLee commented May 22, 2026

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

Review: snapshot-based sequence ordering for primary-key tables

Design

Correctness

Nits

Testing

Uh oh!

JunRuiLee commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

leaves12138 commented May 19, 2026 •

edited

Loading