[core] Support snapshot-based sequence ordering for primary-key tables#7832
[core] Support snapshot-based sequence ordering for primary-key tables#7832JunRuiLee wants to merge 9 commits into
Conversation
36b0eaf to
2c737da
Compare
|
Hi @JingsongLi, could you help take a look? Many thanks. |
|
Thanks @leaves12138 for the review! Fixed the compaction ordering issue by persisting per-record snapshotId through _SEQUENCE_NUMBER column. Added tests for the scenario you described. Old constructor removed. PTAL, Thanks! |
09ba5c9 to
37cc344
Compare
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the update. I took another careful pass over the snapshot-ordering implementation. I think there are still a few correctness issues to address before this can be safely merged.
|
Thanks @leaves12138 for the careful review. I fixed the first two correctness issues:
I also added regression coverage for merge-function snapshotId preservation, partial-update compaction, aggregate compaction, and lookup buffer spill. For the ALTER TABLE concern: this option is annotated as immutable, so enabling it via ALTER on a table with existing snapshots is rejected; empty-table ALTER remains allowed. |
leaves12138
left a comment
There was a problem hiding this comment.
One remaining edge case around partial-update snapshot ordering.
973e270 to
1b503a7
Compare
|
This feature is more complex than it is expected. I can't figure out more corner case, but there must exist some cases we have not catch. This looks good to me, but it need more experts to review this feature. @JingsongLi |
18c93e8 to
9e16681
Compare
|
@leaves12138 Thanks for the review! I've reorganized the commits to make the progression clearer. |
b38713d to
04192f4
Compare
|
Hi @JingsongLi, could you please take a look? Thanks so much. |
JingsongLi
left a comment
There was a problem hiding this comment.
Review: snapshot-based sequence ordering for primary-key tables
Well-structured feature with comprehensive test coverage. A few observations and potential issues:
Design
-
Repurposing
minSequenceNumber/maxSequenceNumberinDataFileMeta-- Clever hack that avoids a schema evolution in the manifest format, but it makes these fields semantically overloaded. Any existing or future code that readsminSequenceNumber()with its original semantics (e.g., snapshot expiration, incremental reads) could silently misinterpret the value when snapshot-ordering is enabled. The Javadoc addition helps, but consider adding a helper likeDataFileMeta.commitSnapshotId(CoreOptions)to make call-sites self-documenting and less error-prone. -
Threading
boolean snapshotSequenceOrderingthrough ~10 constructors -- This increases coupling and makes signature changes expensive. A lighter alternative would be to encapsulate the ordering strategy in aComparator<KeyValue>factory (or reuse/extendUserDefinedSeqComparator) created once inMergeTreeCompactManagerFactoryand passed downstream, rather than a bare boolean. Not blocking, but worth considering for maintainability.
Correctness
-
Loss of within-snapshot secondary ordering after compaction -- In
MergeTreeCompactRewriter.stampSequenceWithSnapshotId, you overwrite per-recordsequenceNumberwithsnapshotId. After a compacted file is read back (recoverSnapshotIdFromSequence = true),sequenceNumberstill holds the snapshotId value (line ~289 ofKeyValueDataFileRecordReader), so the original within-snapshot sequence is lost. This is safe today because compaction already resolved same-key conflicts, but if a future merge path re-merges records from the same snapshot across compacted and un-compacted files with different keys sharing a snapshot, the tiebreaker is effectively random (all records show the same "sequence"). Worth documenting this invariant explicitly. -
PartialUpdateMergeFunctionsnapshot propagation with field-level sequence groups --latestSnapshotIdalways tracks the globally-latest input's snapshotId, but whenfieldSeqComparatorsis non-empty, individual fields may be retained from older inputs. The result's snapshotId could overstate recency for some columns. This is analogous to the existinglatestSequenceNumberbehavior so it is consistent, but worth a comment acknowledging the limitation. -
FileSourceabsent case -- InKeyValueFileReaderFactory.createRecordReader:boolean recoverSnapshotIdFromSequence = snapshotSequenceOrdering && file.fileSource().isPresent() && file.fileSource().get() == FileSource.COMPACT;
If
fileSource()is empty (e.g., files from older table versions before file-source tracking was added), the code falls through to the uniform file-level stamp path. If those legacy files were written before snapshot-ordering was enabled,minSequenceNumberstill carries the original sequence range, not a snapshot ID. The reader would then interpret that as a snapshotId, which could cause incorrect merge ordering. Is there a guard against enabling this option on a table that already has existing data files without file-source metadata?
Nits
-
In
SortMergeReaderWithMinHeap, the snapshot comparison is placed inside theelsebranch ofif (userDefinedSeqComparator != null). This meanssnapshotSequenceOrderingis silently ignored when a user-defined seq comparator is present. The schema validation rejectssequence.field+snapshot-orderingtogether, so it should be unreachable, but anassertor early-fail would make this safer against future relaxation of validation. -
The
SchemaValidationoverloadvalidateTableSchema(schema, dynamicOptionKeys)is public with no Javadoc. Since it gates security-sensitive behavior (allowing write-only bypass for compact jobs), documenting its contract would help future maintainers.
Testing
The test matrix is thorough -- covering deduplicate, partial-update, aggregate, changelog-input, changelog-lookup, multi-round compaction, and ordering reversal scenarios. The MergeFunctionSnapshotIdTest is a great regression safety net. One gap: there is no test with SortEngine.LOSER_TREE specifically exercising the snapshot ordering path in an integration test (though SortMergeSnapshotOrderingTest covers both engines at the unit level via @EnumSource).
Overall: solid implementation for the targeted use case. The main risk area is the semantic overloading of minSequenceNumber/maxSequenceNumber and potential interaction with legacy files.
|
Thanks @JingsongLi for the detailed review. I addressed the actionable safety/documentation/test items in this update:
I did not include the broader comparator-strategy refactor in this PR because it would touch several merge paths and make the correctness-focused review larger. I added fail-fast checks and documentation around the current paths instead, and I think the strategy refactor is better handled as a follow-up cleanup. |
3c49861 to
9f30b95
Compare
Snapshot ordering relies on snapshot id to determine record order, but inline compaction happens before snapshot creation — files have not been stamped with the correct snapshot id yet. Compaction would propagate incorrect values to KV records. Enforce write-only=true at schema validation time so that writers use NoopCompactManager. Dedicated compact jobs override write-only at runtime via table.copy(), which passes dynamicOptionKeys to skip this specific check.
…tation for snapshot ordering Add fail-fast guard for legacy files without fileSource metadata, defensive assertions for mutually exclusive config options, Javadoc for within-snapshot ordering loss and SchemaValidation overload, and MIN_HEAP sort engine integration test coverage.
9f30b95 to
c88356e
Compare
Purpose
close #7806
Tests