
[PoC] Implement Parquet GH-583 INT96 timestamp ColumnOrder#10106

Draft
etseidl wants to merge 51 commits into apache:main from etseidl:int96_order

etseidl commented Jun 10, 2026
Contributor

Which issue does this PR close?

Rationale for this change

Spark continues to use INT96 timestamps, even though INT96 was marked as deprecated in 2018. Query engines want valid statistics so they can prune reliably on INT96 columns. apache/parquet-format#584 adds a new ColumnOrder variant that signals compliance with the only known use of INT96: a timestamp made up of a 4-byte Julian day and an 8-byte nanoseconds-within-the-day value.
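To illustrate why a dedicated ordering is needed, here is a minimal sketch of a chronological comparison for raw INT96 values. It is not the code from this PR. It assumes the commonly used layout (the first 8 bytes hold the nanoseconds within the day, the last 4 bytes hold the Julian day, both little-endian) and assumes the day is a signed 32-bit integer; apache/parquet-format#584 is the authoritative definition. The function names `encode_int96`, `decode_int96`, and `compare_int96_chronological` are hypothetical.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: pack a (Julian day, nanoseconds of day) pair into
/// 12 bytes, assuming nanoseconds come first and both parts are little-endian.
fn encode_int96(julian_day: i32, nanos_of_day: u64) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[0..8].copy_from_slice(&nanos_of_day.to_le_bytes());
    out[8..12].copy_from_slice(&julian_day.to_le_bytes());
    out
}

/// Hypothetical helper: split the 12 bytes back into (day, nanoseconds).
/// Treating the day as signed is an assumption of this sketch.
fn decode_int96(bytes: [u8; 12]) -> (i32, u64) {
    let mut nanos = [0u8; 8];
    nanos.copy_from_slice(&bytes[0..8]);
    let mut day = [0u8; 4];
    day.copy_from_slice(&bytes[8..12]);
    (i32::from_le_bytes(day), u64::from_le_bytes(nanos))
}

/// Compare two INT96 timestamps chronologically: day first, then nanoseconds.
/// Tuples compare lexicographically, which gives exactly that order.
fn compare_int96_chronological(a: [u8; 12], b: [u8; 12]) -> Ordering {
    decode_int96(a).cmp(&decode_int96(b))
}
```

A plain byte-wise comparison of the same values looks at the low-order nanosecond bytes first, so it can rank a later timestamp below an earlier one. Min/max statistics computed that way are not safe to prune on.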

What changes are included in this PR?

Adds support for the new enum variant, and writes the appropriate value in the FileMetaData.column_orders field.

This builds on changes introduced in #7687.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, this adds new variants to two public enums (ColumnOrder::INT96_TIMESTAMP_ORDER and SortOrder::INT96_TIMESTAMP).

etseidl commented Jun 10, 2026
Contributor Author

Note: this includes changes from #9619 and #10104, so it cannot be reviewed or merged until those PRs are merged.

etseidl added the api-change, next-major-release, and parquet labels Jun 10, 2026
@divjotarora

@etseidl Can we ask someone to review this PR? I'd like to start a vote on the INT96 changes soon, but we should get some eyes on this PR before that.

etseidl commented Jun 22, 2026
Contributor Author

> @etseidl Can we ask someone to review this PR? I'd like to start a vote on the INT96 changes soon, but we should get some eyes on this PR before that.

@divjotarora I don't think it's strictly necessary to have this reviewed before the spec change. More important would be to show that the PoCs can read each other's files. Ideally we'd add a file to parquet-testing with the new sort order specified and make sure that new readers correctly get the stats, and that old readers ignore the stats or, at the very least, don't die when trying to read the file.
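The reader behavior described above could be sketched roughly as follows. This is a hypothetical policy for illustration, not the logic in this PR or in any other reader: the types `ColumnOrderSketch` and `PhysicalTypeSketch` and the function `can_use_min_max` are invented names. The sketch assumes that INT96 min/max statistics are only trustworthy when the file declares the new INT96 timestamp order, and that a reader which sees an order it does not recognize should ignore the statistics instead of failing.

```rust
/// Hypothetical, simplified view of the column order recorded in the file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColumnOrderSketch {
    /// The long-standing type-defined order.
    TypeDefined,
    /// The new order proposed for INT96 timestamps.
    Int96Timestamp,
    /// Any order this reader does not recognize.
    Unknown,
}

/// Hypothetical, simplified physical type: INT96 versus everything else.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PhysicalTypeSketch {
    Int96,
    Other,
}

/// Decide whether min/max statistics may be used for pruning.
/// Returning false means "read the data without pruning", never "fail".
fn can_use_min_max(physical: PhysicalTypeSketch, order: ColumnOrderSketch) -> bool {
    match (physical, order) {
        // New readers trust INT96 stats only under the new order.
        (PhysicalTypeSketch::Int96, ColumnOrderSketch::Int96Timestamp) => true,
        // INT96 under any other order: stats are not usable.
        (PhysicalTypeSketch::Int96, _) => false,
        // Other types keep relying on the type-defined order.
        (PhysicalTypeSketch::Other, ColumnOrderSketch::TypeDefined) => true,
        // Unrecognized or mismatched orders: ignore the stats.
        (PhysicalTypeSketch::Other, _) => false,
    }
}
```

An old reader would land in the `Unknown` branch for a file written with the new order, which is the "ignore the stats, but don't die" outcome a shared parquet-testing file is meant to verify.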

If anyone (@alamb or @jhorstmann perhaps) has time for a quick look, the changes for this PR can be viewed here (diff against the branch from #10104).

@jhorstmann

Contributor

Thank you for the link to just the INT96 changes, they look good to me.

@divjotarora

> > @etseidl Can we ask someone to review this PR? I'd like to start a vote on the INT96 changes soon, but we should get some eyes on this PR before that.
>
> @divjotarora I don't think it's strictly necessary to have this reviewed before the spec change. More important would be to show that the PoCs can read each other's files. Ideally we'd add a file to parquet-testing with the new sort order specified and make sure the new readers correctly get the stats, and old readers hopefully ignore the stats, but at least don't die when trying to read the file.

@etseidl Great point about parquet-testing. I've opened a PR to add a file that uses INT96_TIMESTAMP_ORDER: apache/parquet-testing#115. The file was generated using the parquet-java changes from apache/parquet-java#3610. Can you try reading it with these changes and validate that it works as expected?

etseidl commented Jun 23, 2026
Contributor Author

Test failures are expected until apache/parquet-testing#115 is merged.

Labels

api-change: Changes to the arrow API
next-major-release: the PR has API changes and is waiting on the next major version
parquet: Changes to the parquet crate

Development

Successfully merging this pull request may close these issues.

Implement PoC for Parquet GH-583: Introduce chronological ordering for INT96 timestamps

3 participants