[PoC] Implement Parquet GH-583 INT96 timestamp ColumnOrder #10106
etseidl wants to merge 51 commits into
@etseidl Can we ask someone to review this PR? I'd like to start a vote on the INT96 changes soon, but we should get some eyes on this PR before that.
@divjotarora I don't think it's strictly necessary to have this reviewed before the spec change. It is more important to show that the PoCs can read each other's files. Ideally we'd add a file to parquet-testing with the new sort order specified and confirm that new readers correctly pick up the stats, while old readers ideally ignore the stats or at least don't die when trying to read the file. If anyone (@alamb or @jhorstmann perhaps) has time for a quick look, the changes for this PR can be viewed here (diff against the branch from #10104).
Thank you for the link to just the INT96 changes, they look good to me.
@etseidl Great point about parquet-testing, I've opened this PR to add a file that uses
Test failures are expected until apache/parquet-testing#115 is merged. |
Which issue does this PR close?
parquet::basic::ColumnOrder::sort_order_for_type #10104 (which depends on Implement PARQUET-2249: Introduce IEEE 754 total order #9619)
Rationale for this change
Spark continues to use INT96 timestamps, even though INT96 was marked as deprecated in 2018. Query engines want valid statistics so they can reliably prune on INT96 columns. apache/parquet-format#584 adds a new ColumnOrder variant which can be used to signal compliance with the only known use of INT96 (a 4-byte Julian day plus an 8-byte nanoseconds-within-the-day value).
What changes are included in this PR?
Adds support for the new enum variant, and writes the appropriate value in the FileMetaData.column_orders field.
This builds on changes introduced in #7687.
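For reviewers less familiar with INT96, here is a minimal, self-contained sketch of the ordering problem described in the rationale. It is not the code in this PR and does not use the parquet crate. `Int96Timestamp` and its methods are hypothetical names for illustration. It assumes the conventional layout (8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes of Julian day) and treats the day as a signed 32-bit value, which the spec change may define differently.

```rust
use std::cmp::Ordering;

/// Hypothetical illustration type, NOT the parquet crate's `Int96`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Int96Timestamp {
    julian_day: i32,   // assumption: signed 32-bit Julian day number
    nanos_of_day: u64, // nanoseconds elapsed within that day
}

impl Int96Timestamp {
    /// Decode the assumed 12-byte layout: nanoseconds first, then Julian day.
    fn from_le_bytes(bytes: [u8; 12]) -> Self {
        let mut nanos = [0u8; 8];
        nanos.copy_from_slice(&bytes[..8]);
        let mut day = [0u8; 4];
        day.copy_from_slice(&bytes[8..]);
        Self {
            julian_day: i32::from_le_bytes(day),
            nanos_of_day: u64::from_le_bytes(nanos),
        }
    }

    /// Encode back into the same assumed 12-byte layout.
    fn to_le_bytes(self) -> [u8; 12] {
        let mut out = [0u8; 12];
        out[..8].copy_from_slice(&self.nanos_of_day.to_le_bytes());
        out[8..].copy_from_slice(&self.julian_day.to_le_bytes());
        out
    }
}

impl Ord for Int96Timestamp {
    /// Chronological order: compare the day first, then nanoseconds within the day.
    fn cmp(&self, other: &Self) -> Ordering {
        self.julian_day
            .cmp(&other.julian_day)
            .then(self.nanos_of_day.cmp(&other.nanos_of_day))
    }
}

impl PartialOrd for Int96Timestamp {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    // Julian day 2_440_588 corresponds to 1970-01-01.
    let later = Int96Timestamp { julian_day: 2_440_588, nanos_of_day: 1 };
    let earlier = Int96Timestamp { julian_day: 2_440_587, nanos_of_day: 2 };

    // Chronologically, `later` sorts after `earlier`.
    assert!(later > earlier);

    // A lexicographic comparison of the raw bytes gives the opposite answer,
    // because the nanosecond bytes come first in the encoding.
    assert!(later.to_le_bytes() < earlier.to_le_bytes());

    // Round trip through the byte encoding.
    assert_eq!(Int96Timestamp::from_le_bytes(later.to_le_bytes()), later);
}
```

Min/max statistics computed with a raw byte comparison can therefore be wrong for pruning, which is why a reader would want an explicit signal in the column order before trusting INT96 statistics.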
Are these changes tested?
Yes
Are there any user-facing changes?
Yes, this adds a new variant to public enums (ColumnOrder::INT96_TIMESTAMP_ORDER, SortOrder::INT96_TIMESTAMP).