[parquet] Add map shredding for hot keys #7877
Benchmark command:
mvn -s ~/.m2/apache-community.xml -pl paimon-format -am -Pfast-build \
-DfailIfNoTests=false -Dtest=MapShreddingStorageBenchmark test
Benchmark file: [MapShreddingStorageBenchmark.java]
Common Setup
Results
Scenario Details
Conclusion: in this synthetic storage benchmark, map shredding reduces file size in both cases. The biggest gain appears when hot map keys are long and repeated across many rows, saving about |
This looks like a very good fit for Variant. Why not use it?
Hi @JingsongLi, here are two reasons we are considering introducing shredding for the map type
Can you share some benchmarks? For example, the storage difference between map and variant that you mentioned?
Purpose
Add Parquet map shredding support for MAP<STRING, T> columns.
This allows selected map columns to extract hot keys into independent physical Parquet columns while preserving the original logical map schema for readers. The feature is controlled by map.shredding.* options, aligned with the existing variant.shredding.* naming style. It also adds a focused round-trip test and a storage benchmark to validate the storage benefit.
Tests
mvn -pl paimon-api,paimon-format -Pfast-build -DskipTests compile
mvn -pl paimon-format -am -Pfast-build -DfailIfNoTests=false -Dtest=ParquetFormatReadWriteTest#testMapShreddingRoundTrip,MapShreddingStorageBenchmark test
git diff --check
Physical Layout
This change does not introduce a new Parquet logical type and does not modify the standard Parquet MAP encoding. A shredded map is still written with the regular Parquet map group as the residual map. Hot keys are promoted into additional sibling sidecar columns in the parent Parquet group.
For example, a logical field:
is normally written as:
With map shredding enabled, if user-agent and host are selected as hot keys, the physical Parquet schema becomes:
The footer metadata records the mapping from sidecar columns to map keys:
During writing, entries for promoted hot keys are omitted from the residual map when their values are non-null, and their values are written into the corresponding sidecar columns. During reading, Paimon reads both the residual map and the sidecar columns, then reconstructs the original logical MAP<STRING, T> value.
For nested maps, the same rule applies within the containing row group. For example, for payload.headers, sidecar columns are added as siblings of the headers map inside the payload group, and the footer metadata uses the full logical path:
#7876
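The write and read rule described above (non-null values of hot keys go to sidecar columns, everything else stays in the residual map, and the reader merges both) can be illustrated with a minimal in-memory sketch. This is not the code in this PR: the class name MapShreddingSketch, the Shredded holder, and the shred and reconstruct helpers are hypothetical, plain Java maps stand in for Parquet columns, and values are fixed to String for brevity. Whether the real reader preserves the original entry order is not stated in the description, so the sketch only checks map equality.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the shredding rule; not Paimon's actual classes. */
public class MapShreddingSketch {

    /** One shredded map value: per-hot-key sidecar values plus the residual map. */
    static final class Shredded {
        final Map<String, String> sidecars = new LinkedHashMap<>();
        final Map<String, String> residual = new LinkedHashMap<>();
    }

    /** Write side: promote non-null values of hot keys, keep all other entries. */
    static Shredded shred(Map<String, String> logical, List<String> hotKeys) {
        Shredded out = new Shredded();
        for (String hot : hotKeys) {
            // A null sidecar value means "nothing promoted for this row".
            out.sidecars.put(hot, null);
        }
        for (Map.Entry<String, String> e : logical.entrySet()) {
            boolean promoted = hotKeys.contains(e.getKey()) && e.getValue() != null;
            if (promoted) {
                out.sidecars.put(e.getKey(), e.getValue());
            } else {
                // Cold keys, and hot keys with null values, stay in the residual map.
                out.residual.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    /** Read side: start from the residual map and add back non-null sidecar values. */
    static Map<String, String> reconstruct(Shredded s) {
        Map<String, String> logical = new LinkedHashMap<>(s.residual);
        for (Map.Entry<String, String> e : s.sidecars.entrySet()) {
            if (e.getValue() != null) {
                logical.put(e.getKey(), e.getValue());
            }
        }
        return logical;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("user-agent", "curl/8.0");
        headers.put("host", null);          // hot key, but null: stays in the residual map
        headers.put("x-trace-id", "abc");   // cold key: stays in the residual map

        Shredded s = shred(headers, Arrays.asList("user-agent", "host"));
        System.out.println("sidecars = " + s.sidecars);
        System.out.println("residual = " + s.residual);
        System.out.println("round trip equal = " + reconstruct(s).equals(headers));
    }
}
```

Keeping null-valued hot keys in the residual map is what lets a null sidecar cell mean "no promoted value for this row" without losing the distinction between a missing key and a key mapped to null.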