Description
`RunLengthBitPackingHybridDecoder` allocates two new arrays per PACKED run during decode:

```java
case PACKED:
  int numGroups = header >>> 1;
  currentCount = numGroups * 8;
  currentBuffer = new int[currentCount]; // TODO: reuse a buffer
  byte[] bytes = new byte[numGroups * bitWidth];
  ...
```
The TODO comment in the code itself flags this as a known improvement opportunity. Dictionary-id pages encoded with the RLE/bit-packed hybrid encoder produce many PACKED runs per page, so each `readInt()` traversal of a page allocates O(numGroups) intermediate arrays that are immediately discarded — pure GC pressure on a hot read path.
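To make the GC-pressure claim concrete, here is a back-of-envelope calculation of the garbage one PACKED run creates on master. This is an illustration, not a measurement: `numGroups = 4` is an arbitrary example value, BIT_WIDTH = 10 matches the benchmark parameters, and object headers are not counted.

```java
// Rough per-run garbage on master (illustrative only; numGroups = 4 is an
// assumed example size, array object headers are ignored).
public class PackedRunGarbage {
    public static void main(String[] args) {
        int bitWidth = 10;
        int numGroups = 4;
        long intArrayBytes = (long) numGroups * 8 * Integer.BYTES; // new int[numGroups * 8]
        long byteArrayBytes = (long) numGroups * bitWidth;         // new byte[numGroups * bitWidth]
        System.out.println(intArrayBytes + byteArrayBytes);        // bytes discarded per run
    }
}
```

Multiplied by the number of PACKED runs per page and pages per file, this is steady allocation churn for arrays that never outlive a single run.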
Proposed change
Reuse the two buffers across PACKED runs within the same decoder instance, growing them only when a larger run is encountered:
- Add `int[] packedValuesBuffer` and `byte[] packedBytesBuffer` instance fields, initialized to `new int[0]` / `new byte[0]`.
- In the PACKED branch, grow each buffer only when its length is below the required size.
- Add `currentBufferLength` to track the logical length of the active region (since `packedValuesBuffer.length` may now exceed the current run).
This matches the pattern used by other decoders in the codebase that amortize buffer allocations.
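The steps above can be sketched as follows. This is a minimal illustration of the grow-on-demand pattern, not the actual patch: the field names follow the PR's proposal, the method name `preparePackedRun` is hypothetical, and the decode logic is reduced to allocation behavior only.

```java
// Minimal sketch of the proposed buffer reuse (assumed names; not the patch).
public class PackedBufferReuseSketch {
    private int[] packedValuesBuffer = new int[0];
    private byte[] packedBytesBuffer = new byte[0];
    private int currentBufferLength; // logical length of the active region

    // Simulates the PACKED branch: grow each buffer only when the run
    // needs more capacity than the decoder already holds.
    public int[] preparePackedRun(int numGroups, int bitWidth) {
        int count = numGroups * 8;
        if (packedValuesBuffer.length < count) {
            packedValuesBuffer = new int[count];
        }
        int byteCount = numGroups * bitWidth;
        if (packedBytesBuffer.length < byteCount) {
            packedBytesBuffer = new byte[byteCount];
        }
        currentBufferLength = count; // the run may use less than .length
        return packedValuesBuffer;
    }

    public int getCurrentBufferLength() {
        return currentBufferLength;
    }

    public static void main(String[] args) {
        PackedBufferReuseSketch d = new PackedBufferReuseSketch();
        int[] first = d.preparePackedRun(4, 10);  // grows to 32 ints
        int[] second = d.preparePackedRun(2, 10); // 16 <= 32: no new allocation
        System.out.println(first == second);      // true: same array reused
        System.out.println(d.getCurrentBufferLength()); // 16
    }
}
```

Smaller runs reuse the largest buffer seen so far, which is why the explicit `currentBufferLength` is needed: the array length no longer equals the run length.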
Benchmark
A new `RleDictionaryIndexDecodingBenchmark` (added in #3511 / #3512) isolates the RLE/bit-packed dictionary-id decode path. Three index patterns over 100k INT32 values, BIT_WIDTH=10, JMH `-wi 5 -i 10 -f 2` (20 measurement iterations):
| Pattern | master (ops/s) | optimized (ops/s) | Improvement |
| --- | --- | --- | --- |
| SEQUENTIAL | 93,061,521 | 113,856,860 | +22.3% |
| RANDOM | 92,929,824 | 114,238,638 | +22.9% |
| LOW_CARDINALITY | 92,813,229 | 115,271,347 | +24.2% |
End-to-end file read (`FileReadBenchmark`) sees a much smaller ~2% improvement because RLE decoding is only one of many pipeline stages, but it confirms the signal is real.
Validation
- 573 `parquet-column` tests pass
- 9 `TestRunLengthBitPackingHybridEncoder` tests pass (these round-trip through the decoder)
- Built with `-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true`
Scope
~22 LOC change to a single file (`RunLengthBitPackingHybridDecoder.java`). Self-contained and obviously correct (resolves the existing TODO).
Related
Part of the focused performance PR series from https://github.com/iemejia/parquet-perf. The companion ByteStreamSplit writer/reader changes from the same source commit have already been submitted as #3504 (writer) and #3506 (reader).