GH-3522: Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode)#3523

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-rle-buffer-reuse
Conversation

@iemejia (Member) commented Apr 21, 2026

Summary

Closes #3522.

RunLengthBitPackingHybridDecoder allocates a new int[] and byte[] on every PACKED run during decode. The code itself flagged this with a // TODO: reuse a buffer comment. This PR resolves the TODO by reusing the buffers across runs within the same decoder instance, growing them lazily only when a larger run is encountered.

Also adds a currentBufferLength field to track the logical active-region length in packedValuesBuffer (since packedValuesBuffer.length may now exceed the current run's size after a prior larger run grew it).
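
The reuse-and-grow pattern can be sketched in isolation as follows. This is a minimal illustration of the idea, not the actual Parquet code: the class and method names (`PackedRunBuffers`, `prepareForRun`) are hypothetical, though `packedValuesBuffer` and `currentBufferLength` mirror the fields described above.

```java
// Sketch of the lazy-grow buffer reuse: keep the int[] values buffer and
// byte[] read-staging buffer as instance fields, grow them only when a run
// needs more room, and track the logical length separately because the
// array capacity may exceed it after a prior larger run.
class PackedRunBuffers {
    private int[] packedValuesBuffer = new int[0]; // decoded values
    private byte[] byteBuffer = new byte[0];       // raw read-staging bytes
    private int currentBufferLength = 0;           // logical active length

    // Called once per PACKED run instead of allocating fresh arrays.
    void prepareForRun(int numValues, int numBytes) {
        if (packedValuesBuffer.length < numValues) {
            // Old contents are not needed across runs, so a plain
            // reallocation (no copy) is sufficient.
            packedValuesBuffer = new int[numValues];
        }
        if (byteBuffer.length < numBytes) {
            byteBuffer = new byte[numBytes];
        }
        currentBufferLength = numValues; // capacity may exceed this
    }

    int[] values() { return packedValuesBuffer; }
    int logicalLength() { return currentBufferLength; }
}
```

After a large run grows the arrays, subsequent smaller runs hit the fast path with zero allocations, which is where the throughput win comes from.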

Benchmark

RleDictionaryIndexDecodingBenchmark (added in #3512) isolates the RLE/bit-packed dictionary-id decode path. 100k INT32 dictionary IDs, BIT_WIDTH=10, JMH -wi 5 -i 10 -f 2 (20 measurement iterations):

  Pattern         | master (ops/s) | optimized (ops/s) | Improvement
  SEQUENTIAL      |   93,061,521   |    113,856,860    |   +22.3%
  RANDOM          |   92,929,824   |    114,238,638    |   +22.9%
  LOW_CARDINALITY |   92,813,229   |    115,271,347    |   +24.2%

End-to-end FileReadBenchmark sees a much smaller ~2% improvement because RLE decoding is only one of many pipeline stages; the isolated micro-benchmark shows the true magnitude on the affected code path.
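
As a quick sanity check on the buffer sizes involved: in the RLE/bit-packing hybrid encoding, PACKED runs store values in groups of 8, so a group of 8 values at bit width w packs into exactly w bytes. A sketch (the helper name is hypothetical):

```java
// PACKED runs in the RLE/bit-packing hybrid encoding hold values in
// groups of 8; each group of 8 values at bitWidth bits per value packs
// into exactly bitWidth bytes (8 * bitWidth bits = bitWidth bytes).
class PackedSizeMath {
    static int packedRunBytes(int numGroups, int bitWidth) {
        return numGroups * bitWidth;
    }
}
```

At BIT_WIDTH=10, the benchmark's 100k values correspond to 12,500 groups, i.e. 125,000 bytes if packed as one logical run; that is the scale of staging buffer the decoder would otherwise re-allocate on every run.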

Validation

  • parquet-column: 573 tests pass
  • TestRunLengthBitPackingHybridEncoder: 9 tests pass (these round-trip values through the decoder)
  • Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true

Scope

A 17-line change to a single file. Self-contained and low-risk: it resolves the TODO already flagged in the code.

Related

Part of the focused performance PR series from https://github.com/iemejia/parquet-perf. The companion ByteStreamSplit writer/reader changes from the same source commit (ba52f82c3) have already been submitted as #3504 and #3506.

How to reproduce

The benchmark is added in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar \
    'RleDictionaryIndexDecodingBenchmark' -wi 5 -i 10 -f 2

…dDecoder PACKED path

Allocate the int[] values buffer and byte[] read-staging buffer once per
decoder and grow them lazily, instead of allocating fresh arrays on every
PACKED run. Resolves the existing "TODO: reuse a buffer" comment.

A new currentBufferLength field tracks the logical length of the active
region in packedValuesBuffer (which may now exceed the current run's
size after a prior larger run grew it).

Benchmark (RleDictionaryIndexDecodingBenchmark, 100k INT32, BIT_WIDTH=10,
JMH -wi 5 -i 10 -f 2):

  Pattern         | master ops/s | optimized ops/s | Improvement
  SEQUENTIAL      |  93,061,521  |   113,856,860   |   +22.3%
  RANDOM          |  92,929,824  |   114,238,638   |   +22.9%
  LOW_CARDINALITY |  92,813,229  |   115,271,347   |   +24.2%

End-to-end FileReadBenchmark sees ~2% improvement (RLE decoding is a
small fraction of full file reads).

Validation: 573 parquet-column tests pass. Built with
-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
@arouel (Contributor) left a comment:

Have you seen PR #3467, which addresses the same issue? I would prefer that we help each other with reviews instead of creating more PRs of the same kind.
