GH-3522: Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode)#3523

Open
iemejia wants to merge 1 commit into apache:master from iemejia:perf-rle-buffer-reuse
Conversation

@iemejia (Member) commented Apr 21, 2026

Summary

Closes #3522.

RunLengthBitPackingHybridDecoder allocates a new int[] and byte[] on every PACKED run during decode. The code itself flagged this with a // TODO: reuse a buffer comment. This PR resolves the TODO by reusing the buffers across runs within the same decoder instance, growing them lazily only when a larger run is encountered.

Also adds a currentBufferLength field to track the logical active-region length in packedValuesBuffer (since packedValuesBuffer.length may now exceed the current run's size after a prior larger run grew it).
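
The reuse-and-grow pattern can be sketched in isolation as follows. This is a minimal illustration of the idea, not the actual Parquet code: the class and method names (`PackedRunBuffers`, `prepareForRun`) are hypothetical, though `packedValuesBuffer` and `currentBufferLength` mirror the fields described above.

```java
// Sketch of the lazy-grow buffer reuse: keep the int[] values buffer and
// byte[] read-staging buffer as instance fields, grow them only when a run
// needs more room, and track the logical length separately because the
// array capacity may exceed it after a prior larger run.
class PackedRunBuffers {
    private int[] packedValuesBuffer = new int[0]; // decoded values
    private byte[] byteBuffer = new byte[0];       // raw read-staging bytes
    private int currentBufferLength = 0;           // logical active length

    // Called once per PACKED run instead of allocating fresh arrays.
    void prepareForRun(int numValues, int numBytes) {
        if (packedValuesBuffer.length < numValues) {
            // Old contents are not needed across runs, so a plain
            // reallocation (no copy) is sufficient.
            packedValuesBuffer = new int[numValues];
        }
        if (byteBuffer.length < numBytes) {
            byteBuffer = new byte[numBytes];
        }
        currentBufferLength = numValues; // capacity may exceed this
    }

    int[] values() { return packedValuesBuffer; }
    int logicalLength() { return currentBufferLength; }
}
```

After a large run grows the arrays, subsequent smaller runs hit the fast path with zero allocations, which is where the throughput win comes from.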

Benchmark

RleDictionaryIndexDecodingBenchmark (added in #3512) isolates the RLE/bit-packed dictionary-id decode path. 100k INT32 dictionary IDs, BIT_WIDTH=10, JMH -wi 5 -i 10 -f 2 (20 measurement iterations):

  Pattern         | master (ops/s) | optimized (ops/s) | Improvement
  SEQUENTIAL      |   93,061,521   |    113,856,860    |   +22.3%
  RANDOM          |   92,929,824   |    114,238,638    |   +22.9%
  LOW_CARDINALITY |   92,813,229   |    115,271,347    |   +24.2%

End-to-end FileReadBenchmark sees a much smaller ~2% improvement because RLE decoding is only one of many pipeline stages; the isolated micro-benchmark shows the true magnitude on the affected code path.
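
As a quick sanity check on the buffer sizes involved: in the RLE/bit-packing hybrid encoding, PACKED runs store values in groups of 8, so a group of 8 values at bit width w packs into exactly w bytes. A sketch (the helper name is hypothetical):

```java
// PACKED runs in the RLE/bit-packing hybrid encoding hold values in
// groups of 8; each group of 8 values at bitWidth bits per value packs
// into exactly bitWidth bytes (8 * bitWidth bits = bitWidth bytes).
class PackedSizeMath {
    static int packedRunBytes(int numGroups, int bitWidth) {
        return numGroups * bitWidth;
    }
}
```

At BIT_WIDTH=10, the benchmark's 100k values correspond to 12,500 groups, i.e. 125,000 bytes if packed as one logical run; that is the scale of staging buffer the decoder would otherwise re-allocate on every run.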

Validation

  • parquet-column: 573 tests pass
  • TestRunLengthBitPackingHybridEncoder: 9 tests pass (these round-trip values through the decoder)
  • Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true

Scope

A 17-line change to a single file. Self-contained and low-risk: it resolves the TODO already flagged in the code.

Related

Part of the focused performance PR series from https://github.com/iemejia/parquet-perf. The companion ByteStreamSplit writer/reader changes from the same source commit (ba52f82c3) have already been submitted as #3504 and #3506.

How to reproduce

The benchmark is added in #3512. Once that lands, reproduce with:

./mvnw clean package -pl parquet-benchmarks -DskipTests \
    -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true
java -jar parquet-benchmarks/target/parquet-benchmarks.jar \
    'RleDictionaryIndexDecodingBenchmark' -wi 5 -i 10 -f 2

…dDecoder PACKED path

Allocate the int[] values buffer and byte[] read-staging buffer once per
decoder and grow them lazily, instead of allocating fresh arrays on every
PACKED run. Resolves the existing "TODO: reuse a buffer" comment.

A new currentBufferLength field tracks the logical length of the active
region in packedValuesBuffer (which may now exceed the current run's
size after a prior larger run grew it).

Benchmark (RleDictionaryIndexDecodingBenchmark, 100k INT32, BIT_WIDTH=10,
JMH -wi 5 -i 10 -f 2):

  Pattern         | master ops/s | optimized ops/s | Improvement
  SEQUENTIAL      |  93,061,521  |   113,856,860   |   +22.3%
  RANDOM          |  92,929,824  |   114,238,638   |   +22.9%
  LOW_CARDINALITY |  92,813,229  |   115,271,347   |   +24.2%

End-to-end FileReadBenchmark sees ~2% improvement (RLE decoding is a
small fraction of full file reads).

Validation: 573 parquet-column tests pass. Built with
-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true.
@arouel (Contributor) left a comment:

Have you seen PR #3467, which addresses the same issue? I would prefer that we help each other with reviews instead of creating more PRs of the same kind.
