Description
`RunLengthBitPackingHybridDecoder` allocates two new arrays per PACKED run during decode:

```java
case PACKED:
  int numGroups = header >>> 1;
  currentCount = numGroups * 8;
  currentBuffer = new int[currentCount]; // TODO: reuse a buffer
  byte[] bytes = new byte[numGroups * bitWidth];
  ...
```
The TODO comment in the code itself flags this as a known improvement opportunity. Dictionary-id pages encoded with the RLE/bit-packed hybrid encoder produce many PACKED runs per page, so each `readInt()` traversal of a page allocates O(numGroups) intermediate arrays that are immediately discarded — pure GC pressure on a hot read path.
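To make the GC-pressure claim concrete, here is a back-of-envelope calculation of the garbage one PACKED run creates on master. This is an illustration, not a measurement: `numGroups = 4` is an arbitrary example value, BIT_WIDTH = 10 matches the benchmark parameters, and object headers are not counted.

```java
// Rough per-run garbage on master (illustrative only; numGroups = 4 is an
// assumed example size, array object headers are ignored).
public class PackedRunGarbage {
    public static void main(String[] args) {
        int bitWidth = 10;
        int numGroups = 4;
        long intArrayBytes = (long) numGroups * 8 * Integer.BYTES; // new int[numGroups * 8]
        long byteArrayBytes = (long) numGroups * bitWidth;         // new byte[numGroups * bitWidth]
        System.out.println(intArrayBytes + byteArrayBytes);        // bytes discarded per run
    }
}
```

Multiplied by the number of PACKED runs per page and pages per file, this is steady allocation churn for arrays that never outlive a single run.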
Proposed change
Reuse the two buffers across PACKED runs within the same decoder instance, growing them only when a larger run is encountered:
- Add `int[] packedValuesBuffer` and `byte[] packedBytesBuffer` instance fields, initialized to `new int[0]` / `new byte[0]`.
- In the PACKED branch, grow each buffer only when its length is below the required size.
- Add `currentBufferLength` to track the logical length of the active region (since `packedValuesBuffer.length` may now exceed the current run).
This matches the pattern used by other decoders in the codebase that amortize buffer allocations.
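The steps above can be sketched as follows. This is a minimal illustration of the grow-on-demand pattern, not the actual patch: the field names follow the PR's proposal, the method name `preparePackedRun` is hypothetical, and the decode logic is reduced to allocation behavior only.

```java
// Minimal sketch of the proposed buffer reuse (assumed names; not the patch).
public class PackedBufferReuseSketch {
    private int[] packedValuesBuffer = new int[0];
    private byte[] packedBytesBuffer = new byte[0];
    private int currentBufferLength; // logical length of the active region

    // Simulates the PACKED branch: grow each buffer only when the run
    // needs more capacity than the decoder already holds.
    public int[] preparePackedRun(int numGroups, int bitWidth) {
        int count = numGroups * 8;
        if (packedValuesBuffer.length < count) {
            packedValuesBuffer = new int[count];
        }
        int byteCount = numGroups * bitWidth;
        if (packedBytesBuffer.length < byteCount) {
            packedBytesBuffer = new byte[byteCount];
        }
        currentBufferLength = count; // the run may use less than .length
        return packedValuesBuffer;
    }

    public int getCurrentBufferLength() {
        return currentBufferLength;
    }

    public static void main(String[] args) {
        PackedBufferReuseSketch d = new PackedBufferReuseSketch();
        int[] first = d.preparePackedRun(4, 10);  // grows to 32 ints
        int[] second = d.preparePackedRun(2, 10); // 16 <= 32: no new allocation
        System.out.println(first == second);      // true: same array reused
        System.out.println(d.getCurrentBufferLength()); // 16
    }
}
```

Smaller runs reuse the largest buffer seen so far, which is why the explicit `currentBufferLength` is needed: the array length no longer equals the run length.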
Benchmark
A new `RleDictionaryIndexDecodingBenchmark` (added in #3511 / #3512) isolates the RLE/bit-packed dictionary-id decode path. Three index patterns over 100k INT32 values, BIT_WIDTH=10, JMH `-wi 5 -i 10 -f 2` (20 measurement iterations):
| Pattern | master (ops/s) | optimized (ops/s) | Improvement |
| --- | --- | --- | --- |
| SEQUENTIAL | 93,061,521 | 113,856,860 | +22.3% |
| RANDOM | 92,929,824 | 114,238,638 | +22.9% |
| LOW_CARDINALITY | 92,813,229 | 115,271,347 | +24.2% |
End-to-end file read (`FileReadBenchmark`) sees a much smaller ~2% improvement because RLE decoding is only one of many pipeline stages, but it confirms the signal is real.
Validation
- 573 `parquet-column` tests pass
- 9 `TestRunLengthBitPackingHybridEncoder` tests pass (these round-trip through the decoder)
- Built with `-Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true`
Scope
~22 LOC change to a single file (`RunLengthBitPackingHybridDecoder.java`). Self-contained and obviously correct (resolves the existing TODO).
Related
Part of the focused performance PR series from https://github.com/iemejia/parquet-perf. The companion ByteStreamSplit writer/reader changes from the same source commit have already been submitted as #3504 (writer) and #3506 (reader).