Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode) #3522

@iemejia

Description

RunLengthBitPackingHybridDecoder allocates two new arrays per PACKED run during decode:

case PACKED:
    int numGroups = header >>> 1;
    currentCount = numGroups * 8;
    currentBuffer = new int[currentCount]; // TODO: reuse a buffer
    byte[] bytes = new byte[numGroups * bitWidth];
    ...

The TODO comment in the code itself flags this as a known improvement opportunity. Dictionary-id pages encoded with the RLE/bit-packed hybrid encoder produce many PACKED runs per page, so each pass over a page with readInt() allocates two intermediate arrays per PACKED run, each sized in proportion to numGroups. Both arrays are discarded once the run has been consumed, which is pure GC pressure on a hot read path.
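To make the allocation sizes concrete, here is a minimal sketch of the arithmetic in the PACKED branch quoted above. PackedRunSizes and its methods are hypothetical names used only for illustration, not part of parquet-java. The only assumption beyond the snippet is the Parquet format rule that a bit-packed run header has its low bit set and carries the group count in the remaining bits.

```java
// Hypothetical helper for illustration only; it mirrors the sizing arithmetic
// of the PACKED branch and does not decode any data.
final class PackedRunSizes {

    /** Number of decoded values in a PACKED run: each group holds 8 values. */
    static int valueCount(int header) {
        int numGroups = header >>> 1; // drop the low run-type bit
        return numGroups * 8;
    }

    /** Packed input bytes: 8 values * bitWidth bits = bitWidth bytes per group. */
    static int packedByteCount(int header, int bitWidth) {
        int numGroups = header >>> 1;
        return numGroups * bitWidth;
    }

    public static void main(String[] args) {
        int header = (4 << 1) | 1; // 4 groups, low bit set for a bit-packed run
        int bitWidth = 10;         // same width as the benchmark in this issue
        System.out.println(valueCount(header));                // 32
        System.out.println(packedByteCount(header, bitWidth)); // 40
    }
}
```

With the current code, a run like this one costs a fresh int[32] and a fresh byte[40] every time it is read, however many runs came before it.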

Proposed change

Reuse the two buffers across PACKED runs within the same decoder instance, growing them only when a larger run is encountered:

  • Add int[] packedValuesBuffer and byte[] packedBytesBuffer instance fields, initialized to new int[0] / new byte[0].
  • In the PACKED branch, grow each buffer only when its length is below the required size.
  • Add currentBufferLength to track the logical length of the active region (since packedValuesBuffer.length may now exceed the current run).

This matches the pattern used by other decoders in the codebase that amortize buffer allocations.
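The three bullets above can be sketched roughly as follows. This is not the actual patch: the field names are taken from the proposal, while the class PackedBufferReuseSketch, the method preparePackedRun, and the allocations counter are hypothetical and exist only to show the grow-only behaviour. Stream reading and bit unpacking are left out.

```java
// Illustrative sketch of grow-only buffer reuse; not the real decoder.
final class PackedBufferReuseSketch {
    // Field names follow the proposal; both start empty and only ever grow.
    private int[] packedValuesBuffer = new int[0];
    private byte[] packedBytesBuffer = new byte[0];
    // Logical length of the active region, since the array may now be larger.
    private int currentBufferLength;

    int allocations; // instrumentation for this sketch only

    /** Sizes the buffers for one PACKED run, allocating only if the run outgrows them. */
    void preparePackedRun(int header, int bitWidth) {
        int numGroups = header >>> 1;
        int valueCount = numGroups * 8;
        int byteCount = numGroups * bitWidth;
        if (packedValuesBuffer.length < valueCount) {
            packedValuesBuffer = new int[valueCount];
            allocations++;
        }
        if (packedBytesBuffer.length < byteCount) {
            packedBytesBuffer = new byte[byteCount];
            allocations++;
        }
        currentBufferLength = valueCount;
    }

    int currentBufferLength() { return currentBufferLength; }
    int valuesCapacity() { return packedValuesBuffer.length; }
    int bytesCapacity() { return packedBytesBuffer.length; }

    public static void main(String[] args) {
        PackedBufferReuseSketch decoder = new PackedBufferReuseSketch();
        int bitWidth = 10;
        // Three PACKED runs of 4, 2 and 6 groups.
        for (int numGroups : new int[] {4, 2, 6}) {
            decoder.preparePackedRun((numGroups << 1) | 1, bitWidth);
        }
        // Runs 1 and 3 grow both buffers; run 2 fits in what is already there.
        System.out.println(decoder.allocations); // 4 (per-run allocation would be 6)
    }
}
```

One consequence of this design is that any slots beyond currentBufferLength, and any bytes beyond those read for the current run, may still hold data from an earlier run, so readers of the buffers have to bound themselves by the logical length rather than by the array length.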

Benchmark

A new RleDictionaryIndexDecodingBenchmark (added in #3511 / #3512) isolates the RLE/bit-packed dictionary-id decode path. Three index patterns over 100k INT32 values, BIT_WIDTH=10, JMH -wi 5 -i 10 -f 2 (20 measurement iterations):

| Pattern | master (ops/s) | optimized (ops/s) | Improvement |
|---|---|---|---|
| SEQUENTIAL | 93,061,521 | 113,856,860 | +22.3% |
| RANDOM | 92,929,824 | 114,238,638 | +22.9% |
| LOW_CARDINALITY | 92,813,229 | 115,271,347 | +24.2% |

End-to-end file read (FileReadBenchmark) sees a much smaller ~2% improvement because RLE decoding is only one of many pipeline stages, but it confirms the signal is real.
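As a rough sanity check on these numbers, the sketch below recomputes the improvement percentages from the ops/s figures and applies Amdahl's law to the end-to-end result. The class BenchmarkArithmetic and its methods are hypothetical. The Amdahl estimate is only a back-of-envelope reading: it assumes the file read sees the same ~22% speedup on its RLE decode stage and that the end-to-end gain is exactly 2%, neither of which was measured directly here.

```java
import java.util.Locale;

// Hypothetical helper for illustration; the inputs are the figures quoted in this issue.
final class BenchmarkArithmetic {

    /** Relative throughput gain of optimized over baseline, in percent. */
    static double improvementPercent(double baselineOps, double optimizedOps) {
        return (optimizedOps / baselineOps - 1.0) * 100.0;
    }

    /**
     * Amdahl's law, 1/overall = (1 - p) + p/stage, solved for p: the fraction of
     * total time spent in the accelerated stage.
     */
    static double impliedFraction(double stageSpeedup, double overallSpeedup) {
        return (1.0 - 1.0 / overallSpeedup) / (1.0 - 1.0 / stageSpeedup);
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT, "%.1f",
                improvementPercent(93_061_521, 113_856_860)));  // 22.3
        System.out.println(String.format(Locale.ROOT, "%.1f",
                improvementPercent(92_929_824, 114_238_638)));  // 22.9
        System.out.println(String.format(Locale.ROOT, "%.1f",
                improvementPercent(92_813_229, 115_271_347)));  // 24.2
        // Assumed inputs: 1.223x on the decode stage, 1.02x end to end.
        System.out.println(String.format(Locale.ROOT, "%.2f",
                impliedFraction(1.223, 1.02)));                 // 0.11
    }
}
```

Under those assumptions, RLE decoding would account for roughly a tenth of the end-to-end read time, which is consistent with it being one stage among many.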

Validation

  • 573 parquet-column tests pass
  • 9 TestRunLengthBitPackingHybridEncoder tests pass (these round-trip through the decoder)
  • Built with -Dspotless.check.skip=true -Drat.skip=true -Djapicmp.skip=true

Scope

~22 LOC change to a single file (RunLengthBitPackingHybridDecoder.java). Self-contained and obviously correct (resolves the existing TODO).

Related

Part of the focused performance PR series from https://github.com/iemejia/parquet-perf. The companion ByteStreamSplit writer/reader changes from the same source commit have already been submitted as #3504 (writer) and #3506 (reader).
