Skip to content

Cleanup binary write path: avoid Binary.getBytes() copy in DeltaByteArrayWriter and remove LE wrapper from FixedLenByteArrayPlainValuesWriter #3520

@iemejia

Description

@iemejia

Background

Two small cleanups in the binary write path:

1. DeltaByteArrayWriter.writeBytes(Binary): avoid unconditional copy of the input bytes

The first line is:

byte[] vb = v.getBytes();

Binary.getBytes() is contractually required to return a fresh array the caller can keep and mutate. For ByteArrayBackedBinary (the most common case from Binary.fromConstantByteArray() and similar), the implementation does an unconditional Arrays.copyOf of the backing array — but DeltaByteArrayWriter only reads vb and then drops it. The copy is wasted work on every value.

The right call here is v.copy().getBytesUnsafe():

  • Binary.copy() is a no-op (return this) for constant Binaries that are already independent of any reused buffer.
  • For reused-buffer Binaries (e.g. ByteBufferBackedBinary over a slab being mutated), copy() snapshots them — preserving correctness.
  • getBytesUnsafe() then returns the backing array directly without a defensive copy.

For the common ByteArrayBackedBinary case this skips the entire copy. For other implementations the copy still happens but only when it's actually needed for correctness.

2. FixedLenByteArrayPlainValuesWriter: drop the unused LittleEndianDataOutputStream wrapper

Same pattern as #3516 fixed in DeltaLengthByteArrayValuesWriter: the writer wraps its CapacityByteArrayOutputStream with a LittleEndianDataOutputStream that's only used to call Binary.writeTo() — i.e. the LE wrapper adds a layer of dispatch on every value but never uses any LE-specific functionality (writeInt/writeLong/etc.). Binary.writeTo(arrayOut) works directly with the underlying stream.

The trailing out.flush() in getBytes() is also dead — CapacityByteArrayOutputStream doesn't buffer, and the LE wrapper has no buffer of its own (per the bulk-write change in #3518, the writeBuffer[] is a per-call scratch).

Files affected

  • parquet-column/src/main/java/org/apache/parquet/column/values/deltastrings/DeltaByteArrayWriter.java
  • parquet-column/src/main/java/org/apache/parquet/column/values/plain/FixedLenByteArrayPlainValuesWriter.java

No public API change. No file format change.

Expected impact

Both changes are small, locally obvious, and align with the broader migration away from LittleEndianDataOutputStream already in flight (PRs #3496, #3518).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions