Background
Two small cleanups in the binary write path:
1. DeltaByteArrayWriter.writeBytes(Binary): avoid unconditional copy of the input bytes
The first line is:
byte[] vb = v.getBytes();
Binary.getBytes() is contractually required to return a fresh array the caller can keep and mutate. For ByteArrayBackedBinary (the most common case from Binary.fromConstantByteArray() and similar), the implementation does an unconditional Arrays.copyOf of the backing array — but DeltaByteArrayWriter only reads vb and then drops it. The copy is wasted work on every value.
The right call here is v.copy().getBytesUnsafe():
Binary.copy() is a no-op (return this) for constant Binaries that are already independent of any reused buffer.
- For reused-buffer Binaries (e.g.
ByteBufferBackedBinary over a slab being mutated), copy() snapshots them — preserving correctness.
getBytesUnsafe() then returns the backing array directly without a defensive copy.
For the common ByteArrayBackedBinary case this skips the entire copy. For other implementations the copy still happens but only when it's actually needed for correctness.
2. FixedLenByteArrayPlainValuesWriter: drop the unused LittleEndianDataOutputStream wrapper
Same pattern as #3516 fixed in DeltaLengthByteArrayValuesWriter: the writer wraps its CapacityByteArrayOutputStream with a LittleEndianDataOutputStream that's only used to call Binary.writeTo() — i.e. the LE wrapper adds a layer of dispatch on every value but never uses any LE-specific functionality (writeInt/writeLong/etc.). Binary.writeTo(arrayOut) works directly with the underlying stream.
The trailing out.flush() in getBytes() is also dead — CapacityByteArrayOutputStream doesn't buffer, and the LE wrapper has no buffer of its own (per the bulk-write change in #3518, the writeBuffer[] is a per-call scratch).
Files affected
parquet-column/src/main/java/org/apache/parquet/column/values/deltastrings/DeltaByteArrayWriter.java
parquet-column/src/main/java/org/apache/parquet/column/values/plain/FixedLenByteArrayPlainValuesWriter.java
No public API change. No file format change.
Expected impact
Both changes are small, locally obvious, and align with the broader migration away from LittleEndianDataOutputStream already in flight (PRs #3496, #3518).
Background
Two small cleanups in the binary write path:
1.
DeltaByteArrayWriter.writeBytes(Binary): avoid unconditional copy of the input bytesThe first line is:
Binary.getBytes()is contractually required to return a fresh array the caller can keep and mutate. ForByteArrayBackedBinary(the most common case fromBinary.fromConstantByteArray()and similar), the implementation does an unconditionalArrays.copyOfof the backing array — butDeltaByteArrayWriteronly readsvband then drops it. The copy is wasted work on every value.The right call here is
v.copy().getBytesUnsafe():Binary.copy()is a no-op (return this) for constant Binaries that are already independent of any reused buffer.ByteBufferBackedBinaryover a slab being mutated),copy()snapshots them — preserving correctness.getBytesUnsafe()then returns the backing array directly without a defensive copy.For the common
ByteArrayBackedBinarycase this skips the entire copy. For other implementations the copy still happens but only when it's actually needed for correctness.2.
FixedLenByteArrayPlainValuesWriter: drop the unusedLittleEndianDataOutputStreamwrapperSame pattern as #3516 fixed in
DeltaLengthByteArrayValuesWriter: the writer wraps itsCapacityByteArrayOutputStreamwith aLittleEndianDataOutputStreamthat's only used to callBinary.writeTo()— i.e. the LE wrapper adds a layer of dispatch on every value but never uses any LE-specific functionality (writeInt/writeLong/etc.).Binary.writeTo(arrayOut)works directly with the underlying stream.The trailing
out.flush()ingetBytes()is also dead —CapacityByteArrayOutputStreamdoesn't buffer, and the LE wrapper has no buffer of its own (per the bulk-write change in #3518, thewriteBuffer[]is a per-call scratch).Files affected
parquet-column/src/main/java/org/apache/parquet/column/values/deltastrings/DeltaByteArrayWriter.javaparquet-column/src/main/java/org/apache/parquet/column/values/plain/FixedLenByteArrayPlainValuesWriter.javaNo public API change. No file format change.
Expected impact
BinaryEncodingBenchmark.encodeDeltaByteArray(short strings): +5% to +10% standalone (the per-value copy is one of several overheads; this PR removes one of them; it stacks with GH-3516: Optimize DeltaByteArrayWriter and DeltaLengthByteArrayValuesWriter (+33-55% encodeDeltaByteArray) #3517 which removed the suffix-side allocation).FixedLenByteArrayPlainValuesWriter: code-quality cleanup; removes a per-value layer of dispatch and an unnecessaryflush()call. No headline benchmark.Both changes are small, locally obvious, and align with the broader migration away from
LittleEndianDataOutputStreamalready in flight (PRs #3496, #3518).