feat: use aligned slice access during bulk append in SparkUnsafeArray by sandugood · Pull Request #4672 · apache/datafusion-comet

sandugood · 2026-06-18T00:32:55Z

Which issue does this PR close?

Rationale for this change

Currently, there is a bottleneck in performance during bulk append in the SparkUnsafeArray implementation (in macro and respective Builder types for bool, date32 and timestamp). If the array is NULLABLE there is a hotspot:

we check each corresponding element at the index for nullability using Self::is_null_in_bitset() which is suboptimal (see the benchmarking results below)

What changes are included in this PR?

With this PR we change the flow of execution for nullable arrays:

reading raw bytes from the null bitmap and flipping them according to Spark vs Arrow logic
constructing BooleanBuffer from the bytes and then creating a NullBuffer
creating a PrimitiveArray (with type specified in the macro_rules! and inside the according methods in the SparkUnsafeArray for several dtypes) from slice data
appending this array to the current builder

How are these changes tested?

All of the basic tests pass.
Here is the benchmarking result (copied from the comment below).
Result from with_nulls version of benches (no_nulls wasnt affected with this PR)

array_type	main (current)	incoming PR	time_reduce
i32	29.274μs	2.4256μs	-91.71%
i64	27.260μs	4.7887μs	-82.43%
f64	29.352μs	4.6621μs	-84.11%
date32	26.863μs	2.5345μs	-89.89%
timestamp	45.818μs	4.3659μs	-90.47%

…g the CI

…ccess

sandugood · 2026-06-18T13:42:41Z

When running CI pipeline, got few UB.
It was caused presumably by the bit_offset and Spark <-> Arrow null bitmask tinkering.

Switched to a less optimal way, which still provided boost in performance. By materializing &[bool] and passing it to builder's append_values()

array_type	main (current)	incoming PR	time_reduce
i32	30.030μs	21.250μs	-29.23%
i64	27.572μs	21.996μs	-20.22%
f64	30.059μs	21.889μs	-27.17%
date32	28.693μs	21.041μs	-26.66%
timestamp	46.832μs	21.888μs	-53.26%

sandugood · 2026-06-18T13:43:06Z

Could somebody potentially trigger the CI run? Thanks in advance

…ccess

andygrove · 2026-06-18T14:34:29Z

Thanks for picking this up. The shape of the change is exactly what #4626 was asking for: keep the runtime alignment check as load-bearing, and replace the per-element null branch with PrimitiveBuilder::append_values(values, &is_valid). The benchmark numbers look great.

A few things to look at, the first one is a correctness concern.

Boolean nullable path may have UB. In append_booleans_to_builder, the new nullable path casts *const u8 to *const bool and iterates:
```
let slice = unsafe { std::slice::from_raw_parts(self.element_offset as *const bool, num_elements) };
for (idx, &value) in slice.iter().enumerate() { ... }
```
Rust requires every bool value to be exactly 0 or 1. For null slots, the underlying byte is JVM-uninitialized memory, so the let &value materialization in the iterator pulls out a bool from arbitrary bytes which is UB even when the value is then ignored. The non-nullable branch a few lines down correctly uses *const u8 plus value != 0. Could you mirror that pattern in the nullable branch? cargo +nightly miri test would give a definitive answer, and it might be related to the UB you mentioned hitting earlier.
Boolean nullable path still branches per-element. Once the cast is fixed, the loop still does a per-element null check, which is the shape perf: use aligned slice access in SparkUnsafeArray bulk append #4626 specifically called out as the bigger win for booleans. BooleanBuilder::append_values(values: &[bool], is_valid: Option<&[bool]>) would absorb the validity in bulk and remove the branch. Worth taking here if straightforward.
Vec<bool> allocation per call. The simpler Vec<bool> form is fine for landing, but perf: use aligned slice access in SparkUnsafeArray bulk append #4626 suggested feeding a BooleanBuffer directly to the builder to skip the bool materialization. If you'd rather defer that, a follow-up issue with the link to the previous discussion would be helpful. At minimum a Vec::with_capacity(num_elements) instead of collect would avoid one growth.
Stale assert message in append_dates_to_builder. The new debug_assert! message reads "append_timestamps: element_offset is null". Copy-paste from the timestamp path, should be append_dates.
Comment phrasing on the alignment fallback. Could you reword the // Note: alignment is not guaranteed - that is why do this comments to point at the existing explanation at list.rs:105 and say why (nested-array variable-length region)? The prior PR was bounced because the description left the runtime check looking redundant, and a clear comment here is the easiest way to keep that from happening again.
PR description. The benchmark numbers in your comment look like the right shape. Could you pull them into the PR description body and mention they came from cargo bench --bench array_element_append?

I'll approve the CI workflows so the matrix actually exercises the change. Out of curiosity, what was the UB you hit earlier? Worth knowing whether item 1 above was the cause or there's a separate hazard.

Disclosure: this review was assisted by an AI tool (Anthropic Claude via Claude Code). I read the diff, the linked issue, and the prior closed PR's review feedback before forming the comments above, and I take responsibility for them.

…ccess

…Vec<> loops

sandugood · 2026-06-19T10:48:30Z

Thank you for the review @andygrove

I have changed my approach and bechmarked it.
My current approach is about using arrow internal types and buffers (essentially NullBuffer, from the underlying BooleanBufer, which is constructed from the bitmap, although with bits flipped), without materializing and looping through Vec<> which was still an improvement, but not a drastic one.

Posting the benchmarking results, which I ran using cargo bench --bench array_element_append. Posting result from with_nulls version of benches (no_nulls wasnt affected with this PR)

array_type	main (current)	incoming PR	time_reduce
i32	29.274μs	2.4256μs	-91.71%
i64	27.260μs	4.7887μs	-82.43%
f64	29.352μs	4.6621μs	-84.11%
date32	26.863μs	2.5345μs	-89.89%
timestamp	45.818μs	4.3659μs	-90.47%

cc @mbutrovich, if you could take a quick look at this, please. Thanks!

andygrove · 2026-06-25T14:22:45Z

Thanks for the rework @sandugood, and for pulling the benchmark numbers and the cargo bench invocation into the description. The primitive, date, and timestamp paths are implemented the way #4626 was after. Reading the values as a slice and building a NullBuffer from the flipped Spark bitmap is the right shape, and the speedups are great.

One thing I want to call out as a nice catch: threading the timezone through and calling .with_timezone_opt(timezone) is actually load-bearing, not just defensive. PrimitiveBuilder::append_array asserts data_type equality (primitive_builder.rs:291), so without it the timestamp path would panic whenever the column carries a timezone. Good that you handled it.

A few items before this is ready:

1. The boolean nullable path is UB, and this revision moved it the wrong way. This is item 1 from the earlier review. The new nullable branch does:

let slice = unsafe { std::slice::from_raw_parts(self.element_offset as *const bool, num_elements) };
for (idx, &value) in slice.iter().enumerate() { ... }

Rust requires every bool to be exactly 0 or 1. For null slots the JVM byte is not guaranteed to be 0/1, and for (idx, &value) materializes the bool out of each byte even for slots that get skipped, which is UB. The previous version of this branch read *const u8 and did *ptr != 0, and the non-nullable branch right below still does the safe thing:

let values = unsafe { std::slice::from_raw_parts(self.element_offset as *const u8, num_elements) };
for &value in values { builder.append_value(value != 0); }

Could you mirror that *const u8 pattern in the nullable branch (builder.append_value(value != 0))? Since Spark stores booleans as one byte each rather than bit-packed, a per-element loop here is reasonable, so the u8 cast is really all that is needed. This is very likely the UB you hit in CI earlier, since there is a miri workflow. Worth confirming against a nullable boolean array with a null slot once the cast is in.

2. Copy-paste leftover in append_dates_to_builder. The debug_assert! message reads "append_timestamps: element_offset is null" and the comment above says contiguous i64 data, but this is the dates path. Both should say append_dates and i32.

3. A couple of targeted tests would help. The bitmap flip and the partial-byte handling are easy to get subtly wrong. Would you be open to adding unit tests for nullable arrays where num_elements is not a multiple of 8 with a null in the final byte (say 7, 9, 17), plus all-null and all-valid cases? That locks in the trailing-bit behavior, which currently relies on the BooleanBuffer length trimming the flipped padding bits.

4. Formatting. The process_primitive_lists macro in row.rs looks like the indentation drifted. A quick cargo fmt should clean it up.

5. Minor. The byte reinterpretation of the null word assumes little-endian bit ordering. That matches Arrow and every platform we target, so it is fine, but a one-line comment next to the null_words as *const u8 cast noting the little-endian assumption would help the next reader.

The boolean path is the only blocker. Everything else is small. Once that is a u8 read I think this is in good shape.

Disclosure: this review was assisted by an AI tool (Anthropic Claude via Claude Code). I read the diff, the linked issue, and the prior review thread before forming these comments, and I take responsibility for them.

sandugood added 6 commits June 18, 2026 02:14

Builder appending refactoring

58189a1

Formatting, small refactoring and adding docstrings

7778a84

Added docstring regarding alignment

2bee010

Fixed the fallback in an unaligned but NULLABLE case, that was failin…

6abf1f7

…g the CI

Merge remote-tracking branch 'upstream/main' into feat/enable-slice-a…

7439706

…ccess

Changes fixed

74265c7

sandugood added 2 commits June 18, 2026 16:43

Merge remote-tracking branch 'upstream/main' into feat/enable-slice-a…

d2c85e4

…ccess

Fixed cargo formatting

662a7f2

sandugood added 3 commits June 19, 2026 10:40

Merge remote-tracking branch 'upstream/main' into feat/enable-slice-a…

7bf6f15

…ccess

Remade the building process using arrow buffers with no hot-spots in …

157c73f

…Vec<> loops

Formatting fix

0e5fcad

Added docstring about runtime check of alignment

42bd9dd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: use aligned slice access during bulk append in SparkUnsafeArray#4672

feat: use aligned slice access during bulk append in SparkUnsafeArray#4672
sandugood wants to merge 12 commits into
apache:mainfrom
sandugood:feat/enable-slice-access

sandugood commented Jun 18, 2026 •

edited

Loading

Uh oh!

sandugood commented Jun 18, 2026

Uh oh!

sandugood commented Jun 18, 2026

Uh oh!

andygrove commented Jun 18, 2026

Uh oh!

sandugood commented Jun 19, 2026

Uh oh!

andygrove commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

sandugood commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

sandugood commented Jun 18, 2026

Uh oh!

sandugood commented Jun 18, 2026

Uh oh!

andygrove commented Jun 18, 2026

Uh oh!

sandugood commented Jun 19, 2026

Uh oh!

andygrove commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sandugood commented Jun 18, 2026 •

edited

Loading