Skip to content

feat: use aligned slice access during bulk append in SparkUnsafeArray#4672

Open
sandugood wants to merge 12 commits into
apache:mainfrom
sandugood:feat/enable-slice-access
Open

feat: use aligned slice access during bulk append in SparkUnsafeArray#4672
sandugood wants to merge 12 commits into
apache:mainfrom
sandugood:feat/enable-slice-access

Conversation

@sandugood

@sandugood sandugood commented Jun 18, 2026

Copy link
Copy Markdown

Which issue does this PR close?

Closes #4626

Rationale for this change

Currently, there is a bottleneck in performance during bulk append in the SparkUnsafeArray implementation (in macro and respective Builder types for bool, date32 and timestamp). If the array is NULLABLE there is a hotspot:

  • we check each corresponding element at the index for nullability using Self::is_null_in_bitset() which is suboptimal (see the benchmarking results below)

What changes are included in this PR?

With this PR we change the flow of execution for nullable arrays:

  • reading raw bytes from the null bitmap and flipping them according to Spark vs Arrow logic
  • constructing BooleanBuffer from the bytes and then creating a NullBuffer
  • creating a PrimitiveArray (with type specified in the macro_rules! and inside the according methods in the SparkUnsafeArray for several dtypes) from slice data
  • appending this array to the current builder

How are these changes tested?

All of the basic tests pass.
Here is the benchmarking result (copied from the comment below).
Result from with_nulls version of benches (no_nulls wasnt affected with this PR)

array_type main (current) incoming PR time_reduce
i32 29.274μs 2.4256μs -91.71%
i64 27.260μs 4.7887μs -82.43%
f64 29.352μs 4.6621μs -84.11%
date32 26.863μs 2.5345μs -89.89%
timestamp 45.818μs 4.3659μs -90.47%

@sandugood

Copy link
Copy Markdown
Author

When running CI pipeline, got few UB.
It was caused presumably by the bit_offset and Spark <-> Arrow null bitmask tinkering.

Switched to a less optimal way, which still provided boost in performance. By materializing &[bool] and passing it to builder's append_values()

array_type main (current) incoming PR time_reduce
i32 30.030μs 21.250μs -29.23%
i64 27.572μs 21.996μs -20.22%
f64 30.059μs 21.889μs -27.17%
date32 28.693μs 21.041μs -26.66%
timestamp 46.832μs 21.888μs -53.26%

@sandugood

Copy link
Copy Markdown
Author

Could somebody potentially trigger the CI run? Thanks in advance

@andygrove

Copy link
Copy Markdown
Member

Thanks for picking this up. The shape of the change is exactly what #4626 was asking for: keep the runtime alignment check as load-bearing, and replace the per-element null branch with PrimitiveBuilder::append_values(values, &is_valid). The benchmark numbers look great.

A few things to look at, the first one is a correctness concern.

  1. Boolean nullable path may have UB. In append_booleans_to_builder, the new nullable path casts *const u8 to *const bool and iterates:

    let slice = unsafe { std::slice::from_raw_parts(self.element_offset as *const bool, num_elements) };
    for (idx, &value) in slice.iter().enumerate() { ... }

    Rust requires every bool value to be exactly 0 or 1. For null slots, the underlying byte is JVM-uninitialized memory, so the let &value materialization in the iterator pulls out a bool from arbitrary bytes which is UB even when the value is then ignored. The non-nullable branch a few lines down correctly uses *const u8 plus value != 0. Could you mirror that pattern in the nullable branch? cargo +nightly miri test would give a definitive answer, and it might be related to the UB you mentioned hitting earlier.

  2. Boolean nullable path still branches per-element. Once the cast is fixed, the loop still does a per-element null check, which is the shape perf: use aligned slice access in SparkUnsafeArray bulk append #4626 specifically called out as the bigger win for booleans. BooleanBuilder::append_values(values: &[bool], is_valid: Option<&[bool]>) would absorb the validity in bulk and remove the branch. Worth taking here if straightforward.

  3. Vec<bool> allocation per call. The simpler Vec<bool> form is fine for landing, but perf: use aligned slice access in SparkUnsafeArray bulk append #4626 suggested feeding a BooleanBuffer directly to the builder to skip the bool materialization. If you'd rather defer that, a follow-up issue with the link to the previous discussion would be helpful. At minimum a Vec::with_capacity(num_elements) instead of collect would avoid one growth.

  4. Stale assert message in append_dates_to_builder. The new debug_assert! message reads "append_timestamps: element_offset is null". Copy-paste from the timestamp path, should be append_dates.

  5. Comment phrasing on the alignment fallback. Could you reword the // Note: alignment is not guaranteed - that is why do this comments to point at the existing explanation at list.rs:105 and say why (nested-array variable-length region)? The prior PR was bounced because the description left the runtime check looking redundant, and a clear comment here is the easiest way to keep that from happening again.

  6. PR description. The benchmark numbers in your comment look like the right shape. Could you pull them into the PR description body and mention they came from cargo bench --bench array_element_append?

I'll approve the CI workflows so the matrix actually exercises the change. Out of curiosity, what was the UB you hit earlier? Worth knowing whether item 1 above was the cause or there's a separate hazard.


Disclosure: this review was assisted by an AI tool (Anthropic Claude via Claude Code). I read the diff, the linked issue, and the prior closed PR's review feedback before forming the comments above, and I take responsibility for them.

@sandugood

Copy link
Copy Markdown
Author

Thank you for the review @andygrove

I have changed my approach and bechmarked it.
My current approach is about using arrow internal types and buffers (essentially NullBuffer, from the underlying BooleanBufer, which is constructed from the bitmap, although with bits flipped), without materializing and looping through Vec<> which was still an improvement, but not a drastic one.

Posting the benchmarking results, which I ran using cargo bench --bench array_element_append. Posting result from with_nulls version of benches (no_nulls wasnt affected with this PR)

array_type main (current) incoming PR time_reduce
i32 29.274μs 2.4256μs -91.71%
i64 27.260μs 4.7887μs -82.43%
f64 29.352μs 4.6621μs -84.11%
date32 26.863μs 2.5345μs -89.89%
timestamp 45.818μs 4.3659μs -90.47%

cc @mbutrovich, if you could take a quick look at this, please. Thanks!

@andygrove

Copy link
Copy Markdown
Member

Thanks for the rework @sandugood, and for pulling the benchmark numbers and the cargo bench invocation into the description. The primitive, date, and timestamp paths are implemented the way #4626 was after. Reading the values as a slice and building a NullBuffer from the flipped Spark bitmap is the right shape, and the speedups are great.

One thing I want to call out as a nice catch: threading the timezone through and calling .with_timezone_opt(timezone) is actually load-bearing, not just defensive. PrimitiveBuilder::append_array asserts data_type equality (primitive_builder.rs:291), so without it the timestamp path would panic whenever the column carries a timezone. Good that you handled it.

A few items before this is ready:

1. The boolean nullable path is UB, and this revision moved it the wrong way. This is item 1 from the earlier review. The new nullable branch does:

let slice = unsafe { std::slice::from_raw_parts(self.element_offset as *const bool, num_elements) };
for (idx, &value) in slice.iter().enumerate() { ... }

Rust requires every bool to be exactly 0 or 1. For null slots the JVM byte is not guaranteed to be 0/1, and for (idx, &value) materializes the bool out of each byte even for slots that get skipped, which is UB. The previous version of this branch read *const u8 and did *ptr != 0, and the non-nullable branch right below still does the safe thing:

let values = unsafe { std::slice::from_raw_parts(self.element_offset as *const u8, num_elements) };
for &value in values { builder.append_value(value != 0); }

Could you mirror that *const u8 pattern in the nullable branch (builder.append_value(value != 0))? Since Spark stores booleans as one byte each rather than bit-packed, a per-element loop here is reasonable, so the u8 cast is really all that is needed. This is very likely the UB you hit in CI earlier, since there is a miri workflow. Worth confirming against a nullable boolean array with a null slot once the cast is in.

2. Copy-paste leftover in append_dates_to_builder. The debug_assert! message reads "append_timestamps: element_offset is null" and the comment above says contiguous i64 data, but this is the dates path. Both should say append_dates and i32.

3. A couple of targeted tests would help. The bitmap flip and the partial-byte handling are easy to get subtly wrong. Would you be open to adding unit tests for nullable arrays where num_elements is not a multiple of 8 with a null in the final byte (say 7, 9, 17), plus all-null and all-valid cases? That locks in the trailing-bit behavior, which currently relies on the BooleanBuffer length trimming the flipped padding bits.

4. Formatting. The process_primitive_lists macro in row.rs looks like the indentation drifted. A quick cargo fmt should clean it up.

5. Minor. The byte reinterpretation of the null word assumes little-endian bit ordering. That matches Arrow and every platform we target, so it is fine, but a one-line comment next to the null_words as *const u8 cast noting the little-endian assumption would help the next reader.

The boolean path is the only blocker. Everything else is small. Once that is a u8 read I think this is in good shape.

Disclosure: this review was assisted by an AI tool (Anthropic Claude via Claude Code). I read the diff, the linked issue, and the prior review thread before forming these comments, and I take responsibility for them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: use aligned slice access in SparkUnsafeArray bulk append

2 participants