
feat(parquet): fuse level encoding passes and compact level representation #9653

Draft
HippoBaro wants to merge 3 commits into apache:main from HippoBaro:faster_sparse_columns_encoding

@HippoBaro HippoBaro commented Apr 2, 2026

Which issue does this PR close?

Rationale for this change

See issue for details. The Parquet column writer currently does per-value work during level encoding regardless of data sparsity, even though the output encoding (RLE) is proportional to the number of runs.

What changes are included in this PR?

Three incremental commits, each building on the previous:

  1. Fuse level encoding with counting and histogram updates. write_mini_batch() previously made three separate passes over each level array: count non-nulls, update the level histogram, and RLE-encode. Now all three happen in a single pass via an observer callback on LevelEncoder. When the RLE encoder enters accumulation mode, the loop scans ahead for the full run length and batches the observer call. This makes counting and histogram updates O(1) per run.

  2. Batch consecutive null/empty rows in write_list. Consecutive null or empty list entries are now collapsed into a single visit_leaves() call that bulk-extends all leaf level buffers, instead of one tree traversal per null row. Mirrors the approach already used by write_struct().

  3. Short-circuit entirely-null columns. When every element in an array is null, skip Vec<i16> level-buffer materialization entirely and store a compact (def_value, rep_value, count) tuple. The writer encodes this via RleEncoder::put_n() in O(1) amortized time, bypassing the normal mini-batch loop.
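
The fused pass from item 1 can be pictured roughly as follows. This is a minimal sketch, not the PR's code: `for_each_run` and `count_and_histogram` are hypothetical names, and the sketch only reports runs to an observer without performing any RLE encoding. It assumes levels arrive as a plain `&[i16]` slice.

```rust
/// Hypothetical helper: walk `levels` run by run and call the observer
/// once per run with (level, run length).
fn for_each_run(levels: &[i16], mut observer: impl FnMut(i16, usize)) {
    let mut i = 0;
    while i < levels.len() {
        let level = levels[i];
        // Scan ahead to the end of the current run of identical levels.
        let run_len = levels[i..].iter().take_while(|&&l| l == level).count();
        observer(level, run_len);
        i += run_len;
    }
}

/// Counts the levels equal to `max_def` (values to write) and fills a
/// level histogram. The observer does O(1) work per run, not per level.
fn count_and_histogram(levels: &[i16], max_def: i16) -> (usize, Vec<usize>) {
    let mut values_to_write = 0usize;
    let mut histogram = vec![0usize; max_def as usize + 1];
    for_each_run(levels, |level, count| {
        if level == max_def {
            values_to_write += count;
        }
        histogram[level as usize] += count;
    });
    (values_to_write, histogram)
}
```

For a 99% null column the observer fires only a handful of times per batch, which is where the per-run (rather than per-value) cost of counting and histogram updates comes from.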

Are these changes tested?

All tests passing. I added some benchmarks to exercise the heavily sparse (99% null) and all-null code paths, alongside the existing 25% sparseness benchmarks:

Name                                 Before      After      Delta
primitive_all_null/default           37.5 ms     0.20 ms    (−99.5%)
primitive_all_null/zstd              37.1 ms     0.30 ms    (−99.2%)
primitive_sparse_99pct_null/default  42.5 ms     15.7 ms    (−62.9%)
primitive_sparse_99pct_null/p2       42.4 ms     15.9 ms    (−62.4%)
list_prim_sparse_99pct_null/default  40.8 ms     11.2 ms    (−72.4%)
list_prim_sparse_99pct_null/p2       40.8 ms     10.7 ms    (−73.8%)
bool/default                         12.7 ms     10.3 ms    (−18.7%)
primitive/default                   124.1 ms    104.6 ms    (−15.6%)
string_and_binary_view/default       46.3 ms     41.6 ms    (−10.1%)
list_primitive/default              253.9 ms    235.3 ms    (−7.4%)
string_dictionary/default            46.2 ms     43.8 ms    (−5.3%)

Non-nullable column benchmarks are within noise, as expected since they have no definition levels to optimize.

Are there any user-facing changes?

None.

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Apr 2, 2026
@HippoBaro HippoBaro force-pushed the faster_sparse_columns_encoding branch from 335fb81 to 44dae05 Compare April 2, 2026 05:05
@HippoBaro
Contributor Author

This is a continuation of the work done in #9447 to improve runtime performance around sparse and/or highly uniform columns. As such this may be of interest to @alamb and @etseidl.

5a1d3d7 adds three benchmarks that exercise the code path this series optimizes. I created a PR (#9654) to merge those separately if needed so the benchmark bot can have a baseline to compare against.

Thanks!

Comment thread parquet/src/encodings/rle.rs Outdated
@etseidl
Contributor

etseidl commented Apr 2, 2026

Thanks @HippoBaro, this looks impressive. I'm still looking, but haven't found any obvious problems yet.

Gads, every time I delve this deep into parquet I go a little mad 😵‍💫. I think the RLE encoder could use a little refactoring/comment improvements to make the flow a little more obvious. Not as part of this PR though.

Contributor

@etseidl etseidl left a comment

Flushing a few comments. More tomorrow.

Comment thread parquet/src/arrow/arrow_writer/levels.rs Outdated
Comment thread parquet/src/arrow/arrow_writer/levels.rs Outdated
Comment thread parquet/src/column/writer/mod.rs Outdated
let mut values_to_write = 0usize;
let max_def = self.descr.max_def_level();
self.def_levels_encoder
.put_with_observer(levels, |level, count| {
Contributor

❤️ When I added the histograms I wasn't happy with the redundancy here. Nice fix!

Comment thread parquet/src/encodings/levels.rs Outdated
Comment thread parquet/src/arrow/arrow_writer/levels.rs Outdated
Comment thread parquet/src/arrow/arrow_writer/levels.rs Outdated
@HippoBaro HippoBaro force-pushed the faster_sparse_columns_encoding branch from 7902e69 to c891c35 Compare April 8, 2026 21:16
@HippoBaro
Contributor Author

HippoBaro commented Apr 8, 2026

Thanks for the reviews! I've reworked the branch to address all feedback. Sorry for the delay, it took me a while to experiment.

The main structural change is a LevelData enum refactor suggested by @jhorstmann. Thank you for the excellent suggestion. As I am primarily concerned with the performance of very sparse data, I hadn't considered the possibility of also speeding up the non-null-but-uniform code path.

The Option<Vec<i16>> + uniform_levels: Option<(i16, i16, usize)> tuple is replaced by a single enum:

  enum LevelData {
      Absent,
      Materialized(Vec<i16>),
      Uniform { value: i16, count: usize },
  }

Absent replaces the previous None case, Uniform captures any column whose levels are a single repeated value (all-null, or nullable with no nulls), and Materialized is the normal vec path. This unifies the three states into one type and makes transitions between them easy to follow. This yields a nice performance improvement documented in ab9a7bc.
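
To illustrate why a single enum is convenient for consumers, here is a minimal sketch built around the enum above. The `len` and `count_equal` methods are hypothetical helpers added for illustration; they are not claimed to exist in the PR.

```rust
#[derive(Debug, PartialEq)]
enum LevelData {
    Absent,
    Materialized(Vec<i16>),
    Uniform { value: i16, count: usize },
}

impl LevelData {
    /// Hypothetical helper: number of levels, regardless of representation.
    fn len(&self) -> usize {
        match self {
            LevelData::Absent => 0,
            LevelData::Materialized(v) => v.len(),
            LevelData::Uniform { count, .. } => *count,
        }
    }

    /// Hypothetical helper: how many levels equal `target`.
    /// O(1) for `Uniform`, O(n) only when a buffer was materialized.
    fn count_equal(&self, target: i16) -> usize {
        match self {
            LevelData::Absent => 0,
            LevelData::Materialized(v) => v.iter().filter(|&&l| l == target).count(),
            LevelData::Uniform { value, count } => {
                if *value == target {
                    *count
                } else {
                    0
                }
            }
        }
    }
}
```

A million-row all-null column is then just `LevelData::Uniform { value: 0, count: 1_000_000 }`, with no `Vec<i16>` allocated at all.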

The resulting refactor has a larger LoC footprint, but the API is arguably much cleaner and more robust.

Also, rebased as per #9656 (review)

@etseidl
Contributor

etseidl commented Apr 8, 2026

Thanks @HippoBaro. I'll try to make some time to review the changes. Probably not today but hopefully tomorrow... 🤞

@HippoBaro HippoBaro changed the title feat(parquet): fuse level encoding passes and batch null runs in column writer feat(parquet): fuse level encoding passes and compact level representation Apr 8, 2026
@HippoBaro HippoBaro requested review from etseidl and jhorstmann April 8, 2026 22:29
@alamb

This comment has been minimized.

@adriangbot

This comment has been minimized.

@HippoBaro
Copy link
Copy Markdown
Contributor Author

@alamb The above results will include only parts of the benchmarks this code improves on. The rest are in #9679

alamb pushed a commit that referenced this pull request Apr 9, 2026
# Which issue does this PR close?

- None, but relates to #9653

# Rationale for this change

#9653 introduces optimizations related to non-null uniform workloads.
This adds benchmarks so we can quantify them.

# What changes are included in this PR?

Add three new benchmark cases to the arrow_writer benchmark suite for
evaluating write performance on struct columns at varying null
densities:

* `struct_non_null`: a nullable struct with 0% null rows and
non-nullable primitive children;
* `struct_sparse_99pct_null`: a nullable struct with 99% null rows,
exercising null batching through one level of struct nesting;
* `struct_all_null`: a nullable struct with 100% null rows, exercising
the uniform-null path through struct nesting.

Baseline results (Apple M1 Max):
```
  struct_non_null/default              29.9 ms
  struct_non_null/parquet_2            38.2 ms
  struct_non_null/zstd_parquet_2       50.9 ms
  struct_sparse_99pct_null/default      7.2 ms
  struct_sparse_99pct_null/parquet_2    7.3 ms
  struct_sparse_99pct_null/zstd_p2      8.1 ms
  struct_all_null/default              83.3 µs
  struct_all_null/parquet_2            82.5 µs
  struct_all_null/zstd_parquet_2      106.6 µs
```

# Are these changes tested?

N/A

# Are there any user-facing changes?

None

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@alamb
Contributor

alamb commented Apr 9, 2026

@alamb The above results will include only parts of the benchmarks this code improves on. The rest are in #9679

I merged it in and merged up from main and will rerun the benchmarks

@alamb

This comment has been minimized.

@adriangbot

This comment has been minimized.

@adriangbot

This comment has been minimized.

@HippoBaro
Contributor Author

I am surprised by the few regressions above, such as:

string_dictionary/parquet_2                        1.55     85.9±0.30ms     3.0 GB/sec    1.00     55.4±0.18ms     4.7 GB/sec

I can't reproduce these locally. I get:

string_dictionary/parquet_2
                        time:   [53.024 ms 53.574 ms 54.565 ms]
                        thrpt:  [4.7271 GiB/s 4.8146 GiB/s 4.8646 GiB/s]
                 change:
                        time:   [−3.0644% −1.9407% −0.1309%] (p = 0.01 < 0.05)
                        thrpt:  [+0.1311% +1.9791% +3.1613%]
                        Change within noise threshold.

Are these known to be noisy?

@etseidl
Contributor

etseidl commented Apr 9, 2026

I am surprised by the few regressions above, such as:


string_dictionary/parquet_2                        1.55     85.9±0.30ms     3.0 GB/sec    1.00     55.4±0.18ms     4.7 GB/sec

I can't reproduce these locally. I get:

Are these known to be noisy?

Yes. They are extremely twitchy. I always take them with a grain of salt or ten. 😅

@etseidl
Contributor

etseidl commented Apr 9, 2026

I've now run multiple passes of the arrow_writer bench on my workstation and there appear to be no regressions due to this PR. And the speed ups are quite impressive 😄

Details
group                                              levels                                 main
-----                                              ------                                 ----
bool/bloom_filter                                  1.00     12.9±0.09ms    19.4 MB/sec    1.00     12.8±0.12ms    19.5 MB/sec
bool/default                                       1.00      8.5±0.06ms    29.3 MB/sec    1.00      8.5±0.09ms    29.3 MB/sec
bool/parquet_2                                     1.01     11.2±0.18ms    22.2 MB/sec    1.00     11.1±0.17ms    22.5 MB/sec
bool/zstd                                          1.00      9.0±0.10ms    27.8 MB/sec    1.00      9.0±0.10ms    27.9 MB/sec
bool/zstd_parquet_2                                1.01     11.5±0.08ms    21.7 MB/sec    1.00     11.4±0.10ms    21.9 MB/sec
bool_non_null/bloom_filter                         1.02      8.6±0.04ms    14.6 MB/sec    1.00      8.4±0.03ms    14.8 MB/sec
bool_non_null/default                              1.05      2.9±0.01ms    42.4 MB/sec    1.00      2.8±0.04ms    44.4 MB/sec
bool_non_null/parquet_2                            1.02      6.2±0.04ms    20.1 MB/sec    1.00      6.1±0.03ms    20.6 MB/sec
bool_non_null/zstd                                 1.05      3.3±0.04ms    38.2 MB/sec    1.00      3.1±0.06ms    40.1 MB/sec
bool_non_null/zstd_parquet_2                       1.02      6.5±0.06ms    19.1 MB/sec    1.00      6.4±0.04ms    19.5 MB/sec
float_with_nans/bloom_filter                       1.00     81.2±0.69ms   172.4 MB/sec    1.08     87.7±0.42ms   159.7 MB/sec
float_with_nans/default                            1.00     58.0±0.86ms   241.4 MB/sec    1.08     62.8±0.28ms   222.9 MB/sec
float_with_nans/parquet_2                          1.00     71.6±1.10ms   195.6 MB/sec    1.07     76.9±0.49ms   182.2 MB/sec
float_with_nans/zstd                               1.00     88.6±0.36ms   158.0 MB/sec    1.07     94.6±0.36ms   148.0 MB/sec
float_with_nans/zstd_parquet_2                     1.00    101.4±0.80ms   138.1 MB/sec    1.06    107.9±0.96ms   129.7 MB/sec
list_primitive/bloom_filter                        1.06    319.5±1.83ms  1707.2 MB/sec    1.00    302.6±2.73ms  1802.0 MB/sec
list_primitive/default                             1.07    260.7±1.76ms     2.0 GB/sec    1.00    242.8±1.50ms     2.2 GB/sec
list_primitive/parquet_2                           1.00    257.0±1.68ms     2.1 GB/sec    1.00    257.5±3.19ms     2.1 GB/sec
list_primitive/zstd                                1.01    390.4±2.65ms  1397.1 MB/sec    1.00    388.3±3.31ms  1404.6 MB/sec
list_primitive/zstd_parquet_2                      1.03    387.2±2.82ms  1408.4 MB/sec    1.00    374.4±4.46ms  1456.7 MB/sec
list_primitive_non_null/bloom_filter               1.00    354.2±6.61ms  1536.5 MB/sec    1.02    360.1±4.36ms  1511.5 MB/sec
list_primitive_non_null/default                    1.00    262.5±7.11ms     2.0 GB/sec    1.01    265.3±5.08ms     2.0 GB/sec
list_primitive_non_null/parquet_2                  1.00    264.3±4.69ms     2.0 GB/sec    1.07    283.5±7.82ms  1919.6 MB/sec
list_primitive_non_null/zstd                       1.01   527.5±10.36ms  1031.7 MB/sec    1.00   520.9±19.26ms  1044.7 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    510.5±7.07ms  1066.1 MB/sec    1.00   509.9±13.27ms  1067.4 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.00      9.2±0.06ms     4.0 GB/sec    3.15     29.0±0.24ms  1288.8 MB/sec
list_primitive_sparse_99pct_null/default           1.00      8.7±0.08ms     4.2 GB/sec    3.30     28.6±0.64ms  1304.7 MB/sec
list_primitive_sparse_99pct_null/parquet_2         1.00      8.7±0.07ms     4.2 GB/sec    3.28     28.5±0.40ms  1310.8 MB/sec
list_primitive_sparse_99pct_null/zstd              1.00     10.3±0.10ms     3.5 GB/sec    2.91     29.9±0.21ms  1248.5 MB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.00      8.8±0.10ms     4.1 GB/sec    3.22     28.4±0.25ms  1315.2 MB/sec
primitive/bloom_filter                             1.00    128.9±0.80ms   348.2 MB/sec    1.02    132.0±1.05ms   339.9 MB/sec
primitive/default                                  1.00     84.9±1.59ms   528.8 MB/sec    1.02     86.7±0.67ms   517.5 MB/sec
primitive/parquet_2                                1.00     94.6±1.36ms   474.4 MB/sec    1.02     96.9±0.76ms   463.2 MB/sec
primitive/zstd                                     1.00    104.0±0.78ms   431.6 MB/sec    1.03    107.2±1.27ms   418.5 MB/sec
primitive/zstd_parquet_2                           1.00    117.0±1.62ms   383.4 MB/sec    1.03    120.0±0.74ms   373.9 MB/sec
primitive_all_null/bloom_filter                    1.00   1058.5±6.49µs    41.4 GB/sec    18.25    19.3±0.10ms     2.3 GB/sec
primitive_all_null/default                         1.00    198.3±1.38µs   221.0 GB/sec    92.92    18.4±0.06ms     2.4 GB/sec
primitive_all_null/parquet_2                       1.00    200.9±1.97µs   218.2 GB/sec    91.94    18.5±0.09ms     2.4 GB/sec
primitive_all_null/zstd                            1.00    341.9±1.60µs   128.2 GB/sec    54.27    18.6±0.07ms     2.4 GB/sec
primitive_all_null/zstd_parquet_2                  1.00    317.2±1.37µs   138.2 GB/sec    58.48    18.5±0.08ms     2.4 GB/sec  
primitive_non_null/bloom_filter                    1.00     94.8±1.16ms   464.1 MB/sec    1.10    103.9±0.44ms   423.5 MB/sec
primitive_non_null/default                         1.00     38.5±0.22ms  1141.6 MB/sec    1.16     44.8±0.22ms   982.8 MB/sec
primitive_non_null/parquet_2                       1.00     52.7±0.51ms   834.4 MB/sec    1.13     59.4±1.01ms   740.3 MB/sec
primitive_non_null/zstd                            1.00     59.2±0.37ms   743.6 MB/sec    1.13     66.8±0.62ms   658.6 MB/sec
primitive_non_null/zstd_parquet_2                  1.00     76.0±0.98ms   579.1 MB/sec    1.11     84.1±1.49ms   523.2 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.00     12.9±0.27ms     3.4 GB/sec    2.23     28.8±0.70ms  1557.2 MB/sec
primitive_sparse_99pct_null/default                1.00     11.3±1.85ms     3.9 GB/sec    2.35     26.6±0.32ms  1686.3 MB/sec
primitive_sparse_99pct_null/parquet_2              1.00     11.6±1.71ms     3.8 GB/sec    2.30     26.8±0.28ms  1672.7 MB/sec
primitive_sparse_99pct_null/zstd                   1.00     13.8±0.14ms     3.2 GB/sec    2.13     29.4±0.29ms  1528.3 MB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.00     12.4±0.06ms     3.5 GB/sec    2.27     28.1±0.28ms  1595.2 MB/sec
string/bloom_filter                                1.00   169.3±11.30ms     3.0 GB/sec    1.05   178.1±13.47ms     2.9 GB/sec
string/default                                     1.05   121.8±12.92ms     4.2 GB/sec    1.00    116.3±3.32ms     4.4 GB/sec
string/parquet_2                                   1.03    120.8±6.66ms     4.2 GB/sec    1.00    117.6±1.10ms     4.4 GB/sec
string/zstd                                        1.00    308.2±4.22ms  1701.1 MB/sec    1.03   317.4±13.62ms  1651.6 MB/sec
string/zstd_parquet_2                              1.01    287.9±2.18ms  1821.2 MB/sec    1.00    284.1±1.61ms  1845.6 MB/sec
string_and_binary_view/bloom_filter                1.00     48.8±0.29ms   661.4 MB/sec    1.01     49.3±0.35ms   654.5 MB/sec
string_and_binary_view/default                     1.00     34.6±0.27ms   932.2 MB/sec    1.00     34.5±0.32ms   934.9 MB/sec
string_and_binary_view/parquet_2                   1.01     43.9±0.28ms   734.1 MB/sec    1.00     43.7±0.31ms   738.4 MB/sec
string_and_binary_view/zstd                        1.00     61.1±0.34ms   528.0 MB/sec    1.00     61.3±1.04ms   526.1 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     53.6±0.63ms   601.6 MB/sec    1.00     53.6±0.58ms   602.2 MB/sec
string_dictionary/bloom_filter                     1.00     76.4±0.63ms     3.4 GB/sec    1.42    108.4±0.44ms     2.4 GB/sec
string_dictionary/default                          1.00     51.7±0.24ms     5.0 GB/sec    1.58     81.8±0.34ms     3.2 GB/sec
string_dictionary/parquet_2                        1.00     55.5±0.66ms     4.6 GB/sec    1.50     83.5±0.55ms     3.1 GB/sec
string_dictionary/zstd                             1.00    150.0±1.17ms  1760.5 MB/sec    1.08    162.3±7.72ms  1627.8 MB/sec
string_dictionary/zstd_parquet_2                   1.00    142.7±0.88ms  1850.4 MB/sec    1.00    142.7±1.09ms  1850.5 MB/sec
string_non_null/bloom_filter                       1.00    191.4±1.91ms     2.7 GB/sec    1.09    208.4±8.39ms     2.5 GB/sec
string_non_null/default                            1.00    126.2±1.83ms     4.1 GB/sec    1.13    142.0±7.93ms     3.6 GB/sec
string_non_null/parquet_2                          1.00    137.1±2.30ms     3.7 GB/sec    1.00    137.7±1.85ms     3.7 GB/sec
string_non_null/zstd                               1.00    378.5±1.99ms  1384.4 MB/sec    1.06    400.3±7.49ms  1309.0 MB/sec
string_non_null/zstd_parquet_2                     1.00    359.4±2.26ms  1458.0 MB/sec    1.04    372.0±7.03ms  1408.5 MB/sec
struct_all_null/bloom_filter                       1.00    452.8±3.14µs    34.8 GB/sec    17.39     7.9±0.04ms  2047.7 MB/sec
struct_all_null/default                            1.00     85.5±0.63µs   184.1 GB/sec    87.80     7.5±0.04ms     2.1 GB/sec
struct_all_null/parquet_2                          1.00     86.5±1.38µs   182.0 GB/sec    86.71     7.5±0.03ms     2.1 GB/sec
struct_all_null/zstd                               1.00    146.8±1.12µs   107.3 GB/sec    51.77     7.6±0.09ms     2.1 GB/sec
struct_all_null/zstd_parquet_2                     1.00    136.4±1.14µs   115.4 GB/sec    55.50     7.6±0.06ms     2.1 GB/sec
struct_non_null/bloom_filter                       1.00     41.0±0.59ms   390.6 MB/sec    1.29     53.0±0.27ms   301.8 MB/sec
struct_non_null/default                            1.00     17.7±0.12ms   901.8 MB/sec    1.59     28.2±0.16ms   567.4 MB/sec
struct_non_null/parquet_2                          1.00     23.3±0.13ms   686.6 MB/sec    1.46     34.1±0.20ms   469.3 MB/sec
struct_non_null/zstd                               1.00     24.3±0.13ms   658.0 MB/sec    1.44     35.1±0.22ms   455.8 MB/sec
struct_non_null/zstd_parquet_2                     1.00     33.6±0.19ms   476.6 MB/sec    1.31     44.1±0.46ms   363.0 MB/sec
struct_sparse_99pct_null/bloom_filter              1.00      5.9±0.04ms     2.7 GB/sec    2.11     12.4±0.15ms  1303.5 MB/sec
struct_sparse_99pct_null/default                   1.00      5.0±0.04ms     3.2 GB/sec    2.32     11.6±0.11ms  1393.1 MB/sec
struct_sparse_99pct_null/parquet_2                 1.00      5.0±0.03ms     3.2 GB/sec    2.32     11.6±0.13ms  1393.7 MB/sec
struct_sparse_99pct_null/zstd                      1.00      6.2±0.04ms     2.6 GB/sec    2.07     12.8±0.19ms  1264.3 MB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      5.6±0.03ms     2.8 GB/sec    2.16     12.1±0.13ms  1330.1 MB/sec

@etseidl
Contributor

etseidl commented Apr 9, 2026

@kszucs do you have time to look at this PR? It touches on your CDC code.

@alamb
Contributor

alamb commented Apr 13, 2026

I am hoping to review this tomorrow

@github-actions github-actions Bot added the arrow Changes to the arrow crate label May 10, 2026
@HippoBaro HippoBaro force-pushed the faster_sparse_columns_encoding branch from fdc9bbe to 8252f31 Compare May 10, 2026 06:53
alamb pushed a commit that referenced this pull request May 14, 2026
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Spun off from #9653
- Contributes to #9731

# Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

See #9731

# What changes are included in this PR?

When an entire list, struct, fixed-size list, or leaf array is null,
skip per-row iteration and emit bulk uniform def/rep levels via
`extend_uniform_levels` in O(1).

# Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?

If this PR claims a performance improvement, please include evidence
such as benchmark results.
-->

All tests passing + additional all null unit tests.

# Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->

None.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
alamb pushed a commit that referenced this pull request May 20, 2026
## Which issue does this PR close?

- Contributes to #9731.

## AI assistance

Implementation drafted with AI assistance and iterated against the
benchmarks below. I've reviewed and own the code, including the gate
threshold which I picked from the sweep in [Threshold
(`BULK_FILL_MIN_LEN`)](#threshold-bulk_fill_min_len). Per the project's
[CONTRIBUTING guidance on AI-generated
submissions](https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md#ai-generated-submissions).

## Rationale for this change

When writing a nullable leaf (primitive) Arrow array, `write_leaf`
builds the definition-level buffer one element at a time, mapping each
null bit to a level. For columns that are mostly null this does
~`num_rows` of branchy work and allocates a `num_rows`-element level
buffer even though almost every produced level is the same value. #9954
adds an O(1) fast path for the *entirely* null case; this PR covers the
*sparse* (mostly-but-not-entirely null) case it doesn't handle, the
literal subject of #9731 ("a column that is 99% null … ~100x more work
than necessary").

## What changes are included in this PR?

A single popcount pass over the null mask
(`Buffer::count_set_bits_offset`, O(`num_rows`/64)) counts the valid
values in the range. When the slice is majority-null, the
definition-level buffer is bulk-filled with the null level (a vectorized
`Vec::resize` memset) and only the non-null positions (from
`NullBuffer::valid_indices()`) are overwritten. The existing per-row
path is kept for non-majority-null slices, so balanced and null-light
columns are unaffected. Both branches share the same `let range_nulls =
nulls.slice(range.start, len)` slicing idiom; the slow path uses
`range_nulls.iter()` for the def-level map and
`range_nulls.valid_indices().map(|i| i + range.start)` for
`non_null_indices`, with no `unsafe`. Output is byte-identical: the
level *values* are unchanged, just produced via memset+scatter (fast
path) or via the high-level `NullBuffer` iterators (slow path) instead
of a manual `BitIndexIterator` walk.
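
The memset+scatter idea can be sketched in plain Rust as below. This is an illustration under stated assumptions, not the PR's code: a `&[bool]` validity slice stands in for the Arrow null mask, and both function names are hypothetical. A real bitmask iterator would visit only the set bits, whereas this sketch still scans every entry of the slice.

```rust
/// Fast-path shape: fill every slot with the null level in one shot,
/// then overwrite only the positions that hold a value.
/// `validity[i] == true` means row `i` is non-null.
fn def_levels_bulk_fill(validity: &[bool], null_level: i16, valid_level: i16) -> Vec<i16> {
    // Bulk fill (the "memset" step).
    let mut levels = vec![null_level; validity.len()];
    // Scatter step: patch the non-null positions.
    for (i, &is_valid) in validity.iter().enumerate() {
        if is_valid {
            levels[i] = valid_level;
        }
    }
    levels
}

/// Slow-path shape: map each validity bit to a level, one row at a time.
fn def_levels_per_row(validity: &[bool], null_level: i16, valid_level: i16) -> Vec<i16> {
    validity
        .iter()
        .map(|&v| if v { valid_level } else { null_level })
        .collect()
}
```

Both functions produce the same level values, which mirrors the claim above that the output is unchanged and only the way it is produced differs.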

## Threshold (`BULK_FILL_MIN_LEN`)

The bulk-fill fast path is gated on two conditions:

- `len >= BULK_FILL_MIN_LEN` (currently 64). Per-call
slice/popcount/iterator overhead only amortizes on sizable sub-ranges.
List/struct paths call `write_leaf` many times with tiny ranges (avg
list length 1-5); paying any per-call popcount there would regress them.
A threshold sweep at T = {0, 16, 32, 64, 128, 256} on Ryzen 9 9950X
shows the regression floor settles by T=32, and the choice of 64 gives
~12x margin over the average list length without losing the
flat-primitive wins.
- `nulls.null_count() * 2 >= nulls.len()`. The cached `null_count()` is
O(1), so this check is free. We use the buffer-wide density as a
heuristic for the sub-range; for full-array writes (the primary target,
flat primitive columns) it's exact.
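
Taken together, the two conditions amount to a check like the following. This is a minimal sketch: the constant name and value come from the description above, while `use_bulk_fill` and its parameter names are hypothetical.

```rust
/// Minimum sub-range length for the fast path, as described above.
const BULK_FILL_MIN_LEN: usize = 64;

/// Hypothetical gate: `range_len` is the length of the sub-range being
/// written; `null_count` and `total_len` describe the whole null buffer.
fn use_bulk_fill(range_len: usize, null_count: usize, total_len: usize) -> bool {
    range_len >= BULK_FILL_MIN_LEN && null_count * 2 >= total_len
}
```

With this shape, a 5-element list slice never takes the fast path regardless of density, and a 25% null flat column never takes it regardless of length.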

Even when the gate skips the fast path, evaluating it across
high-frequency call sites (~10K calls in some list benchmarks) is a
small structural cost (~1-2% on list-sparse cases). The wins on the
targeted shapes (-35% sparse-primitive, -66% all-null primitive) far
outweigh that. Reducing the cost further would require hoisting the
decision into the caller.

## Are these changes tested?

Existing tests cover this path: `cargo test -p parquet --features arrow
--lib arrow_writer` is green (136 tests, full of nulls and roundtrips);
full `cargo test -p parquet --features arrow` green modulo the
pre-existing `PARQUET_TEST_DATA` submodule failures (unrelated, same on
`main`). `cargo clippy -p parquet --features arrow --lib` and `cargo fmt
--check` clean. The `unsafe get_unchecked_mut` flagged in the original
revision was replaced via `NullBuffer::valid_indices()`; the slow-path
also dropped its `unsafe value_unchecked` for the same reason.

## Are there any user-facing changes?

None.

## Benchmarks

`cargo bench -p parquet --bench arrow_writer`, 1M rows × 7 nullable
primitive columns, local Ryzen 9 9950X:

```
primitive_sparse_99pct_null/default   11.88 ms -> 9.13 ms   (-23%)   <- the case #9731 calls out
primitive_all_null/default             5.65 ms -> 2.33 ms   (-59%)   (subsumed by #9954's O(1) path if that lands first)
struct_sparse_99pct_null/default       5.67 ms -> 5.32 ms   (-6%)
struct_all_null/default                1.52 ms -> 1.31 ms   (-14%)
list_primitive_sparse_99pct_null, primitive (25% null), primitive_non_null, bool, string:  within noise (no regression)
```

The CI benchmark bot (GKE `c4a-highmem-16`, Neoverse-V2) on the
post-fixup revision shows the same shape with stronger relative wins on
the targeted cases:

```
primitive_all_null/default              2.47x (11.0ms -> 4.4ms)
primitive_sparse_99pct_null/default     1.60x (16.8ms -> 10.5ms)
primitive_all_null/{bloom_filter,cdc,parquet_2,zstd,zstd_parquet_2}    1.38x to 2.48x
primitive_sparse_99pct_null/{...}        1.28x to 1.59x
list_primitive*, list_primitive_sparse_99pct_null*:                    1.00x to 1.01x (within noise)
```

Microbench of the definition-level fill in isolation: 10.3x @ 100%-null,
8.6x @ 99%, 5.2x @ 90%, 1.9x @ 50%, 0.93x @ 10%, 0.81x @ 0%. Crossover ≈
12-15% null, clean win above ~25%; the `>= 50% null` guard is
conservative.

This is the *materialization*-cost half of #9731 (~30% of the 99%-null
write); the *walk*-cost half, a run-length input to the level encoder so
the column writer doesn't even iterate all `num_rows` levels, is the
larger structural change #9653 is heading toward. This PR is
deliberately small and isolated so it lands independently of and rebases
cleanly under that work.

---------

Co-authored-by: Ryan Stewart <noreply@example.com>
}

/// Computes the min and max for the provided values
#[inline]
Contributor

I wonder if you found that inlining actually helps in all these cases? We have found that in some cases inlining actually makes the performance worse (as there are a bunch of optimizations in LLVM that are disabled once the function gets too big).

Contributor Author

I haven't. For these small methods I tend to default to #[inline] out of habit (and that seems to be common practice in this code base AFAICT.) I'll do some benchmarking to test that.

Contributor Author

Based on string/default I can't measure any difference with and without forced inlining. I'm removing it 👍

}
}

/// Zero-allocation iterator over the indices in a [`ValueSelection`].
Contributor

this now has dispatch overhead for each item (the match self) -- we could probably make it even faster by changing the callsite to instantiate different loops based on the different iterators rather than checking each value 🤔

Contributor Author

Hmm, that's great feedback. Let me iterate on that.

Contributor Author

I rebased and updated the code so that the call sites now dispatch once on ValueSelectionRef and then run the appropriate loop. Thanks!
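
For readers following along, the "dispatch once, then loop" shape looks roughly like this. It is a sketch under assumptions: `Selection` is a hypothetical stand-in for `ValueSelectionRef` (whose actual definition is not shown in this thread), and summing values stands in for the real per-value encoding work.

```rust
/// Hypothetical stand-in for the PR's value selection type.
enum Selection<'a> {
    Empty,
    Dense { offset: usize, len: usize },
    Sparse(&'a [usize]),
}

/// Matches on the selection once, then runs a loop specialized for that
/// variant, instead of re-checking the variant for every item.
fn sum_selected(values: &[i64], selection: &Selection<'_>) -> i64 {
    match selection {
        Selection::Empty => 0,
        // Contiguous case: a plain slice loop with no index indirection.
        Selection::Dense { offset, len } => {
            values[*offset..*offset + *len].iter().sum::<i64>()
        }
        // Gather case: follow the explicit index list.
        Selection::Sparse(indices) => indices.iter().map(|&i| values[i]).sum::<i64>(),
    }
}
```

The per-item `match` disappears from the hot loop because each arm owns its own loop.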

alamb pushed a commit that referenced this pull request May 20, 2026
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Spun off from #9653
- Contributes to #9731

# Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

See #9731

# What changes are included in this PR?

Changes the `byte_array` encoder methods (`FallbackEncoder::encode`,
`DictEncoder::encode`, etc.) and all `get_*_array_slice` functions to take
`impl ExactSizeIterator<Item = usize>` instead of `&[usize]`.
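
As a rough sketch of what this signature change enables, the helper below accepts any exact-size iterator of indices, so a caller can pass a plain range for a contiguous selection without first allocating a `Vec<usize>`. The function name `selected_lengths` and its body are hypothetical and only illustrate the parameter type; they are not the actual encoder code.

```rust
/// Hypothetical helper: returns the byte length of each selected value.
/// Taking `impl ExactSizeIterator` keeps `len()` available for pre-sizing
/// while letting callers pass ranges, slice iterators, or other adapters.
fn selected_lengths(
    values: &[&str],
    indices: impl ExactSizeIterator<Item = usize>,
) -> Vec<usize> {
    // `len()` is exact, so the output is allocated once.
    let mut out = Vec::with_capacity(indices.len());
    for i in indices {
        out.push(values[i].len());
    }
    out
}
```

A contiguous selection can then be written as `selected_lengths(&values, 1..3)`, while an existing index slice still works via `indices.iter().copied()`.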

# Are these changes tested?

All tests passing.

# Are there any user-facing changes?

None.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
HippoBaro added 3 commits May 22, 2026 00:07
Introduce `ArrayLevelsView` and use it when writing content-defined
chunks, so CDC chunking can borrow sliced definition levels, repetition
levels, and value selections while keeping the original Arrow leaf
array.

This avoids constructing per-chunk ArrayLevels instances and makes
chunked writes operate on views of the existing level data.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Thread value selection through level building and column writing as
either empty, dense, or sparse instead of always materializing non-null
indices.

This lets dense selections use the regular `write` path with an offset
and length, while sparse selections continue to use `write_gather`. It
avoids unnecessary index allocation for common contiguous cases,
preserves the all-null fast path, and keeps nullable/sparse writes
explicit at the writer boundary.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
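
The empty/dense/sparse split described in the commit message above can be sketched roughly as follows. This is an illustration under assumptions: the `ValueSelection` enum and the `classify` function are hypothetical, and the real level builder derives the selection while computing levels rather than from a `&[bool]` mask.

```rust
/// Hypothetical owned selection type; the PR's actual definition may differ.
#[derive(Debug, PartialEq)]
enum ValueSelection {
    /// Every slot is null: nothing to write.
    Empty,
    /// Non-null values form one contiguous range, so no index list is needed.
    Dense { offset: usize, len: usize },
    /// Non-null values are scattered: fall back to explicit indices.
    Sparse(Vec<usize>),
}

/// Classifies a validity mask (`true` = non-null) into a selection.
fn classify(valid: &[bool]) -> ValueSelection {
    let Some(first) = valid.iter().position(|&v| v) else {
        return ValueSelection::Empty;
    };
    let last = valid
        .iter()
        .rposition(|&v| v)
        .expect("a valid slot exists because `first` was found");
    let count = valid.iter().filter(|&&v| v).count();

    if count == last - first + 1 {
        // No nulls between the first and last valid slot.
        ValueSelection::Dense { offset: first, len: count }
    } else {
        let indices = valid
            .iter()
            .enumerate()
            .filter_map(|(i, &v)| v.then_some(i))
            .collect();
        ValueSelection::Sparse(indices)
    }
}
```

Under this model a dense selection maps to a `write`-style call with an offset and length, and only the sparse case pays for an index allocation and a `write_gather`-style call.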
The dense column-writer path lets primitive arrays avoid materializing
value indices and write contiguous values directly. Byte arrays still
missed most of that win because their generic writer path had to call
`ArrayAccessor::value` for each logical row. For `Utf8`/`Binary` arrays,
each value access re-enters the Arrow array abstraction, loads adjacent
offsets, bounds-checks, and constructs a slice one value at a time.

That overhead is specific to offset-backed byte arrays: unlike primitive
arrays, the physical values are not fixed-width slots where a dense
range maps directly to a typed slice. The data lives in one values
buffer, with per-row boundaries stored in an offsets buffer.

Add a dense byte-array path for `Utf8`, `LargeUtf8`, `Binary`, and
`LargeBinary` that walks those Arrow offsets and values buffers
directly. This preserves the existing sparse/generic path for
null-filtered writes, views, dictionaries, and fixed-size binary, while
making the common non-null contiguous case behave like the optimized
primitive dense path.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
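
A simplified sketch of the dense byte-array idea from the commit message above: walk the offsets and values buffers directly instead of fetching one value at a time through an accessor. The `ByteArrayBuffers` struct and `encode_dense_plain` function are hypothetical models, not the Arrow or Parquet crate APIs; the output layout follows Parquet's PLAIN encoding for `BYTE_ARRAY` (a 4-byte little-endian length followed by the bytes).

```rust
/// Hypothetical, simplified model of an Arrow `Utf8`/`Binary` array:
/// one contiguous values buffer plus `rows + 1` offsets marking boundaries.
struct ByteArrayBuffers<'a> {
    offsets: &'a [i32],
    values: &'a [u8],
}

/// Appends rows `offset..offset + len` to `out`, each as a 4-byte
/// little-endian length followed by the row's bytes.
fn encode_dense_plain(
    array: &ByteArrayBuffers<'_>,
    offset: usize,
    len: usize,
    out: &mut Vec<u8>,
) {
    // Slice the offsets once for the whole range instead of re-entering
    // the array abstraction for every row.
    let window = &array.offsets[offset..=offset + len];
    for pair in window.windows(2) {
        let (start, end) = (pair[0] as usize, pair[1] as usize);
        out.extend_from_slice(&((end - start) as u32).to_le_bytes());
        out.extend_from_slice(&array.values[start..end]);
    }
}
```

Adjacent offsets are read from a single sliding window, so each row costs one length write and one copy; the sparse and generic paths would still go through per-value access.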
@HippoBaro HippoBaro force-pushed the faster_sparse_columns_encoding branch from 8252f31 to 02c964f Compare May 22, 2026 04:44
@alamb
Contributor

alamb commented May 22, 2026

run benchmark arrow_writer

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4213853435-293-hvscn 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing faster_sparse_columns_encoding (02c964f) to 4b80f0e (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete



@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                              faster_sparse_columns_encoding         main
-----                                              ------------------------------         ----
bool/bloom_filter                                  1.00     13.0±0.07ms    19.2 MB/sec    1.01     13.2±0.08ms    18.9 MB/sec
bool/cdc                                           1.00     15.7±0.04ms    15.9 MB/sec    1.03     16.2±0.16ms    15.4 MB/sec
bool/default                                       1.00     10.9±0.03ms    22.9 MB/sec    1.02     11.1±0.06ms    22.4 MB/sec
bool/parquet_2                                     1.00     14.6±0.07ms    17.1 MB/sec    1.02     15.0±0.07ms    16.7 MB/sec
bool/zstd                                          1.00     11.3±0.03ms    22.0 MB/sec    1.03     11.7±0.07ms    21.5 MB/sec
bool/zstd_parquet_2                                1.00     15.0±0.03ms    16.7 MB/sec    1.02     15.4±0.09ms    16.3 MB/sec
bool_non_null/bloom_filter                         1.00      6.8±0.03ms    18.4 MB/sec    1.03      7.0±0.03ms    17.7 MB/sec
bool_non_null/cdc                                  1.00      6.5±0.04ms    19.3 MB/sec    1.07      6.9±0.14ms    18.1 MB/sec
bool_non_null/default                              1.00      4.0±0.03ms    31.0 MB/sec    1.06      4.3±0.02ms    29.2 MB/sec
bool_non_null/parquet_2                            1.00      8.7±0.06ms    14.3 MB/sec    1.04      9.1±0.05ms    13.8 MB/sec
bool_non_null/zstd                                 1.00      4.4±0.03ms    28.5 MB/sec    1.06      4.6±0.02ms    27.0 MB/sec
bool_non_null/zstd_parquet_2                       1.00      9.1±0.04ms    13.7 MB/sec    1.03      9.5±0.04ms    13.2 MB/sec
float_with_nans/bloom_filter                       1.00     92.0±1.75ms   152.2 MB/sec    1.02     94.0±2.35ms   149.0 MB/sec
float_with_nans/cdc                                1.00     79.6±1.18ms   176.0 MB/sec    1.02     81.5±0.22ms   171.7 MB/sec
float_with_nans/default                            1.00     74.5±1.24ms   187.8 MB/sec    1.02     75.7±1.38ms   184.9 MB/sec
float_with_nans/parquet_2                          1.00     93.7±1.78ms   149.5 MB/sec    1.01     94.5±0.48ms   148.2 MB/sec
float_with_nans/zstd                               1.00    110.3±1.04ms   126.9 MB/sec    1.02    112.7±1.01ms   124.2 MB/sec
float_with_nans/zstd_parquet_2                     1.00    133.2±1.49ms   105.1 MB/sec    1.00    133.8±2.23ms   104.6 MB/sec
list_primitive/bloom_filter                        1.11   377.6±14.34ms  1444.3 MB/sec    1.00   340.4±11.68ms  1602.1 MB/sec
list_primitive/cdc                                 1.14    419.6±4.95ms  1299.7 MB/sec    1.00    368.0±5.01ms  1481.8 MB/sec
list_primitive/default                             1.13    292.1±4.91ms  1867.1 MB/sec    1.00    257.7±4.16ms     2.1 GB/sec
list_primitive/parquet_2                           1.12    314.2±3.86ms  1735.8 MB/sec    1.00    280.0±2.63ms  1947.4 MB/sec
list_primitive/zstd                                1.07    546.7±7.06ms   997.6 MB/sec    1.00    509.4±6.40ms  1070.5 MB/sec
list_primitive/zstd_parquet_2                      1.07    537.4±4.04ms  1014.8 MB/sec    1.00    502.3±3.14ms  1085.8 MB/sec
list_primitive_non_null/bloom_filter               1.00   409.9±18.71ms  1327.7 MB/sec    1.04   424.5±15.86ms  1281.9 MB/sec
list_primitive_non_null/cdc                        1.00   417.2±15.50ms  1304.6 MB/sec    1.05   440.0±10.57ms  1236.8 MB/sec
list_primitive_non_null/default                    1.00    267.8±5.61ms  2031.9 MB/sec    1.11   297.7±10.41ms  1827.9 MB/sec
list_primitive_non_null/parquet_2                  1.00   294.7±14.82ms  1846.7 MB/sec    1.04   307.2±14.97ms  1771.8 MB/sec
list_primitive_non_null/zstd                       1.00    696.5±8.15ms   781.4 MB/sec    1.04   721.9±11.61ms   753.9 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    686.1±9.56ms   793.2 MB/sec    1.00    688.1±6.43ms   791.0 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.05     12.4±0.66ms     2.9 GB/sec    1.00     11.8±0.05ms     3.1 GB/sec
list_primitive_sparse_99pct_null/cdc               1.00     23.6±0.37ms  1584.1 MB/sec    1.02     24.0±0.20ms  1554.8 MB/sec
list_primitive_sparse_99pct_null/default           1.00     12.0±0.49ms     3.0 GB/sec    1.00     11.9±0.22ms     3.1 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.00     11.5±0.30ms     3.2 GB/sec    1.01     11.6±0.19ms     3.1 GB/sec
list_primitive_sparse_99pct_null/zstd              1.00     13.2±0.05ms     2.8 GB/sec    1.03     13.6±0.33ms     2.7 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.02     11.8±0.35ms     3.1 GB/sec    1.00     11.6±0.06ms     3.1 GB/sec
primitive/bloom_filter                             1.02    154.0±2.37ms   291.5 MB/sec    1.00    151.1±3.24ms   296.9 MB/sec
primitive/cdc                                      1.00    158.8±2.08ms   282.5 MB/sec    1.03    163.0±1.63ms   275.4 MB/sec
primitive/default                                  1.00    117.4±0.31ms   382.3 MB/sec    1.02    119.8±2.10ms   374.5 MB/sec
primitive/parquet_2                                1.00    133.4±1.92ms   336.4 MB/sec    1.02    135.8±1.39ms   330.5 MB/sec
primitive/zstd                                     1.00    147.8±1.87ms   303.7 MB/sec    1.01    149.9±2.24ms   299.4 MB/sec
primitive/zstd_parquet_2                           1.01    168.8±1.61ms   265.9 MB/sec    1.00    167.6±1.73ms   267.7 MB/sec
primitive_all_null/bloom_filter                    1.00    886.8±3.60µs    49.4 GB/sec    1.02   901.8±27.81µs    48.6 GB/sec
primitive_all_null/cdc                             1.00     18.3±0.20ms     2.4 GB/sec    1.03     18.8±0.32ms     2.3 GB/sec
primitive_all_null/default                         1.00    275.0±1.33µs   159.3 GB/sec    1.00    274.5±1.11µs   159.7 GB/sec
primitive_all_null/parquet_2                       1.00    272.5±0.85µs   160.8 GB/sec    1.02    279.2±1.59µs   156.9 GB/sec
primitive_all_null/zstd                            1.00    384.9±1.55µs   113.8 GB/sec    1.01    389.0±1.23µs   112.7 GB/sec
primitive_all_null/zstd_parquet_2                  1.00    350.7±1.05µs   125.0 GB/sec    1.02    356.4±1.98µs   123.0 GB/sec
primitive_non_null/bloom_filter                    1.00     98.2±0.31ms   448.0 MB/sec    1.08    106.4±1.68ms   413.7 MB/sec
primitive_non_null/cdc                             1.00     81.1±0.37ms   542.5 MB/sec    1.11     89.9±0.57ms   489.4 MB/sec
primitive_non_null/default                         1.00     59.5±0.16ms   739.9 MB/sec    1.13     67.0±0.32ms   657.0 MB/sec
primitive_non_null/parquet_2                       1.00     82.1±0.41ms   535.8 MB/sec    1.08     88.5±0.29ms   497.0 MB/sec
primitive_non_null/zstd                            1.00     91.0±1.29ms   483.7 MB/sec    1.16    105.9±1.58ms   415.4 MB/sec
primitive_non_null/zstd_parquet_2                  1.00    115.3±0.98ms   381.7 MB/sec    1.13    129.8±3.16ms   339.0 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.07     12.8±0.79ms     3.4 GB/sec    1.00     12.0±0.18ms     3.7 GB/sec
primitive_sparse_99pct_null/cdc                    1.00     30.9±0.10ms  1453.6 MB/sec    1.05     32.3±0.44ms  1387.8 MB/sec
primitive_sparse_99pct_null/default                1.00     10.7±0.12ms     4.1 GB/sec    1.04     11.1±0.29ms     3.9 GB/sec
primitive_sparse_99pct_null/parquet_2              1.04     11.0±0.40ms     4.0 GB/sec    1.00     10.5±0.05ms     4.2 GB/sec
primitive_sparse_99pct_null/zstd                   1.00     14.0±0.05ms     3.1 GB/sec    1.01     14.1±0.40ms     3.1 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.02     12.7±0.06ms     3.5 GB/sec    1.00     12.5±0.09ms     3.5 GB/sec
string/bloom_filter                                1.00   221.8±23.69ms     2.3 GB/sec    1.08   240.0±28.62ms     2.1 GB/sec
string/cdc                                         1.00    217.0±6.77ms     2.4 GB/sec    1.04    225.0±7.60ms     2.3 GB/sec
string/default                                     1.09   141.7±21.23ms     3.6 GB/sec    1.00   130.3±23.68ms     3.9 GB/sec
string/parquet_2                                   1.10    124.1±1.15ms     4.1 GB/sec    1.00    113.0±8.05ms     4.5 GB/sec
string/zstd                                        1.05   445.5±20.12ms  1176.9 MB/sec    1.00    422.4±7.49ms  1241.1 MB/sec
string/zstd_parquet_2                              1.00    397.4±3.06ms  1319.2 MB/sec    1.02    404.4±6.39ms  1296.4 MB/sec
string_and_binary_view/bloom_filter                1.03     70.8±2.52ms   455.7 MB/sec    1.00     68.6±1.45ms   470.4 MB/sec
string_and_binary_view/cdc                         1.07     62.8±0.83ms   513.8 MB/sec    1.00     58.8±0.24ms   548.9 MB/sec
string_and_binary_view/default                     1.06     51.2±0.90ms   629.7 MB/sec    1.00     48.4±0.17ms   666.6 MB/sec
string_and_binary_view/parquet_2                   1.01     62.1±1.51ms   519.0 MB/sec    1.00     61.3±1.35ms   526.0 MB/sec
string_and_binary_view/zstd                        1.02     87.5±1.42ms   368.4 MB/sec    1.00     86.0±1.68ms   375.2 MB/sec
string_and_binary_view/zstd_parquet_2              1.01     75.3±0.11ms   428.5 MB/sec    1.00     74.2±1.55ms   434.7 MB/sec
string_dictionary/bloom_filter                     1.00     91.1±5.21ms     2.8 GB/sec    1.05     95.7±6.85ms     2.7 GB/sec
string_dictionary/cdc                              1.00     53.4±1.83ms     4.8 GB/sec    1.04     55.7±3.59ms     4.6 GB/sec
string_dictionary/default                          1.06     52.2±2.28ms     4.9 GB/sec    1.00     49.1±2.04ms     5.3 GB/sec
string_dictionary/parquet_2                        1.00     53.7±0.37ms     4.8 GB/sec    1.04     55.8±1.96ms     4.6 GB/sec
string_dictionary/zstd                             1.00    207.1±2.24ms  1275.4 MB/sec    1.01    209.7±3.70ms  1259.8 MB/sec
string_dictionary/zstd_parquet_2                   1.00    200.6±2.12ms  1316.6 MB/sec    1.00    201.3±1.62ms  1311.8 MB/sec
string_non_null/bloom_filter                       1.00   231.2±12.36ms     2.2 GB/sec    1.19   274.9±23.54ms  1906.4 MB/sec
string_non_null/cdc                                1.00    264.2±9.74ms  1983.3 MB/sec    1.03    271.6±9.12ms  1929.0 MB/sec
string_non_null/default                            1.00    110.8±9.85ms     4.6 GB/sec    1.29   143.2±14.28ms     3.6 GB/sec
string_non_null/parquet_2                          1.00    132.2±8.47ms     3.9 GB/sec    1.15    151.8±4.80ms     3.4 GB/sec
string_non_null/zstd                               1.00    544.7±8.06ms   962.0 MB/sec    1.04   568.4±12.03ms   921.9 MB/sec
string_non_null/zstd_parquet_2                     1.00    512.4±5.88ms  1022.7 MB/sec    1.01    517.3±5.89ms  1012.9 MB/sec
struct_all_null/bloom_filter                       1.00    377.6±6.08µs    41.7 GB/sec    1.00    376.3±5.04µs    41.8 GB/sec
struct_all_null/cdc                                1.01      7.8±0.53ms     2.0 GB/sec    1.00      7.7±0.14ms     2.1 GB/sec
struct_all_null/default                            1.00    117.8±0.55µs   133.6 GB/sec    1.01    119.1±0.47µs   132.3 GB/sec
struct_all_null/parquet_2                          1.00    117.5±0.47µs   134.1 GB/sec    1.03    120.5±0.57µs   130.7 GB/sec
struct_all_null/zstd                               1.00    164.9±0.65µs    95.5 GB/sec    1.01    166.6±1.15µs    94.5 GB/sec
struct_all_null/zstd_parquet_2                     1.00    150.6±0.50µs   104.6 GB/sec    1.02    153.4±0.74µs   102.6 GB/sec
struct_non_null/bloom_filter                       1.00     44.1±1.40ms   362.6 MB/sec    1.06     46.8±1.34ms   341.8 MB/sec
struct_non_null/cdc                                1.00     42.3±0.63ms   378.0 MB/sec    1.08     45.7±0.87ms   349.8 MB/sec
struct_non_null/default                            1.00     29.6±0.65ms   540.2 MB/sec    1.08     32.1±0.14ms   499.2 MB/sec
struct_non_null/parquet_2                          1.00     37.8±0.11ms   423.7 MB/sec    1.07     40.5±0.23ms   394.7 MB/sec
struct_non_null/zstd                               1.00     38.4±0.38ms   416.5 MB/sec    1.07     41.0±0.49ms   390.3 MB/sec
struct_non_null/zstd_parquet_2                     1.00     52.6±0.64ms   304.1 MB/sec    1.06     55.6±0.66ms   287.5 MB/sec
struct_sparse_99pct_null/bloom_filter              1.00      6.4±0.02ms     2.5 GB/sec    1.00      6.4±0.23ms     2.4 GB/sec
struct_sparse_99pct_null/cdc                       1.00     13.8±0.34ms  1170.5 MB/sec    1.05     14.4±0.13ms  1118.8 MB/sec
struct_sparse_99pct_null/default                   1.04      6.1±0.23ms     2.6 GB/sec    1.00      5.9±0.02ms     2.7 GB/sec
struct_sparse_99pct_null/parquet_2                 1.02      6.0±0.04ms     2.6 GB/sec    1.00      5.9±0.02ms     2.7 GB/sec
struct_sparse_99pct_null/zstd                      1.03      7.8±0.04ms     2.0 GB/sec    1.00      7.6±0.13ms     2.1 GB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.02      7.1±0.22ms     2.2 GB/sec    1.00      7.0±0.20ms     2.2 GB/sec

Resource Usage

base (merge-base)

Metric         Value
------         -----
Wall time      1945.4s
Peak memory    6.6 GiB
Avg memory     6.4 GiB
CPU user       1889.7s
CPU sys        54.1s
Peak spill     0 B

branch

Metric         Value
------         -----
Wall time      1920.4s
Peak memory    6.6 GiB
Avg memory     6.3 GiB
CPU user       1870.7s
CPU sys        47.3s
Peak spill     0 B


@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                              faster_sparse_columns_encoding         main
-----                                              ------------------------------         ----
bool/bloom_filter                                  1.00     12.9±0.03ms    19.3 MB/sec    1.01     13.0±0.05ms    19.2 MB/sec
bool/cdc                                           1.00     15.8±0.06ms    15.8 MB/sec    1.02     16.1±0.16ms    15.5 MB/sec
bool/default                                       1.00     11.0±0.06ms    22.8 MB/sec    1.00     10.9±0.04ms    22.9 MB/sec
bool/parquet_2                                     1.00     14.7±0.06ms    17.0 MB/sec    1.01     14.8±0.03ms    16.9 MB/sec
bool/zstd                                          1.00     11.5±0.06ms    21.8 MB/sec    1.00     11.5±0.05ms    21.8 MB/sec
bool/zstd_parquet_2                                1.00     15.0±0.05ms    16.6 MB/sec    1.01     15.1±0.07ms    16.5 MB/sec
bool_non_null/bloom_filter                         1.00      6.8±0.05ms    18.4 MB/sec    1.04      7.1±0.03ms    17.7 MB/sec
bool_non_null/cdc                                  1.00      6.5±0.05ms    19.2 MB/sec    1.07      6.9±0.13ms    18.0 MB/sec
bool_non_null/default                              1.00      4.0±0.02ms    31.1 MB/sec    1.07      4.3±0.02ms    29.2 MB/sec
bool_non_null/parquet_2                            1.00      8.7±0.03ms    14.4 MB/sec    1.04      9.0±0.03ms    13.8 MB/sec
bool_non_null/zstd                                 1.00      4.4±0.03ms    28.6 MB/sec    1.06      4.6±0.02ms    26.9 MB/sec
bool_non_null/zstd_parquet_2                       1.00      9.1±0.04ms    13.7 MB/sec    1.04      9.5±0.05ms    13.2 MB/sec
float_with_nans/bloom_filter                       1.00     93.6±2.71ms   149.5 MB/sec    1.01     94.5±2.51ms   148.1 MB/sec
float_with_nans/cdc                                1.00     79.7±0.80ms   175.6 MB/sec    1.03     81.9±0.92ms   171.0 MB/sec
float_with_nans/default                            1.00     74.0±1.22ms   189.1 MB/sec    1.03     76.1±1.09ms   184.0 MB/sec
float_with_nans/parquet_2                          1.00     93.3±0.41ms   150.0 MB/sec    1.02     94.9±1.35ms   147.5 MB/sec
float_with_nans/zstd                               1.00    110.9±0.66ms   126.3 MB/sec    1.02    112.6±0.86ms   124.3 MB/sec
float_with_nans/zstd_parquet_2                     1.00    132.1±1.90ms   106.0 MB/sec    1.02    134.1±2.17ms   104.4 MB/sec
list_primitive/bloom_filter                        1.10   375.8±11.90ms  1451.1 MB/sec    1.00   341.9±11.81ms  1595.3 MB/sec
list_primitive/cdc                                 1.13    420.4±3.68ms  1297.4 MB/sec    1.00    371.0±5.37ms  1469.8 MB/sec
list_primitive/default                             1.15    295.0±5.98ms  1848.5 MB/sec    1.00    257.6±4.02ms     2.1 GB/sec
list_primitive/parquet_2                           1.13    315.9±2.71ms  1726.5 MB/sec    1.00    279.7±2.43ms  1949.9 MB/sec
list_primitive/zstd                                1.07    543.1±5.73ms  1004.2 MB/sec    1.00    509.4±6.75ms  1070.7 MB/sec
list_primitive/zstd_parquet_2                      1.07    538.9±3.50ms  1012.0 MB/sec    1.00    503.4±3.20ms  1083.4 MB/sec
list_primitive_non_null/bloom_filter               1.00   409.7±19.96ms  1328.3 MB/sec    1.10   449.4±21.64ms  1211.0 MB/sec
list_primitive_non_null/cdc                        1.00   415.0±11.02ms  1311.3 MB/sec    1.07    443.1±9.67ms  1228.4 MB/sec
list_primitive_non_null/default                    1.00    265.5±8.21ms     2.0 GB/sec    1.15    305.8±8.52ms  1779.9 MB/sec
list_primitive_non_null/parquet_2                  1.00   298.6±10.42ms  1822.6 MB/sec    1.04   310.3±15.19ms  1754.0 MB/sec
list_primitive_non_null/zstd                       1.00   680.8±16.17ms   799.4 MB/sec    1.07   727.5±12.29ms   748.1 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    678.0±6.54ms   802.7 MB/sec    1.02    689.8±5.55ms   789.0 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.03     12.4±0.59ms     2.9 GB/sec    1.00     12.1±0.39ms     3.0 GB/sec
list_primitive_sparse_99pct_null/cdc               1.00     22.8±0.51ms  1635.4 MB/sec    1.06     24.1±0.08ms  1550.2 MB/sec
list_primitive_sparse_99pct_null/default           1.00     11.8±0.28ms     3.1 GB/sec    1.00     11.9±0.24ms     3.1 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.00     11.5±0.07ms     3.2 GB/sec    1.01     11.6±0.24ms     3.1 GB/sec
list_primitive_sparse_99pct_null/zstd              1.02     13.6±0.33ms     2.7 GB/sec    1.00     13.3±0.04ms     2.7 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.04     12.4±0.11ms     2.9 GB/sec    1.00     11.9±0.11ms     3.1 GB/sec
primitive/bloom_filter                             1.02    155.3±3.10ms   288.9 MB/sec    1.00    152.5±3.06ms   294.2 MB/sec
primitive/cdc                                      1.00    159.8±2.18ms   280.8 MB/sec    1.02    163.1±2.18ms   275.1 MB/sec
primitive/default                                  1.00    119.0±0.59ms   377.1 MB/sec    1.00    119.3±2.07ms   376.2 MB/sec
primitive/parquet_2                                1.00    133.9±1.84ms   335.0 MB/sec    1.01    134.9±1.30ms   332.7 MB/sec
primitive/zstd                                     1.00    147.7±0.50ms   303.7 MB/sec    1.01    148.7±2.02ms   301.7 MB/sec
primitive/zstd_parquet_2                           1.01    168.4±1.84ms   266.5 MB/sec    1.00    166.8±1.73ms   269.0 MB/sec
primitive_all_null/bloom_filter                    1.00   902.8±26.37µs    48.5 GB/sec    1.04   940.3±50.44µs    46.6 GB/sec
primitive_all_null/cdc                             1.00     18.2±0.25ms     2.4 GB/sec    1.04     18.9±0.35ms     2.3 GB/sec
primitive_all_null/default                         1.00    274.1±1.10µs   159.9 GB/sec    1.00    273.5±1.22µs   160.2 GB/sec
primitive_all_null/parquet_2                       1.00    275.5±1.23µs   159.1 GB/sec    1.01    277.5±1.44µs   157.9 GB/sec
primitive_all_null/zstd                            1.00    387.9±1.24µs   113.0 GB/sec    1.00    387.5±0.83µs   113.1 GB/sec
primitive_all_null/zstd_parquet_2                  1.00    352.8±1.01µs   124.2 GB/sec    1.01    357.7±1.19µs   122.5 GB/sec
primitive_non_null/bloom_filter                    1.00    101.8±1.31ms   432.4 MB/sec    1.07    108.7±1.85ms   404.7 MB/sec
primitive_non_null/cdc                             1.00     81.7±0.49ms   538.5 MB/sec    1.11     90.7±1.43ms   485.2 MB/sec
primitive_non_null/default                         1.00     59.8±0.14ms   735.4 MB/sec    1.13     67.4±0.24ms   652.9 MB/sec
primitive_non_null/parquet_2                       1.00     82.3±0.83ms   534.7 MB/sec    1.09     89.3±0.26ms   492.7 MB/sec
primitive_non_null/zstd                            1.00     91.4±1.20ms   481.2 MB/sec    1.17    107.3±1.43ms   410.0 MB/sec
primitive_non_null/zstd_parquet_2                  1.00    115.6±0.93ms   380.8 MB/sec    1.13    130.3±3.03ms   337.7 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.01     12.7±0.63ms     3.4 GB/sec    1.00     12.6±0.77ms     3.5 GB/sec
primitive_sparse_99pct_null/cdc                    1.00     29.9±0.44ms  1501.8 MB/sec    1.09     32.5±0.31ms  1382.7 MB/sec
primitive_sparse_99pct_null/default                1.00     11.0±0.32ms     4.0 GB/sec    1.00     11.0±0.30ms     4.0 GB/sec
primitive_sparse_99pct_null/parquet_2              1.00     10.8±0.08ms     4.1 GB/sec    1.00     10.8±0.38ms     4.1 GB/sec
primitive_sparse_99pct_null/zstd                   1.05     14.5±0.18ms     3.0 GB/sec    1.00     13.8±0.05ms     3.2 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.05     13.2±0.12ms     3.3 GB/sec    1.00     12.6±0.05ms     3.5 GB/sec
string/bloom_filter                                1.00   222.9±20.96ms     2.3 GB/sec    1.07   239.6±30.09ms     2.1 GB/sec
string/cdc                                         1.00    217.4±4.65ms     2.4 GB/sec    1.03    224.3±7.42ms     2.3 GB/sec
string/default                                     1.10   142.0±23.07ms     3.6 GB/sec    1.00   129.5±22.43ms     4.0 GB/sec
string/parquet_2                                   1.13    127.9±1.64ms     4.0 GB/sec    1.00    112.8±8.25ms     4.5 GB/sec
string/zstd                                        1.05   442.5±19.26ms  1184.8 MB/sec    1.00    423.3±7.82ms  1238.6 MB/sec
string/zstd_parquet_2                              1.00    397.6±3.24ms  1318.4 MB/sec    1.02    405.1±6.01ms  1294.1 MB/sec
string_and_binary_view/bloom_filter                1.00     66.3±0.23ms   486.2 MB/sec    1.04     68.8±1.00ms   469.0 MB/sec
string_and_binary_view/cdc                         1.09     63.6±0.67ms   506.8 MB/sec    1.00     58.4±0.17ms   552.5 MB/sec
string_and_binary_view/default                     1.07     51.7±1.11ms   623.3 MB/sec    1.00     48.2±0.18ms   669.4 MB/sec
string_and_binary_view/parquet_2                   1.01     61.9±0.79ms   521.2 MB/sec    1.00     61.3±1.29ms   526.0 MB/sec
string_and_binary_view/zstd                        1.02     87.6±1.36ms   368.3 MB/sec    1.00     85.5±1.66ms   377.1 MB/sec
string_and_binary_view/zstd_parquet_2              1.03     75.7±0.58ms   425.9 MB/sec    1.00     73.7±1.55ms   437.6 MB/sec
string_dictionary/bloom_filter                     1.00     94.0±7.08ms     2.7 GB/sec    1.01     95.1±5.54ms     2.7 GB/sec
string_dictionary/cdc                              1.05     57.0±3.68ms     4.5 GB/sec    1.00     54.5±3.95ms     4.7 GB/sec
string_dictionary/default                          1.05     51.1±2.33ms     5.1 GB/sec    1.00     48.6±1.71ms     5.3 GB/sec
string_dictionary/parquet_2                        1.00     54.4±1.45ms     4.7 GB/sec    1.02     55.4±2.02ms     4.7 GB/sec
string_dictionary/zstd                             1.00    207.5±1.13ms  1272.7 MB/sec    1.01    209.7±3.60ms  1259.8 MB/sec
string_dictionary/zstd_parquet_2                   1.02    205.3±2.52ms  1286.8 MB/sec    1.00    201.0±1.42ms  1314.4 MB/sec
string_non_null/bloom_filter                       1.00   237.4±14.52ms     2.2 GB/sec    1.15   272.1±22.69ms  1925.9 MB/sec
string_non_null/cdc                                1.00    265.9±9.45ms  1970.9 MB/sec    1.02    272.4±9.09ms  1923.4 MB/sec
string_non_null/default                            1.00    119.2±7.59ms     4.3 GB/sec    1.21   143.7±12.90ms     3.6 GB/sec
string_non_null/parquet_2                          1.00    133.3±6.33ms     3.8 GB/sec    1.14    152.4±4.50ms     3.4 GB/sec
string_non_null/zstd                               1.00    542.6±8.59ms   965.8 MB/sec    1.05   568.1±11.09ms   922.4 MB/sec
string_non_null/zstd_parquet_2                     1.00   511.4±10.10ms  1024.6 MB/sec    1.01    516.7±6.34ms  1014.0 MB/sec
struct_all_null/bloom_filter                       1.00    375.1±1.46µs    42.0 GB/sec    1.02    382.9±5.58µs    41.1 GB/sec
struct_all_null/cdc                                1.00      7.4±0.07ms     2.1 GB/sec    1.02      7.6±0.08ms     2.1 GB/sec
struct_all_null/default                            1.00    118.2±0.37µs   133.2 GB/sec    1.01    119.2±0.30µs   132.1 GB/sec
struct_all_null/parquet_2                          1.00    118.8±0.53µs   132.5 GB/sec    1.02    120.7±0.55µs   130.4 GB/sec
struct_all_null/zstd                               1.01    167.6±0.64µs    94.0 GB/sec    1.00    166.6±0.43µs    94.5 GB/sec
struct_all_null/zstd_parquet_2                     1.00    151.8±0.55µs   103.8 GB/sec    1.01    153.1±0.54µs   102.9 GB/sec
struct_non_null/bloom_filter                       1.00     43.1±0.17ms   371.1 MB/sec    1.11     47.8±1.45ms   334.8 MB/sec
struct_non_null/cdc                                1.00     42.3±0.61ms   378.1 MB/sec    1.10     46.4±0.84ms   344.9 MB/sec
struct_non_null/default                            1.00     29.6±0.61ms   541.0 MB/sec    1.11     32.8±0.99ms   487.2 MB/sec
struct_non_null/parquet_2                          1.00     38.4±0.27ms   416.7 MB/sec    1.07     41.1±0.11ms   389.4 MB/sec
struct_non_null/zstd                               1.00     38.5±0.64ms   415.2 MB/sec    1.08     41.7±0.38ms   383.5 MB/sec
struct_non_null/zstd_parquet_2                     1.00     52.6±0.80ms   304.1 MB/sec    1.07     56.1±0.81ms   285.2 MB/sec
struct_sparse_99pct_null/bloom_filter              1.01      6.5±0.03ms     2.4 GB/sec    1.00      6.5±0.04ms     2.4 GB/sec
struct_sparse_99pct_null/cdc                       1.00     13.7±0.26ms  1177.0 MB/sec    1.07     14.6±0.27ms  1103.2 MB/sec
struct_sparse_99pct_null/default                   1.00      6.0±0.02ms     2.6 GB/sec    1.02      6.1±0.22ms     2.6 GB/sec
struct_sparse_99pct_null/parquet_2                 1.07      6.3±0.05ms     2.5 GB/sec    1.00      5.9±0.02ms     2.7 GB/sec
struct_sparse_99pct_null/zstd                      1.00      7.7±0.20ms     2.0 GB/sec    1.00      7.7±0.04ms     2.1 GB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      6.9±0.27ms     2.3 GB/sec    1.02      7.1±0.14ms     2.2 GB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 1950.4s
Peak memory 6.6 GiB
Avg memory 6.4 GiB
CPU user 1889.9s
CPU sys 55.4s
Peak spill 0 B

branch

Metric Value
Wall time 1925.4s
Peak memory 6.6 GiB
Avg memory 6.4 GiB
CPU user 1878.6s
CPU sys 45.6s
Peak spill 0 B


@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected
Details

group                                              faster_sparse_columns_encoding         main
-----                                              ------------------------------         ----
bool/bloom_filter                                  1.00     13.0±0.05ms    19.2 MB/sec    1.00     13.0±0.05ms    19.2 MB/sec
bool/cdc                                           1.00     15.8±0.12ms    15.9 MB/sec    1.01     16.0±0.07ms    15.7 MB/sec
bool/default                                       1.00     10.9±0.03ms    23.0 MB/sec    1.01     10.9±0.05ms    22.8 MB/sec
bool/parquet_2                                     1.00     14.6±0.05ms    17.1 MB/sec    1.00     14.7±0.04ms    17.0 MB/sec
bool/zstd                                          1.00     11.4±0.04ms    22.0 MB/sec    1.01     11.5±0.04ms    21.8 MB/sec
bool/zstd_parquet_2                                1.00     15.0±0.08ms    16.6 MB/sec    1.00     15.1±0.05ms    16.6 MB/sec
bool_non_null/bloom_filter                         1.00      6.8±0.02ms    18.5 MB/sec    1.04      7.0±0.02ms    17.8 MB/sec
bool_non_null/cdc                                  1.00      6.5±0.06ms    19.2 MB/sec    1.05      6.9±0.03ms    18.2 MB/sec
bool_non_null/default                              1.00      4.0±0.02ms    30.9 MB/sec    1.06      4.3±0.02ms    29.2 MB/sec
bool_non_null/parquet_2                            1.00      8.8±0.04ms    14.3 MB/sec    1.04      9.1±0.04ms    13.8 MB/sec
bool_non_null/zstd                                 1.00      4.4±0.02ms    28.5 MB/sec    1.06      4.6±0.02ms    27.0 MB/sec
bool_non_null/zstd_parquet_2                       1.00      9.1±0.04ms    13.7 MB/sec    1.03      9.4±0.04ms    13.2 MB/sec
float_with_nans/bloom_filter                       1.00     91.2±0.34ms   153.4 MB/sec    1.04     94.6±0.45ms   148.0 MB/sec
float_with_nans/cdc                                1.00     79.3±0.17ms   176.6 MB/sec    1.04     82.3±0.19ms   170.0 MB/sec
float_with_nans/default                            1.00     72.4±0.26ms   193.5 MB/sec    1.04     75.0±0.21ms   186.7 MB/sec
float_with_nans/parquet_2                          1.00     92.8±0.35ms   150.9 MB/sec    1.03     95.5±0.49ms   146.6 MB/sec
float_with_nans/zstd                               1.00    110.2±0.22ms   127.0 MB/sec    1.03    113.4±6.19ms   123.5 MB/sec
float_with_nans/zstd_parquet_2                     1.00    130.0±0.37ms   107.7 MB/sec    1.02    132.7±0.25ms   105.5 MB/sec
list_primitive/bloom_filter                        1.11    382.7±1.96ms  1425.2 MB/sec    1.00    344.7±1.86ms  1582.3 MB/sec
list_primitive/cdc                                 1.13    421.2±2.51ms  1294.7 MB/sec    1.00    371.4±4.58ms  1468.5 MB/sec
list_primitive/default                             1.16    301.5±2.52ms  1808.8 MB/sec    1.00    259.5±2.49ms     2.1 GB/sec
list_primitive/parquet_2                           1.15    320.4±1.49ms  1702.1 MB/sec    1.00    279.5±0.86ms  1951.4 MB/sec
list_primitive/zstd                                1.06   545.1±19.64ms  1000.6 MB/sec    1.00    512.2±2.14ms  1064.8 MB/sec
list_primitive/zstd_parquet_2                      1.07    537.8±2.04ms  1014.1 MB/sec    1.00    502.5±1.19ms  1085.3 MB/sec
list_primitive_non_null/bloom_filter               1.00   397.2±10.10ms  1370.2 MB/sec    1.06    422.2±5.29ms  1289.2 MB/sec
list_primitive_non_null/cdc                        1.00   418.8±29.04ms  1299.5 MB/sec    1.05    440.2±8.50ms  1236.2 MB/sec
list_primitive_non_null/default                    1.00    282.5±1.53ms  1926.6 MB/sec    1.01    286.5±4.43ms  1899.8 MB/sec
list_primitive_non_null/parquet_2                  1.01   305.1±15.33ms  1783.6 MB/sec    1.00    303.3±8.29ms  1794.5 MB/sec
list_primitive_non_null/zstd                       1.00    691.1±6.82ms   787.5 MB/sec    1.02    707.0±6.24ms   769.8 MB/sec
list_primitive_non_null/zstd_parquet_2             1.02    690.9±6.33ms   787.8 MB/sec    1.00    675.4±3.67ms   805.7 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.00     11.9±0.08ms     3.1 GB/sec    1.00     11.9±0.05ms     3.1 GB/sec
list_primitive_sparse_99pct_null/cdc               1.00     23.1±0.16ms  1620.1 MB/sec    1.02     23.5±0.12ms  1588.4 MB/sec
list_primitive_sparse_99pct_null/default           1.02     11.7±0.24ms     3.1 GB/sec    1.00     11.6±0.05ms     3.2 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.01     11.7±0.13ms     3.1 GB/sec    1.00     11.6±0.05ms     3.2 GB/sec
list_primitive_sparse_99pct_null/zstd              1.02     13.7±0.17ms     2.7 GB/sec    1.00     13.4±0.05ms     2.7 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.02     11.9±0.13ms     3.1 GB/sec    1.00     11.7±0.04ms     3.1 GB/sec
primitive/bloom_filter                             1.00    151.1±0.50ms   297.0 MB/sec    1.00    151.8±0.56ms   295.7 MB/sec
primitive/cdc                                      1.00    157.9±0.56ms   284.2 MB/sec    1.01    160.0±0.53ms   280.5 MB/sec
primitive/default                                  1.00    117.9±0.33ms   380.7 MB/sec    1.01    119.5±1.57ms   375.6 MB/sec
primitive/parquet_2                                1.00    132.8±0.33ms   337.8 MB/sec    1.01    134.0±0.46ms   334.8 MB/sec
primitive/zstd                                     1.00    147.8±0.43ms   303.6 MB/sec    1.00    148.1±0.43ms   303.1 MB/sec
primitive/zstd_parquet_2                           1.00    166.4±0.37ms   269.7 MB/sec    1.00    166.4±0.43ms   269.7 MB/sec
primitive_all_null/bloom_filter                    1.01    900.3±5.07µs    48.7 GB/sec    1.00    895.7±2.83µs    48.9 GB/sec
primitive_all_null/cdc                             1.00     18.0±0.13ms     2.4 GB/sec    1.04     18.8±0.32ms     2.3 GB/sec
primitive_all_null/default                         1.00    272.9±0.89µs   160.6 GB/sec    1.00    273.6±0.75µs   160.2 GB/sec
primitive_all_null/parquet_2                       1.00    273.1±0.73µs   160.5 GB/sec    1.02    277.5±1.17µs   157.9 GB/sec
primitive_all_null/zstd                            1.00    387.6±1.00µs   113.1 GB/sec    1.00    387.5±0.90µs   113.1 GB/sec
primitive_all_null/zstd_parquet_2                  1.00    350.4±1.07µs   125.1 GB/sec    1.02    356.0±1.34µs   123.1 GB/sec
primitive_non_null/bloom_filter                    1.00    100.5±0.22ms   437.8 MB/sec    1.07    107.3±0.40ms   409.9 MB/sec
primitive_non_null/cdc                             1.00     81.6±0.24ms   539.3 MB/sec    1.10     90.1±0.35ms   488.4 MB/sec
primitive_non_null/default                         1.00     60.1±0.13ms   731.7 MB/sec    1.12     67.6±0.23ms   650.5 MB/sec
primitive_non_null/parquet_2                       1.00     81.7±0.15ms   538.5 MB/sec    1.09     89.4±0.34ms   492.1 MB/sec
primitive_non_null/zstd                            1.00     90.9±0.11ms   484.2 MB/sec    1.15    104.8±0.96ms   419.8 MB/sec
primitive_non_null/zstd_parquet_2                  1.00    115.5±0.15ms   381.0 MB/sec    1.12    129.0±2.54ms   341.1 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.03     12.4±0.19ms     3.5 GB/sec    1.00     12.0±0.10ms     3.6 GB/sec
primitive_sparse_99pct_null/cdc                    1.00     29.7±0.20ms  1510.9 MB/sec    1.07     31.8±0.31ms  1409.7 MB/sec
primitive_sparse_99pct_null/default                1.02     10.8±0.08ms     4.1 GB/sec    1.00     10.6±0.04ms     4.1 GB/sec
primitive_sparse_99pct_null/parquet_2              1.02     10.8±0.10ms     4.1 GB/sec    1.00     10.6±0.05ms     4.1 GB/sec
primitive_sparse_99pct_null/zstd                   1.01     14.1±0.09ms     3.1 GB/sec    1.00     13.9±0.07ms     3.2 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.01     12.7±0.09ms     3.5 GB/sec    1.00     12.5±0.05ms     3.5 GB/sec
string/bloom_filter                                1.00   218.5±16.29ms     2.3 GB/sec    1.04   227.1±21.78ms     2.3 GB/sec
string/cdc                                         1.00    217.6±4.99ms     2.4 GB/sec    1.03    223.2±4.96ms     2.3 GB/sec
string/default                                     1.03   135.2±21.33ms     3.8 GB/sec    1.00   131.2±22.32ms     3.9 GB/sec
string/parquet_2                                   1.12    126.2±0.92ms     4.1 GB/sec    1.00    112.6±6.67ms     4.5 GB/sec
string/zstd                                        1.05   440.1±19.42ms  1191.2 MB/sec    1.00    419.0±2.03ms  1251.3 MB/sec
string/zstd_parquet_2                              1.00    395.7±1.04ms  1324.7 MB/sec    1.02    403.9±7.02ms  1298.1 MB/sec
string_and_binary_view/bloom_filter                1.04     68.0±0.25ms   474.1 MB/sec    1.00     65.5±0.26ms   492.1 MB/sec
string_and_binary_view/cdc                         1.05     62.4±0.15ms   516.7 MB/sec    1.00     59.2±0.18ms   544.9 MB/sec
string_and_binary_view/default                     1.05     50.8±0.11ms   634.8 MB/sec    1.00     48.5±0.16ms   664.9 MB/sec
string_and_binary_view/parquet_2                   1.04     62.0±0.26ms   520.2 MB/sec    1.00     59.7±0.21ms   540.4 MB/sec
string_and_binary_view/zstd                        1.03     87.6±0.30ms   368.0 MB/sec    1.00     85.1±0.14ms   378.8 MB/sec
string_and_binary_view/zstd_parquet_2              1.03     75.8±0.15ms   425.7 MB/sec    1.00     73.6±0.17ms   438.5 MB/sec
string_dictionary/bloom_filter                     1.00     90.9±0.53ms     2.8 GB/sec    1.06     96.4±6.77ms     2.7 GB/sec
string_dictionary/cdc                              1.07     60.3±4.73ms     4.3 GB/sec    1.00     56.3±0.65ms     4.6 GB/sec
string_dictionary/default                          1.00     47.2±0.74ms     5.5 GB/sec    1.01     47.6±0.98ms     5.4 GB/sec
string_dictionary/parquet_2                        1.00     54.7±0.34ms     4.7 GB/sec    1.01     55.2±0.48ms     4.7 GB/sec
string_dictionary/zstd                             1.00    207.7±1.07ms  1271.7 MB/sec    1.02    212.0±2.21ms  1245.7 MB/sec
string_dictionary/zstd_parquet_2                   1.00    199.3±0.16ms  1325.4 MB/sec    1.01    200.5±0.34ms  1317.6 MB/sec
string_non_null/bloom_filter                       1.00   229.8±14.08ms     2.2 GB/sec    1.17   268.3±14.36ms  1952.7 MB/sec
string_non_null/cdc                                1.00   273.2±14.07ms  1917.9 MB/sec    1.01    277.1±8.67ms  1891.1 MB/sec
string_non_null/default                            1.00   115.3±14.14ms     4.4 GB/sec    1.29   148.9±13.30ms     3.4 GB/sec
string_non_null/parquet_2                          1.42    202.7±3.20ms     2.5 GB/sec    1.00    142.8±6.84ms     3.6 GB/sec
string_non_null/zstd                               1.03   555.5±26.40ms   943.4 MB/sec    1.00    537.7±2.63ms   974.5 MB/sec
string_non_null/zstd_parquet_2                     1.03    518.7±8.97ms  1010.2 MB/sec    1.00    505.6±1.19ms  1036.3 MB/sec
struct_all_null/bloom_filter                       1.00    374.6±1.37µs    42.0 GB/sec    1.01    376.8±1.85µs    41.8 GB/sec
struct_all_null/cdc                                1.00      7.5±0.12ms     2.1 GB/sec    1.03      7.7±0.12ms     2.1 GB/sec
struct_all_null/default                            1.00    118.8±0.37µs   132.5 GB/sec    1.00    118.3±0.34µs   133.1 GB/sec
struct_all_null/parquet_2                          1.00    118.9±0.33µs   132.4 GB/sec    1.00    119.3±0.48µs   132.0 GB/sec
struct_all_null/zstd                               1.00    166.3±4.84µs    94.7 GB/sec    1.00    166.0±0.42µs    94.9 GB/sec
struct_all_null/zstd_parquet_2                     1.00    152.0±0.50µs   103.6 GB/sec    1.01    153.1±0.62µs   102.9 GB/sec
struct_non_null/bloom_filter                       1.00     43.2±0.12ms   370.4 MB/sec    1.07     46.1±0.16ms   347.1 MB/sec
struct_non_null/cdc                                1.00     42.3±0.14ms   378.6 MB/sec    1.10     46.3±1.62ms   345.2 MB/sec
struct_non_null/default                            1.00     29.4±0.12ms   544.8 MB/sec    1.09     32.0±0.12ms   500.6 MB/sec
struct_non_null/parquet_2                          1.00     38.5±0.21ms   415.8 MB/sec    1.06     40.7±0.12ms   392.9 MB/sec
struct_non_null/zstd                               1.00     38.2±0.09ms   419.4 MB/sec    1.07     40.6±0.09ms   393.6 MB/sec
struct_non_null/zstd_parquet_2                     1.00     52.3±0.22ms   305.8 MB/sec    1.04     54.7±0.12ms   292.7 MB/sec
struct_sparse_99pct_null/bloom_filter              1.04      6.7±0.08ms     2.4 GB/sec    1.00      6.4±0.03ms     2.4 GB/sec
struct_sparse_99pct_null/cdc                       1.00     14.1±0.11ms  1142.8 MB/sec    1.04     14.6±0.12ms  1103.1 MB/sec
struct_sparse_99pct_null/default                   1.03      6.1±0.05ms     2.6 GB/sec    1.00      5.9±0.02ms     2.7 GB/sec
struct_sparse_99pct_null/parquet_2                 1.03      6.1±0.06ms     2.6 GB/sec    1.00      5.9±0.02ms     2.7 GB/sec
struct_sparse_99pct_null/zstd                      1.03      7.5±0.05ms     2.1 GB/sec    1.00      7.3±0.04ms     2.2 GB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      6.9±0.08ms     2.3 GB/sec    1.00      6.9±0.04ms     2.3 GB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 1940.4s
Peak memory 6.6 GiB
Avg memory 6.4 GiB
CPU user 1889.5s
CPU sys 45.8s
Peak spill 0 B

branch

Metric Value
Wall time 1940.4s
Peak memory 6.6 GiB
Avg memory 6.3 GiB
CPU user 1869.3s
CPU sys 69.2s
Peak spill 0 B


@HippoBaro
Contributor Author

The above regressions to CDC-related benchmarks are real. I'll look into it. Sorry!

Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
# Which issue does this PR close?

- None, but relates to apache#9653

# Rationale for this change

apache#9653 introduces optimizations related to non-null uniform workloads.
This adds benchmarks so we can quantify them.

# What changes are included in this PR?

Add three new benchmark cases to the arrow_writer benchmark suite for
evaluating write performance on struct columns at varying null
densities:

* `struct_non_null`: a nullable struct with 0% null rows and
non-nullable primitive children;
* `struct_sparse_99pct_null`: a nullable struct with 99% null rows,
exercising null batching through one level of struct nesting;
* `struct_all_null`: a nullable struct with 100% null rows, exercising
the uniform-null path through struct nesting.

Baseline results (Apple M1 Max):
```
  struct_non_null/default              29.9 ms
  struct_non_null/parquet_2            38.2 ms
  struct_non_null/zstd_parquet_2       50.9 ms
  struct_sparse_99pct_null/default      7.2 ms
  struct_sparse_99pct_null/parquet_2    7.3 ms
  struct_sparse_99pct_null/zstd_p2      8.1 ms
  struct_all_null/default              83.3 µs
  struct_all_null/parquet_2            82.5 µs
  struct_all_null/zstd_parquet_2      106.6 µs
```

# Are these changes tested?

N/A

# Are there any user-facing changes?

None

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
…pache#9751)

# Which issue does this PR close?

- Spun off from apache#9653
- Contributes to apache#9731

# Rationale for this change

The literal `8` appeared in two distinct roles throughout `RleEncoder`,
`RleDecoder`, and their tests.

# What changes are included in this PR?

Replacing each with a named constant makes the intent explicit and
prevents the two meanings from being confused.

* `BIT_PACK_GROUP_SIZE = 8`: The Parquet RLE/bit-packing hybrid format
always bit-packs values in multiples of this count (spec: "we always
bit-pack a multiple of 8 values at a time"). Every occurrence related to
the staging buffer size, the repeat-count threshold that triggers the
RLE decision, and the group-count arithmetic in bit-packed headers now
uses this name.

* `u8::BITS` (= 8, from std): Used wherever a bit-count is divided by 8
to obtain a byte-count (e.g. `ceil(bit_width, u8::BITS as usize)`). This
is a bits-per-byte conversion, a fundamentally different concept from
the packing-group size.

No behaviour change.
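
To illustrate why the two meanings of `8` are worth separating, here is a minimal sketch of the size arithmetic. Only `BIT_PACK_GROUP_SIZE` and `u8::BITS` come from the description above; the helper functions `bit_packed_header`, `bit_packed_run_bytes`, and `rle_value_bytes` are hypothetical names chosen for this example and are not the crate's API.

```rust
/// Values per bit-packed group in the Parquet RLE/bit-packing hybrid format.
const BIT_PACK_GROUP_SIZE: usize = 8;

/// Hypothetical helper: the (pre-varint) header of a bit-packed run.
/// The header stores the number of groups, not the number of values.
fn bit_packed_header(num_values: usize) -> usize {
    let groups = num_values.div_ceil(BIT_PACK_GROUP_SIZE);
    (groups << 1) | 1
}

/// Hypothetical helper: payload bytes of a bit-packed run once the value
/// count is rounded up to a whole number of groups.
fn bit_packed_run_bytes(num_values: usize, bit_width: usize) -> usize {
    let groups = num_values.div_ceil(BIT_PACK_GROUP_SIZE);
    let total_bits = groups * BIT_PACK_GROUP_SIZE * bit_width;
    // Bits-to-bytes conversion: this 8 is `u8::BITS`, not the group size.
    total_bits.div_ceil(u8::BITS as usize)
}

/// Hypothetical helper: bytes used by the repeated value of an RLE run.
fn rle_value_bytes(bit_width: usize) -> usize {
    bit_width.div_ceil(u8::BITS as usize)
}
```

For example, 10 values at a bit width of 3 round up to two groups, so the payload is 2 × 8 × 3 = 48 bits, or 6 bytes. The first 8 in that product is the group size and the divisor is bits per byte; they only happen to be equal.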

# Are these changes tested?

All tests passing.

# Are there any user-facing changes?

None.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
…che#9752)

# Which issue does this PR close?

- Spun off from apache#9653
- Contributes to apache#9731

# Rationale for this change

See apache#9731

# What changes are included in this PR?

Restructure `write_list()` to accumulate consecutive null and empty rows
and flush them in a single `visit_leaves()` call using
`extend(repeat_n(...))`, instead of calling `visit_leaves()` per row.

With sparse data (99% nulls), a 4096-row batch previously triggered
~4000 individual tree traversals, each pushing a single value per leaf.
Now consecutive null/empty runs are collapsed into one traversal that
extends all leaf level buffers in bulk.

This follows the same pattern already used by `write_struct()`. The
`write_non_null_slice` path is unchanged since each non-null row has
different offsets and cannot be batched.
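
The batching idea can be sketched roughly as follows. This is a simplified stand-in, not the actual `write_list()` code: the `Leaf` type, `write_rows`, and `flush_nulls` are hypothetical, and non-null rows are reduced to a single placeholder level instead of a real child traversal.

```rust
use std::iter::repeat_n;

/// Hypothetical stand-in for one leaf column's level buffers.
#[derive(Default)]
struct Leaf {
    def_levels: Vec<i16>,
    rep_levels: Vec<i16>,
}

/// Extends every leaf once for a whole run of `count` null rows.
fn flush_nulls(leaves: &mut [Leaf], null_def: i16, count: usize) {
    for leaf in leaves.iter_mut() {
        leaf.def_levels.extend(repeat_n(null_def, count));
        leaf.rep_levels.extend(repeat_n(0i16, count));
    }
}

/// Writes levels for all rows and returns how many bulk flushes were needed.
/// `valid[i] == false` marks row `i` as null.
fn write_rows(valid: &[bool], null_def: i16, leaves: &mut [Leaf]) -> usize {
    let mut flushes = 0;
    let mut pending_nulls = 0usize;
    for &is_valid in valid {
        if !is_valid {
            // Defer the work: just remember that one more null row is pending.
            pending_nulls += 1;
            continue;
        }
        if pending_nulls > 0 {
            flush_nulls(leaves, null_def, pending_nulls);
            flushes += 1;
            pending_nulls = 0;
        }
        // Placeholder for the non-null path, which is not batched.
        for leaf in leaves.iter_mut() {
            leaf.def_levels.push(null_def + 1);
            leaf.rep_levels.push(0);
        }
    }
    if pending_nulls > 0 {
        flush_nulls(leaves, null_def, pending_nulls);
        flushes += 1;
    }
    flushes
}
```

In this sketch the number of leaf traversals for null rows equals the number of null runs rather than the number of null rows, which is where the saving on sparse data comes from.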

# Are these changes tested?

All tests passing; existing tests give 100% coverage.

# Are there any user-facing changes?

N/A

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
apache#9795)

# Which issue does this PR close?

- Spun off from apache#9653
- Contributes to apache#9731

# Rationale for this change

See apache#9731

# What changes are included in this PR?

Add `put_with_observer()` to `LevelEncoder` that calls an `FnMut(i16,
usize)` observer for each value during encoding. This allows callers to
piggyback counting and histogram updates into the encoding pass without
extra iterations over the level buffer.

Previously, `write_mini_batch()` made 3 separate passes over each level
array: one to count non-null values or row boundaries, one to update the
level histogram, and one to RLE-encode. Now all three operations happen
in a single pass via the observer closure.

Replace `LevelHistogram::update_from_levels()` with a new
`LevelHistogram::increment_by()` that accepts a count, and remove the
now-unnecessary `update_definition_level_histogram()` and
`update_repetition_level_histogram()` methods from PageMetrics.
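A minimal sketch of the observer pattern described above, under stated assumptions: `ToyLevelEncoder` is a hypothetical toy that records `(value, run_length)` pairs instead of producing real RLE bytes, and `encode_and_count` is an illustrative caller. Only the `FnMut(i16, usize)` observer shape is taken from the description.

```rust
/// Toy encoder (hypothetical): stores runs instead of real RLE output.
#[derive(Default)]
struct ToyLevelEncoder {
    runs: Vec<(i16, usize)>,
}

impl ToyLevelEncoder {
    fn put(&mut self, level: i16) {
        if let Some(last) = self.runs.last_mut() {
            if last.0 == level {
                last.1 += 1;
                return;
            }
        }
        self.runs.push((level, 1));
    }

    /// Encodes `levels` and reports each value to `observer` in the same pass.
    fn put_with_observer<F: FnMut(i16, usize)>(&mut self, levels: &[i16], mut observer: F) {
        for &level in levels {
            self.put(level);
            observer(level, 1);
        }
    }
}

/// Gathers the non-null count and the level histogram while encoding,
/// so the level buffer is only walked once.
fn encode_and_count(levels: &[i16], max_def: i16) -> (usize, Vec<u64>, ToyLevelEncoder) {
    let mut encoder = ToyLevelEncoder::default();
    let mut histogram = vec![0u64; max_def as usize + 1];
    let mut non_null = 0usize;
    encoder.put_with_observer(levels, |level, count| {
        histogram[level as usize] += count as u64;
        if level == max_def {
            non_null += count;
        }
    });
    (non_null, histogram, encoder)
}
```

Passing a count to the observer, even though it is always 1 here, leaves room for a later version to report whole runs at once.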

# Are these changes tested?

All tests passing; existing tests give 100% coverage.

# Are there any user-facing changes?

None

---------

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
…#9830)

# Which issue does this PR close?

- Spun off from apache#9653
- Contributes to apache#9731

# Rationale for this change

See apache#9731

# What changes are included in this PR?

Add `is_accumulating_rle()` and `extend_run()` methods to `RleEncoder`
that allow callers to detect when the encoder is in RLE accumulation
mode and bulk-extend runs without per-element overhead.

Upgrade `put_with_observer()` in `LevelEncoder` to exploit this: after
each `put()`, it checks whether the encoder entered accumulation mode.
If so, it scans ahead for the rest of the run, calls `extend_run()` to
batch it in O(1), and fires the observer once with the full run length.

This turns the previous O(n) per-value encoding + observation into O(1)
amortized per RLE run, which is a significant improvement for sparse
columns where long runs of identical levels are common.
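The scan-ahead step might look roughly like the sketch below. `ToyRleEncoder` is a hypothetical model, and the threshold of 8 repeats before it reports accumulation mode is an assumption made for illustration; only the method names `is_accumulating_rle()` and `extend_run()` come from the description.

```rust
/// Assumed number of repeats before the toy encoder enters accumulation mode.
const RLE_THRESHOLD: usize = 8;

/// Hypothetical model of an RLE encoder that only tracks the current run.
#[derive(Default)]
struct ToyRleEncoder {
    current: Option<i16>,
    repeat_count: usize,
    total: usize,
}

impl ToyRleEncoder {
    fn put(&mut self, value: i16) {
        if self.current == Some(value) {
            self.repeat_count += 1;
        } else {
            self.current = Some(value);
            self.repeat_count = 1;
        }
        self.total += 1;
    }

    fn is_accumulating_rle(&self) -> bool {
        self.repeat_count >= RLE_THRESHOLD
    }

    /// Extends the current run by `n` values without per-element work.
    fn extend_run(&mut self, n: usize) {
        self.repeat_count += n;
        self.total += n;
    }
}

/// After each `put`, checks for accumulation mode; if active, scans ahead
/// for the rest of the run and reports it to the observer in one call.
fn put_with_observer<F: FnMut(i16, usize)>(
    encoder: &mut ToyRleEncoder,
    levels: &[i16],
    mut observer: F,
) {
    let mut i = 0;
    while i < levels.len() {
        let value = levels[i];
        encoder.put(value);
        i += 1;
        if encoder.is_accumulating_rle() {
            let extra = levels[i..].iter().take_while(|&&x| x == value).count();
            encoder.extend_run(extra);
            observer(value, 1 + extra);
            i += extra;
        } else {
            observer(value, 1);
        }
    }
}

/// Returns (observer calls, values observed, values encoded).
fn observe(levels: &[i16]) -> (usize, usize, usize) {
    let mut encoder = ToyRleEncoder::default();
    let (mut calls, mut seen) = (0usize, 0usize);
    put_with_observer(&mut encoder, levels, |_, n| {
        calls += 1;
        seen += n;
    });
    (calls, seen, encoder.total)
}
```

With 1000 identical levels, this toy fires the observer once per value during the warmup and once for the remainder of the run, while alternating levels never enter accumulation mode and are observed one at a time.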

# Are these changes tested?

All tests passing, plus added coverage around the RLE accumulation-mode trigger.

# Are there any user-facing changes?

None.

---------

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
…tch (apache#9831)

# Which issue does this PR close?

- Spun off from apache#9653
- Contributes to apache#9731

# Rationale for this change

See apache#9731

# What changes are included in this PR?

Represent definition and repetition levels as `LevelData`/`LevelDataRef`
with `Absent`, `Materialized`, and `Uniform` variants, and thread this
through Arrow level generation, CDC chunking, and the generic column
writer.

Uniform level runs, such as required fields and all-null pages, can now
be encoded without materializing dense `Vec<i16>` buffers. Add bulk run
support to `LevelEncoder`/`RleEncoder` so repeated levels are encoded in
amortized O(1) after the RLE warmup, while preserving histogram, row
count, null count, page splitting, and CDC chunk accounting.
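As a rough illustration of the three-variant representation, the sketch below uses the variant names from the description; the field layout and the `len`, `slice`, and `count_eq` helpers are assumptions, not the crate's actual API.

```rust
/// Sketch of a compact level representation (field layout is assumed).
#[derive(Debug, Clone, PartialEq)]
enum LevelData {
    /// The column has no levels of this kind.
    Absent,
    /// One level per value, stored densely.
    Materialized(Vec<i16>),
    /// `count` copies of the same level, with no buffer behind it.
    Uniform { value: i16, count: usize },
}

impl LevelData {
    fn len(&self) -> usize {
        match self {
            LevelData::Absent => 0,
            LevelData::Materialized(levels) => levels.len(),
            LevelData::Uniform { count, .. } => *count,
        }
    }

    /// Slicing a uniform run only adjusts the count; nothing is copied.
    fn slice(&self, offset: usize, len: usize) -> LevelData {
        match self {
            LevelData::Absent => LevelData::Absent,
            LevelData::Materialized(levels) => {
                LevelData::Materialized(levels[offset..offset + len].to_vec())
            }
            LevelData::Uniform { value, count } => {
                assert!(offset + len <= *count, "slice out of bounds");
                LevelData::Uniform { value: *value, count: len }
            }
        }
    }

    /// Counts occurrences of `level`, e.g. for a histogram update.
    /// Constant time for the uniform case.
    fn count_eq(&self, level: i16) -> usize {
        match self {
            LevelData::Absent => 0,
            LevelData::Materialized(levels) => levels.iter().filter(|&&l| l == level).count(),
            LevelData::Uniform { value, count } => {
                if *value == level { *count } else { 0 }
            }
        }
    }
}
```

In this sketch a million-row all-null page is just `Uniform { value: 0, count: 1_000_000 }`, so slicing and histogram accounting do not depend on the row count.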

# Are these changes tested?

All tests passing. Coverage exercises bulk RLE level encoding,
compact/uniform `LevelData` slicing and writer roundtrips across Parquet
v1/v2, and CDC/Arrow writer behavior including all-null and nested-level
cases.

# Are there any user-facing changes?

None.

---------

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
# Which issue does this PR close?

- Spun off from apache#9653
- Contributes to apache#9731

# Rationale for this change

See apache#9731

# What changes are included in this PR?

When an entire list, struct, fixed-size list, or leaf array is null,
skip per-row iteration and emit bulk uniform def/rep levels via
`extend_uniform_levels` in O(1).

# Are these changes tested?

All tests passing, plus additional all-null unit tests.

# Are there any user-facing changes?

None.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
## Which issue does this PR close?

- Contributes to apache#9731.

## AI assistance

Implementation drafted with AI assistance and iterated against the
benchmarks below. I've reviewed and own the code, including the gate
threshold which I picked from the sweep in [Threshold
(`BULK_FILL_MIN_LEN`)](#threshold-bulk_fill_min_len). Per the project's
[CONTRIBUTING guidance on AI-generated
submissions](https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md#ai-generated-submissions).

## Rationale for this change

When writing a nullable leaf (primitive) Arrow array, `write_leaf`
builds the definition-level buffer one element at a time, mapping each
null bit to a level. For columns that are mostly null this does
~`num_rows` of branchy work and allocates a `num_rows`-element level
buffer even though almost every produced level is the same value. apache#9954
adds an O(1) fast path for the *entirely* null case; this PR covers the
*sparse* (mostly-but-not-entirely null) case it doesn't handle, the
literal subject of apache#9731 ("a column that is 99% null … ~100x more work
than necessary").

## What changes are included in this PR?

A single popcount pass over the null mask
(`Buffer::count_set_bits_offset`, O(`num_rows`/64)) counts the valid
values in the range. When the slice is majority-null, the
definition-level buffer is bulk-filled with the null level (a vectorized
`Vec::resize` memset) and only the non-null positions (from
`NullBuffer::valid_indices()`) are overwritten. The existing per-row
path is kept for non-majority-null slices, so balanced and null-light
columns are unaffected. Both branches share the same `let range_nulls =
nulls.slice(range.start, len)` slicing idiom; the slow path uses
`range_nulls.iter()` for the def-level map and
`range_nulls.valid_indices().map(|i| i + range.start)` for
`non_null_indices`, with no `unsafe`. Output is byte-identical: the
level *values* are unchanged, just produced via memset+scatter (fast
path) or via the high-level `NullBuffer` iterators (slow path) instead
of a manual `BitIndexIterator` walk.
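The memset-plus-scatter idea can be sketched on a plain bitmask. This is an illustration under assumptions, not the PR's code: validity is a `&[u64]` with one bit per row (set means valid) rather than an Arrow `NullBuffer`, the function names are hypothetical, and the length threshold is left out.

```rust
/// Per-row reference path: maps each validity bit to a level.
fn def_levels_per_row(validity: &[u64], len: usize, null_level: i16, valid_level: i16) -> Vec<i16> {
    (0..len)
        .map(|i| {
            let is_valid = ((validity[i / 64] >> (i % 64)) & 1) == 1;
            if is_valid { valid_level } else { null_level }
        })
        .collect()
}

/// Bulk path: fill with the null level, then overwrite only valid positions.
fn def_levels_bulk_fill(validity: &[u64], len: usize, null_level: i16, valid_level: i16) -> Vec<i16> {
    let mut levels = Vec::new();
    levels.resize(len, null_level); // one bulk fill instead of `len` branches
    for (word_idx, &word) in validity.iter().enumerate() {
        let mut w = word;
        while w != 0 {
            let idx = word_idx * 64 + w.trailing_zeros() as usize;
            if idx < len {
                levels[idx] = valid_level;
            }
            w &= w - 1; // clear the lowest set bit
        }
    }
    levels
}

/// Chooses a path from a single popcount pass over the mask.
/// Assumes bits at positions >= `len` are zero.
/// Returns the levels and whether the bulk path was taken.
fn def_levels(validity: &[u64], len: usize, null_level: i16, valid_level: i16) -> (Vec<i16>, bool) {
    let valid: usize = validity.iter().map(|w| w.count_ones() as usize).sum();
    let majority_null = (len - valid) * 2 >= len;
    if majority_null {
        (def_levels_bulk_fill(validity, len, null_level, valid_level), true)
    } else {
        (def_levels_per_row(validity, len, null_level, valid_level), false)
    }
}
```

Both paths produce the same level values, so the choice between them only affects how much work is done.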

## Threshold (`BULK_FILL_MIN_LEN`)

The bulk-fill fast path is gated on two conditions:

- `len >= BULK_FILL_MIN_LEN` (currently 64). Per-call
slice/popcount/iterator overhead only amortizes on sizable sub-ranges.
List/struct paths call `write_leaf` many times with tiny ranges (avg
list length 1-5); paying any per-call popcount there would regress them.
A threshold sweep at T = {0, 16, 32, 64, 128, 256} on Ryzen 9 9950X
shows the regression floor settles by T=32, and the choice of 64 gives
~12x margin over the average list length without losing the
flat-primitive wins.
- `nulls.null_count() * 2 >= nulls.len()`. The cached `null_count()` is
O(1), so this check is free. We use the buffer-wide density as a
heuristic for the sub-range; for full-array writes (the primary target,
flat primitive columns) it's exact.

Even when the gate skips the fast path, evaluating it across
high-frequency call sites (~10K calls in some list benchmarks) is a
small structural cost (~1-2% on list-sparse cases). The wins on the
targeted shapes (-35% sparse-primitive, -66% all-null primitive) far
outweigh that. Reducing the cost further would require hoisting the
decision into the caller.
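Expressed as code, the two-condition gate is roughly the following. The constant name and value come from the description above; the free function `use_bulk_fill` and its parameter names are hypothetical.

```rust
/// Minimum sub-range length for the fast path (value from the description).
const BULK_FILL_MIN_LEN: usize = 64;

/// Returns true when the bulk-fill fast path should be taken.
/// `range_len` is the length of the sub-range being written; `null_count`
/// and `buffer_len` describe the whole null buffer, whose density is used
/// as a heuristic for the sub-range.
fn use_bulk_fill(range_len: usize, null_count: usize, buffer_len: usize) -> bool {
    range_len >= BULK_FILL_MIN_LEN && null_count * 2 >= buffer_len
}
```

Both operands are already available to the caller, so the check itself involves no pass over the data.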

## Are these changes tested?

Existing tests cover this path: `cargo test -p parquet --features arrow
--lib arrow_writer` is green (136 tests, full of nulls and roundtrips);
full `cargo test -p parquet --features arrow` green modulo the
pre-existing `PARQUET_TEST_DATA` submodule failures (unrelated, same on
`main`). `cargo clippy -p parquet --features arrow --lib` and `cargo fmt
--check` clean. The `unsafe get_unchecked_mut` flagged in the original
revision was replaced via `NullBuffer::valid_indices()`; the slow path
also dropped its `unsafe value_unchecked` for the same reason.

## Are there any user-facing changes?

None.

## Benchmarks

`cargo bench -p parquet --bench arrow_writer`, 1M rows × 7 nullable
primitive columns, local Ryzen 9 9950X:

```
primitive_sparse_99pct_null/default   11.88 ms -> 9.13 ms   (-23%)   <- the case apache#9731 calls out
primitive_all_null/default             5.65 ms -> 2.33 ms   (-59%)   (subsumed by apache#9954's O(1) path if that lands first)
struct_sparse_99pct_null/default       5.67 ms -> 5.32 ms   (-6%)
struct_all_null/default                1.52 ms -> 1.31 ms   (-14%)
list_primitive_sparse_99pct_null, primitive (25% null), primitive_non_null, bool, string:  within noise (no regression)
```

The CI benchmark bot (GKE `c4a-highmem-16`, Neoverse-V2) on the
post-fixup revision shows the same shape with stronger relative wins on
the targeted cases:

```
primitive_all_null/default              2.47x (11.0ms -> 4.4ms)
primitive_sparse_99pct_null/default     1.60x (16.8ms -> 10.5ms)
primitive_all_null/{bloom_filter,cdc,parquet_2,zstd,zstd_parquet_2}    1.38x to 2.48x
primitive_sparse_99pct_null/{...}        1.28x to 1.59x
list_primitive*, list_primitive_sparse_99pct_null*:                    1.00x to 1.01x (within noise)
```

Microbench of the definition-level fill in isolation: 10.3x @ 100%-null,
8.6x @ 99%, 5.2x @ 90%, 1.9x @ 50%, 0.93x @ 10%, 0.81x @ 0%. Crossover ≈
12-15% null, clean win above ~25%; the `>= 50% null` guard is
conservative.

This is the *materialization*-cost half of apache#9731 (~30% of the 99%-null
write); the *walk*-cost half, a run-length input to the level encoder so
the column writer doesn't even iterate all `num_rows` levels, is the
larger structural change apache#9653 is heading toward. This PR is
deliberately small and isolated so it lands independently of and rebases
cleanly under that work.

---------

Co-authored-by: Ryan Stewart <noreply@example.com>
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
# Which issue does this PR close?

- Spun off from apache#9653
- Contributes to apache#9731

# Rationale for this change

See apache#9731

# What changes are included in this PR?

Changes `byte_array` encoder methods (`FallbackEncoder::encode`,
`DictEncoder::encode`, etc) and all `get_*_array_slice` functions from
`&[usize]` to `impl ExactSizeIterator<Item = usize>`.
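A small sketch of what this signature change allows. `total_encoded_len` is a hypothetical function, not one of the encoder methods named above; it only shows that callers can pass either a slice-backed iterator or a lazily produced range without first collecting indices into a `Vec`.

```rust
/// Hypothetical consumer of value indices. Because the parameter is an
/// `ExactSizeIterator`, the count is known up front without a slice.
/// Returns (number of values, total byte length).
fn total_encoded_len(
    values: &[&str],
    indices: impl ExactSizeIterator<Item = usize>,
) -> (usize, usize) {
    let count = indices.len();
    let bytes: usize = indices.map(|i| values[i].len()).sum();
    (count, bytes)
}

/// Existing callers holding a `&[usize]` can adapt with `iter().copied()`.
fn from_slice(values: &[&str], indices: &[usize]) -> (usize, usize) {
    total_encoded_len(values, indices.iter().copied())
}

/// A fully non-null column can pass a range, with no index buffer at all.
fn from_all_rows(values: &[&str]) -> (usize, usize) {
    total_encoded_len(values, 0..values.len())
}
```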

# Are these changes tested?

All tests passing.

# Are there any user-facing changes?

None.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Labels

arrow (changes to the arrow crate), parquet (changes to the parquet crate), performance

Development

Successfully merging this pull request may close these issues.

Parquet: level encoding cost should be proportional to RLE output size

6 participants