Add wide-schema benchmark suite for measuring per-file metadata overhead #21970
Open
adriangb wants to merge 1 commit into apache:main
Conversation
adriangb force-pushed from d996aee to 147617d
Adds a new sql_benchmarks suite that isolates wide-schema scan
overhead in selective Parquet queries: the regime where most of the
work is loading footers / column-chunk metadata rather than reading
row data, and where that cost scales with the number of column chunks
in the dataset rather than with the number of columns the query touches.
The wide_schema suite has two subgroups (selected via BENCH_SUBGROUP):
- wide: 1024 cols x 256 files x 50k rows (~225 MB) — the workload
- narrow: 8 cols x 256 files x 50k rows (~110 MB) — internal
baseline, only meaningful as a comparison point
Both share row count, file count, and per-file row-group structure;
only schema width differs. All 4 queries run on both subgroups so
every wide number has a directly comparable narrow baseline.
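For illustration, a minimal sketch of what that subgroup switch could look like inside the harness (a hypothetical helper, not the suite's actual wiring):

```rust
use std::env;

// Hypothetical helper (not the suite's actual code): resolve the
// dataset directory from BENCH_SUBGROUP, defaulting to the wide workload.
fn subgroup_data_dir() -> String {
    let subgroup = env::var("BENCH_SUBGROUP").unwrap_or_else(|_| "wide".to_string());
    assert!(
        subgroup == "wide" || subgroup == "narrow",
        "BENCH_SUBGROUP must be 'wide' or 'narrow'"
    );
    format!("benchmarks/data/wide_schema/{subgroup}")
}
```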
A new gen_wide_data binary synthesizes both datasets in ~60 s with no
external data source. The 8-column base schema (id, value, count, ts,
category, flag, status, text) carries deterministic data; the
suffix-renamed copies (2..N) from the widening are zero-filled (zero
rather than null, since reader-side null-array shortcuts mute the
slowdown by ~35%).
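As a rough sketch of that widening scheme (illustrative arrow-rs code, not the actual gen_wide_data implementation; only two of the eight base columns are shown, and the value formula is invented to resemble the sample output):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// Build one batch: zero-filled widened copies first, then the
// deterministic base columns, so queries reference columns that sit
// past all the padding.
fn widened_batch(n_copies: usize, rows: usize) -> RecordBatch {
    let mut fields: Vec<Field> = Vec::new();
    let mut columns: Vec<ArrayRef> = Vec::new();

    for n in 2..=n_copies {
        fields.push(Field::new(format!("id_{n}"), DataType::Int64, true));
        // Zero-filled, not null: null arrays would let reader-side
        // shortcuts mute the wide-schema slowdown.
        columns.push(Arc::new(Int64Array::from(vec![0i64; rows])) as ArrayRef);
        fields.push(Field::new(format!("value_{n}"), DataType::Float64, true));
        columns.push(Arc::new(Float64Array::from(vec![0.0f64; rows])) as ArrayRef);
    }

    fields.push(Field::new("id", DataType::Int64, true));
    columns.push(Arc::new(Int64Array::from_iter_values(0..rows as i64)) as ArrayRef);
    fields.push(Field::new("value", DataType::Float64, true));
    columns.push(Arc::new(Float64Array::from_iter_values(
        (0..rows).map(|i| i as f64 * 13.7),
    )) as ArrayRef);

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns).unwrap()
}
```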
Query coverage:
- Q01: filter + project + ORDER BY + LIMIT (TopK shortcut)
- Q02: project 1 column with tight filter + LIMIT 1
- Q03: tight filter + small projection, no sort
- Q04: two low-cardinality string filters + a non-stat-prunable
modulo predicate for tight selectivity (~0.005% match rate),
project two columns, no LIMIT or ORDER BY (Q04's shape is sketched below)
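For concreteness, a Q04-shaped query run through DataFusion might look like this (a sketch only: the table name, registration path, and modulo constant are illustrative, not the suite's exact values):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the synthesized dataset directory as a table
    // (path assumes the layout shown later in this PR).
    ctx.register_parquet(
        "events",
        "benchmarks/data/wide_schema/wide",
        ParquetReadOptions::default(),
    )
    .await?;
    // Two low-cardinality string filters plus a modulo predicate that
    // row-group statistics cannot prune; the constant here is chosen
    // only to suggest the ~0.005% match rate, not copied from the suite.
    let df = ctx
        .sql(
            "SELECT id, value FROM events \
             WHERE category = 'c1' AND flag = 'f1' AND id % 20000 = 0",
        )
        .await?;
    df.show().await?;
    Ok(())
}
```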
For Q04 specifically, cold-start datafusion-cli shows a ~15x slowdown
for wide vs. narrow; EXPLAIN ANALYZE shows metadata_load_time scaling
141x while bloom_filter_eval_time and statistics_eval_time stay flat.
bench.sh adds:
- data wide_schema: synthesizes both wide and narrow datasets
- run wide_schema: runs the wide subgroup, then the narrow
baseline subgroup, for query-by-query comparison
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
adriangb force-pushed from 147617d to 63eb746
alamb reviewed May 5, 2026
alamb left a comment:
I tested locally like this:

nice ./benchmarks/bench.sh data wide_schema && nice ./benchmarks/bench.sh run wide_schema

Looks good
wide_schema/Q01_wide time: [107.95 ms 108.65 ms 109.41 ms]
wide_schema/Q02_wide time: [9.6010 ms 9.6373 ms 9.6765 ms]
wide_schema/Q03_wide time: [11.797 ms 11.878 ms 11.943 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) low mild
wide_schema/Q04_wide time: [82.428 ms 83.461 ms 84.784 ms]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) high mild
1 (10.00%) high severe
Loading benchmarks...
Loaded benchmarks in 3 ms ...
wide_schema/Q01_narrow time: [59.200 ms 60.021 ms 60.854 ms]
wide_schema/Q02_narrow time: [2.6507 ms 2.6881 ms 2.7232 ms]
wide_schema/Q03_narrow time: [6.0559 ms 6.3746 ms 6.6960 ms]
wide_schema/Q04_narrow time: [21.359 ms 21.589 ms 21.839 ms]
The data files are many smallish Parquet files
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion3$ du -s -h benchmarks/data/wide_schema/*
108M benchmarks/data/wide_schema/narrow
216M benchmarks/data/wide_schema/wide
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion3$ du -s -h benchmarks/data/wide_schema/narrow/*
432K benchmarks/data/wide_schema/narrow/events_0001.parquet
432K benchmarks/data/wide_schema/narrow/events_0002.parquet
428K benchmarks/data/wide_schema/narrow/events_0003.parquet
436K benchmarks/data/wide_schema/narrow/events_0004.parquet
428K benchmarks/data/wide_schema/narrow/events_0005.parquet
...
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion3$ du -s -h benchmarks/data/wide_schema/wide/*
864K benchmarks/data/wide_schema/wide/events_0001.parquet
864K benchmarks/data/wide_schema/wide/events_0002.parquet
864K benchmarks/data/wide_schema/wide/events_0003.parquet
868K benchmarks/data/wide_schema/wide/events_0004.parquet
860K benchmarks/data/wide_schema/wide/events_0005.parquet
868K benchmarks/data/wide_schema/wide/events_0006.parquet
864K benchmarks/data/wide_schema/wide/events_0007.parquet
...
> select * from 'benchmarks/data/wide_schema/narrow' limit 10;
+---------+--------------------+-------+------------+----------+------+--------+---------------------------------------------+
| id | value | count | ts | category | flag | status | text |
+---------+--------------------+-------+------------+----------+------+--------+---------------------------------------------+
| 6250000 | 3500.0 | 0 | 1997-06-23 | c1 | f1 | s0 | synthetic event row 0006250000 payload text |
| 6250001 | 3513.7000000029802 | 1 | 1997-06-24 | c2 | f2 | s1 | synthetic event row 0006250001 payload text |
| 6250002 | 3527.3999999910593 | 2 | 1997-06-25 | c3 | f0 | s0 | synthetic event row 0006250002 payload text |
| 6250003 | 3541.0999999940395 | 3 | 1997-06-26 | c4 | f1 | s1 | synthetic event row 0006250003 payload text |
| 6250004 | 3554.7999999970198 | 4 | 1997-06-27 | c5 | f2 | s0 | synthetic event row 0006250004 payload text |
| 6250005 | 3568.5 | 5 | 1997-06-28 | c6 | f0 | s1 | synthetic event row 0006250005 payload text |
| 6250006 | 3582.2000000029802 | 6 | 1997-06-29 | c0 | f1 | s0 | synthetic event row 0006250006 payload text |
| 6250007 | 3595.8999999910593 | 7 | 1997-06-30 | c1 | f2 | s1 | synthetic event row 0006250007 payload text |
| 6250008 | 3609.5999999940395 | 8 | 1997-07-01 | c2 | f0 | s0 | synthetic event row 0006250008 payload text |
| 6250009 | 3623.2999999970198 | 9 | 1997-07-02 | c3 | f1 | s1 | synthetic event row 0006250009 payload text |
+---------+--------------------+-------+------------+----------+------+--------+---------------------------------------------+
10 row(s) fetched.
Elapsed 0.029 seconds.
And the wide schema:
> describe 'benchmarks/data/wide_schema/wide';
+--------------+-----------+-------------+
| column_name | data_type | is_nullable |
+--------------+-----------+-------------+
| id_2 | Int64 | YES |
| value_2 | Float64 | YES |
| count_2 | Int64 | YES |
| ts_2 | Date32 | YES |
| category_2 | Utf8View | YES |
| flag_2 | Utf8View | YES |
| status_2 | Utf8View | YES |
| text_2 | Utf8View | YES |
| id_3 | Int64 | YES |
| value_3 | Float64 | YES |
| count_3 | Int64 | YES |
| ts_3 | Date32 | YES |
| category_3 | Utf8View | YES |
| flag_3 | Utf8View | YES |
| status_3 | Utf8View | YES |
| text_3 | Utf8View | YES |
...
| category_128 | Utf8View | YES |
| flag_128 | Utf8View | YES |
| status_128 | Utf8View | YES |
| text_128 | Utf8View | YES |
| id | Int64 | YES |
| value | Float64 | YES |
| count | Int64 | YES |
| ts | Date32 | YES |
| category | Utf8View | YES |
| flag | Utf8View | YES |
| status | Utf8View | YES |
| text | Utf8View | YES |
+--------------+-----------+-------------+
1024 row(s) fetched.
Elapsed 0.317 seconds.
Review comment on the gen_wide_data doc comments:
//! past all the padding, exercising any per-column-position cost in
//! the scanner / planner.
//!
//! Zero-filled rather than null because the parquet reader can
adriangb (Author):
I wanted to avoid the boilerplate and cpu time of generating random data. Do you have a suggestion on what data we should fill in?
alamb approved these changes May 5, 2026
Which issue does this PR close?
#21968
Rationale for this change
Adds a benchmark suite that isolates a wide-schema scan overhead in selective parquet queries: the regime where most of the work is loading footers / column-chunk metadata rather than reading row data, and that cost scales with the number of column chunks in the dataset rather than with the number of columns the query references. Existing benchmarks don't exercise this shape — most TPC-H/ClickBench queries either touch many columns or filter heavily enough that scan work dominates. We need a focused benchmark so this kind of regression is measurable in CI and so optimizations to the wide-schema scan path can be validated.
What changes are included in this PR?
A new sql_benchmarks suite, wide_schema/ (under benchmarks/sql_benchmarks/), with two subgroups selected via BENCH_SUBGROUP:

- wide — runs against a synthesized wide dataset (1024 cols × 256 files × 50k rows, ~225 MB). This is the actual workload.
- narrow — runs the same SQL against an 8-col version of the same dataset (same row count, file count, per-file row-group shape, ~110 MB). This subgroup exists only as a baseline for the wide subgroup; reading its numbers in isolation tells you very little. The per-query wide-vs-narrow ratio is what isolates the schema-width cost.

All 4 queries run on both subgroups so every wide number has a directly comparable narrow baseline.
A new binary, gen_wide_data (in benchmarks/src/bin/), synthesizes both datasets in ~60 s with no external data dependency. The 8-column base schema is generic (id, value, count, ts, category, flag, status, text) and carries deterministic synthetic data; the suffix-renamed copies (id_2, id_3, …, id_128, etc.) are zero-filled. Two design notes:

- The copies are zero-filled rather than null, since reader-side null-array shortcuts mute the slowdown by ~35%.
- The base columns are written after all the padding copies, so the queries reference columns past all the padding, exercising any per-column-position cost in the scanner / planner.

Query coverage:
- Q01 — filter + project + ORDER BY + LIMIT (TopK shortcut)
- Q02 — project 1 column with a tight filter and LIMIT 1
- Q03 — tight filter + small projection, no sort
- Q04 — two low-cardinality string filters + a non-stat-prunable modulo predicate for tight selectivity, project two columns, no LIMIT or ORDER BY

bench.sh additions:

- data wide_schema: synthesizes both wide and narrow datasets
- run wide_schema: runs the wide subgroup, then the narrow baseline subgroup, for query-by-query comparison

Are these changes tested?
Yes — measurements on an M-series Mac (12-way parallel scan, hot OS cache), via Criterion (3 s warmup, 10 samples, median), cold-start datafusion-cli (Q04 shape, median of 3), and EXPLAIN ANALYZE phase deltas (Q04, cumulative across 12 scan tasks).
Same qualitative shape: metadata_load_time and per-file setup scale with column-chunk count; predicate-evaluation phases stay flat regardless of schema width.

cargo fmt --all and cargo clippy --bin gen_wide_data --all-features -- -D warnings are clean.

Are there any user-facing changes?
No public API changes. New benchmark suite + new utility binary under benchmarks/.