Add wide-schema benchmark suite for measuring per-file metadata overhead #21970
Open
adriangb wants to merge 1 commit into apache:main
Conversation
adriangb force-pushed from d996aee to 147617d
Adds a new sql_benchmarks suite that isolates wide-schema scan
overhead in selective Parquet queries: the regime where most of the
work is loading footers / column-chunk metadata rather than reading
row data, and where that cost scales with the number of column chunks
in the dataset rather than with the number of columns the query touches.
The wide_schema suite has two subgroups (selected via BENCH_SUBGROUP):
- wide: 1024 cols x 256 files x 50k rows (~225 MB) — the workload
- narrow: 8 cols x 256 files x 50k rows (~110 MB) — internal
baseline, only meaningful as a comparison point
Both share row count, file count, and per-file row-group structure;
only schema width differs. All 4 queries run on both subgroups so
every wide number has a directly comparable narrow baseline.
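For illustration, a minimal sketch of what that subgroup switch could look like inside the harness (a hypothetical helper, not the suite's actual wiring):

```rust
use std::env;

// Hypothetical helper (not the suite's actual code): resolve the
// dataset directory from BENCH_SUBGROUP, defaulting to the wide workload.
fn subgroup_data_dir() -> String {
    let subgroup = env::var("BENCH_SUBGROUP").unwrap_or_else(|_| "wide".to_string());
    assert!(
        subgroup == "wide" || subgroup == "narrow",
        "BENCH_SUBGROUP must be 'wide' or 'narrow'"
    );
    format!("benchmarks/data/wide_schema/{subgroup}")
}
```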
A new gen_wide_data binary synthesizes both datasets in ~60 s with no
external data source. The 8-column base schema (id, value, count, ts,
category, flag, status, text) carries deterministic data; the
suffix-renamed copies (2..N) from the widening are zero-filled (zero
rather than null, since reader-side null-array shortcuts mute the
slowdown by ~35%).
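As a rough sketch of that widening scheme (illustrative arrow-rs code, not the actual gen_wide_data implementation; only two of the eight base columns are shown, and the value formula is invented to resemble the sample output):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// Build one batch: zero-filled widened copies first, then the
// deterministic base columns, so queries reference columns that sit
// past all the padding.
fn widened_batch(n_copies: usize, rows: usize) -> RecordBatch {
    let mut fields: Vec<Field> = Vec::new();
    let mut columns: Vec<ArrayRef> = Vec::new();

    for n in 2..=n_copies {
        fields.push(Field::new(format!("id_{n}"), DataType::Int64, true));
        // Zero-filled, not null: null arrays would let reader-side
        // shortcuts mute the wide-schema slowdown.
        columns.push(Arc::new(Int64Array::from(vec![0i64; rows])) as ArrayRef);
        fields.push(Field::new(format!("value_{n}"), DataType::Float64, true));
        columns.push(Arc::new(Float64Array::from(vec![0.0f64; rows])) as ArrayRef);
    }

    fields.push(Field::new("id", DataType::Int64, true));
    columns.push(Arc::new(Int64Array::from_iter_values(0..rows as i64)) as ArrayRef);
    fields.push(Field::new("value", DataType::Float64, true));
    columns.push(Arc::new(Float64Array::from_iter_values(
        (0..rows).map(|i| i as f64 * 13.7),
    )) as ArrayRef);

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns).unwrap()
}
```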
Query coverage:
- Q01: filter + project + ORDER BY + LIMIT (TopK shortcut)
- Q02: project 1 column with tight filter + LIMIT 1
- Q03: tight filter + small projection, no sort
- Q04: two low-cardinality string filters + a non-stat-prunable
modulo predicate for tight selectivity (~0.005% match rate),
project two columns, no LIMIT or ORDER BY (Q04's shape is sketched below)
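For concreteness, a Q04-shaped query run through DataFusion might look like this (a sketch only: the table name, registration path, and modulo constant are illustrative, not the suite's exact values):

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the synthesized dataset directory as a table
    // (path assumes the layout shown later in this PR).
    ctx.register_parquet(
        "events",
        "benchmarks/data/wide_schema/wide",
        ParquetReadOptions::default(),
    )
    .await?;
    // Two low-cardinality string filters plus a modulo predicate that
    // row-group statistics cannot prune; the constant here is chosen
    // only to suggest the ~0.005% match rate, not copied from the suite.
    let df = ctx
        .sql(
            "SELECT id, value FROM events \
             WHERE category = 'c1' AND flag = 'f1' AND id % 20000 = 0",
        )
        .await?;
    df.show().await?;
    Ok(())
}
```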
For Q04 specifically, cold-start datafusion-cli shows a ~15x slowdown
for wide vs. narrow; EXPLAIN ANALYZE shows metadata_load_time scaling
141x while bloom_filter_eval_time and statistics_eval_time stay flat.
bench.sh adds:
- data wide_schema: synthesizes both wide and narrow datasets
- run wide_schema: runs the wide subgroup, then the narrow
baseline subgroup, for query-by-query comparison
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
adriangb force-pushed from 147617d to 63eb746
alamb reviewed May 5, 2026
alamb left a comment:
I tested locally like this:

nice ./benchmarks/bench.sh data wide_schema && nice ./benchmarks/bench.sh run wide_schema

Looks good
wide_schema/Q01_wide time: [107.95 ms 108.65 ms 109.41 ms]
wide_schema/Q02_wide time: [9.6010 ms 9.6373 ms 9.6765 ms]
wide_schema/Q03_wide time: [11.797 ms 11.878 ms 11.943 ms]
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) low mild
wide_schema/Q04_wide time: [82.428 ms 83.461 ms 84.784 ms]
Found 2 outliers among 10 measurements (20.00%)
1 (10.00%) high mild
1 (10.00%) high severe
Loading benchmarks...
Loaded benchmarks in 3 ms ...
wide_schema/Q01_narrow time: [59.200 ms 60.021 ms 60.854 ms]
wide_schema/Q02_narrow time: [2.6507 ms 2.6881 ms 2.7232 ms]
wide_schema/Q03_narrow time: [6.0559 ms 6.3746 ms 6.6960 ms]
wide_schema/Q04_narrow time: [21.359 ms 21.589 ms 21.839 ms]
The data files are many smallish Parquet files
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion3$ du -s -h benchmarks/data/wide_schema/*
108M benchmarks/data/wide_schema/narrow
216M benchmarks/data/wide_schema/wide
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion3$ du -s -h benchmarks/data/wide_schema/narrow/*
432K benchmarks/data/wide_schema/narrow/events_0001.parquet
432K benchmarks/data/wide_schema/narrow/events_0002.parquet
428K benchmarks/data/wide_schema/narrow/events_0003.parquet
436K benchmarks/data/wide_schema/narrow/events_0004.parquet
428K benchmarks/data/wide_schema/narrow/events_0005.parquet
...
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion3$ du -s -h benchmarks/data/wide_schema/wide/*
864K benchmarks/data/wide_schema/wide/events_0001.parquet
864K benchmarks/data/wide_schema/wide/events_0002.parquet
864K benchmarks/data/wide_schema/wide/events_0003.parquet
868K benchmarks/data/wide_schema/wide/events_0004.parquet
860K benchmarks/data/wide_schema/wide/events_0005.parquet
868K benchmarks/data/wide_schema/wide/events_0006.parquet
864K benchmarks/data/wide_schema/wide/events_0007.parquet
...
> select * from 'benchmarks/data/wide_schema/narrow' limit 10;
+---------+--------------------+-------+------------+----------+------+--------+---------------------------------------------+
| id | value | count | ts | category | flag | status | text |
+---------+--------------------+-------+------------+----------+------+--------+---------------------------------------------+
| 6250000 | 3500.0 | 0 | 1997-06-23 | c1 | f1 | s0 | synthetic event row 0006250000 payload text |
| 6250001 | 3513.7000000029802 | 1 | 1997-06-24 | c2 | f2 | s1 | synthetic event row 0006250001 payload text |
| 6250002 | 3527.3999999910593 | 2 | 1997-06-25 | c3 | f0 | s0 | synthetic event row 0006250002 payload text |
| 6250003 | 3541.0999999940395 | 3 | 1997-06-26 | c4 | f1 | s1 | synthetic event row 0006250003 payload text |
| 6250004 | 3554.7999999970198 | 4 | 1997-06-27 | c5 | f2 | s0 | synthetic event row 0006250004 payload text |
| 6250005 | 3568.5 | 5 | 1997-06-28 | c6 | f0 | s1 | synthetic event row 0006250005 payload text |
| 6250006 | 3582.2000000029802 | 6 | 1997-06-29 | c0 | f1 | s0 | synthetic event row 0006250006 payload text |
| 6250007 | 3595.8999999910593 | 7 | 1997-06-30 | c1 | f2 | s1 | synthetic event row 0006250007 payload text |
| 6250008 | 3609.5999999940395 | 8 | 1997-07-01 | c2 | f0 | s0 | synthetic event row 0006250008 payload text |
| 6250009 | 3623.2999999970198 | 9 | 1997-07-02 | c3 | f1 | s1 | synthetic event row 0006250009 payload text |
+---------+--------------------+-------+------------+----------+------+--------+---------------------------------------------+
10 row(s) fetched.
Elapsed 0.029 seconds.
And the wide schema:
> describe 'benchmarks/data/wide_schema/wide';
+--------------+-----------+-------------+
| column_name | data_type | is_nullable |
+--------------+-----------+-------------+
| id_2 | Int64 | YES |
| value_2 | Float64 | YES |
| count_2 | Int64 | YES |
| ts_2 | Date32 | YES |
| category_2 | Utf8View | YES |
| flag_2 | Utf8View | YES |
| status_2 | Utf8View | YES |
| text_2 | Utf8View | YES |
| id_3 | Int64 | YES |
| value_3 | Float64 | YES |
| count_3 | Int64 | YES |
| ts_3 | Date32 | YES |
| category_3 | Utf8View | YES |
| flag_3 | Utf8View | YES |
| status_3 | Utf8View | YES |
| text_3 | Utf8View | YES |
...
| category_128 | Utf8View | YES |
| flag_128 | Utf8View | YES |
| status_128 | Utf8View | YES |
| text_128 | Utf8View | YES |
| id | Int64 | YES |
| value | Float64 | YES |
| count | Int64 | YES |
| ts | Date32 | YES |
| category | Utf8View | YES |
| flag | Utf8View | YES |
| status | Utf8View | YES |
| text | Utf8View | YES |
+--------------+-----------+-------------+
1024 row(s) fetched.
Elapsed 0.317 seconds.
Review comment on the gen_wide_data doc comments:
//! past all the padding, exercising any per-column-position cost in
//! the scanner / planner.
//!
//! Zero-filled rather than null because the parquet reader can
adriangb (Author):
I wanted to avoid the boilerplate and cpu time of generating random data. Do you have a suggestion on what data we should fill in?
alamb approved these changes May 5, 2026
Which issue does this PR close?
#21968
Rationale for this change
Adds a benchmark suite that isolates a wide-schema scan overhead in selective parquet queries: the regime where most of the work is loading footers / column-chunk metadata rather than reading row data, and that cost scales with the number of column chunks in the dataset rather than with the number of columns the query references. Existing benchmarks don't exercise this shape — most TPC-H/ClickBench queries either touch many columns or filter heavily enough that scan work dominates. We need a focused benchmark so this kind of regression is measurable in CI and so optimizations to the wide-schema scan path can be validated.
What changes are included in this PR?
A new sql_benchmarks suite, wide_schema/ (under benchmarks/sql_benchmarks/), with two subgroups selected via BENCH_SUBGROUP:

- wide — runs against a synthesized wide dataset (1024 cols × 256 files × 50k rows, ~225 MB). This is the actual workload.
- narrow — runs the same SQL against an 8-col version of the same dataset (same row count, file count, per-file row-group shape, ~110 MB). This subgroup exists only as a baseline for the wide subgroup; reading its numbers in isolation tells you very little. The per-query wide-vs-narrow ratio is what isolates the schema-width cost.

All 4 queries run on both subgroups so every wide number has a directly comparable narrow baseline.
A new binary, gen_wide_data (in benchmarks/src/bin/), synthesizes both datasets in ~60 s with no external data dependency. The 8-column base schema is generic (id, value, count, ts, category, flag, status, text) and carries deterministic synthetic data; the suffix-renamed copies (id_2, id_3, …, id_128, etc.) are zero-filled. Two design notes:

- The copies are zero-filled rather than null, since reader-side null-array shortcuts mute the slowdown by ~35%.
- The base columns are written after all the padding copies, so the queries reference columns past all the padding, exercising any per-column-position cost in the scanner / planner.

Query coverage:
- Q01 — filter + project + ORDER BY + LIMIT (TopK shortcut)
- Q02 — project 1 column with a tight filter and LIMIT 1
- Q03 — tight filter + small projection, no sort
- Q04 — two low-cardinality string filters + a non-stat-prunable modulo predicate for tight selectivity, project two columns, no LIMIT or ORDER BY

bench.sh additions:

- data wide_schema: synthesizes both wide and narrow datasets
- run wide_schema: runs the wide subgroup, then the narrow baseline subgroup, for query-by-query comparison

Are these changes tested?
Yes — measurements on an M-series Mac (12-way parallel scan, hot OS cache), via Criterion (3 s warmup, 10 samples, median), cold-start datafusion-cli (Q04 shape, median of 3), and EXPLAIN ANALYZE phase deltas (Q04, cumulative across 12 scan tasks).
Same qualitative shape: metadata_load_time and per-file setup scale with column-chunk count; predicate-evaluation phases stay flat regardless of schema width.

cargo fmt --all and cargo clippy --bin gen_wide_data --all-features -- -D warnings are clean.

Are there any user-facing changes?
No public API changes. New benchmark suite + new utility binary under benchmarks/.