[RDF] Always write nominal columns in Snapshot with variations. #21246

Merged
hageboeck merged 2 commits into root-project:master from hageboeck:RDF_snapshot_twoVariations
Apr 28, 2026

@hageboeck hageboeck commented Feb 11, 2026

Before, the nominal columns were omitted from the output if their filter
didn't pass. This is, however, insufficient, because if at least two
variations are run, (some of) the nominal columns might still be in use.

One example is this setup:

df.Vary("x", [...], "var0")
  .Vary("y", [...], "var1")
  .Filter([...], {"x", "y"})
  .Snapshot([...], {"x", "y"})

In every variation universe, a Filter on x and y is used. If one writes down
what the value of the nominal and varied columns should be in the snapshot,
one gets conflicting requirements for the nominal columns if one keeps zeroing
the invalid columns to save space, as was done before.

Imagine a dataset where only the var0 variation passes the filter. One gets:

| | x | y | x_var0 | y_var1 |
|---|---|---|---|---|
| Filter | fail | fail | pass | fail |
| Should the columns be valid? | No, because nominal didn't pass | No, because nominal didn't pass, but yes because var0 passed | Yes | No |
| What's written before this PR | 0 | 0 | x_var0 | 0 |
| What's written now | x | y | x_var0 | 0 (invalid) |

The problem is the column y:
In the nominal universe, it shouldn't be present, but since it's used as an unvaried column also in the var0 universe, it must be in the output dataset. The same argument can be made for x if var1 were to pass. Therefore, the nominal columns must remain visible in the output dataset as soon as any filter in any variation passes.
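
To make the rule change concrete, here is a minimal toy model of the table above. It is only a sketch of the behaviour described in this PR, not the actual Snapshot code: `FilterResults`, `WrittenColumns`, `writtenBefore` and `writtenAfter` are made-up names, and the column labels follow the shorthand used in this description.

```cpp
#include <cassert>

// Toy model of the table above. Not ROOT's implementation: the struct and
// function names are made up for illustration.

// Outcome of the Filter in each variation universe for one dataset entry.
struct FilterResults {
   bool nominal;
   bool var0; // universe where x is replaced by x_var0
   bool var1; // universe where y is replaced by y_var1
};

// Which output columns hold a real value (true) or are zeroed/invalid (false).
struct WrittenColumns {
   bool x;
   bool y;
   bool x_var0;
   bool y_var1;
};

// Rule before this PR: every column follows the filter of its own universe.
WrittenColumns writtenBefore(const FilterResults &f)
{
   return {f.nominal, f.nominal, f.var0, f.var1};
}

// Rule after this PR: nominal columns are kept as soon as any universe passes.
WrittenColumns writtenAfter(const FilterResults &f)
{
   const bool anyPassed = f.nominal || f.var0 || f.var1;
   WrittenColumns w{anyPassed, anyPassed, f.var0, f.var1};
   // Invariant: a valid varied column always has its unvaried partner available.
   assert(!w.x_var0 || w.y);
   assert(!w.y_var1 || w.x);
   return w;
}
```

For the entry in the table (only var0 passes), `writtenBefore` marks x and y as zeroed, which leaves the var0 universe without its y value, whereas `writtenAfter` keeps both nominal columns.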

The tests have been updated to reflect the fact that the nominal columns are not zeroed out.

@hageboeck hageboeck self-assigned this Feb 11, 2026

github-actions Bot commented Feb 11, 2026

Test Results

    22 files      22 suites   3d 9h 49m 23s ⏱️
 3 849 tests  3 847 ✅  1 💤 1 ❌
76 910 runs  76 891 ✅ 18 💤 1 ❌

For more details on these failures, see this check.

Results for commit d7414e6.

♻️ This comment has been updated with latest results.

@hageboeck hageboeck closed this Apr 21, 2026
@hageboeck hageboeck reopened this Apr 21, 2026
@hageboeck hageboeck force-pushed the RDF_snapshot_twoVariations branch from dab901e to e6d8632 Compare April 27, 2026 09:47
@hageboeck hageboeck marked this pull request as ready for review April 27, 2026 09:48
Before, the nominal columns were omitted from the output if their filter
didn't pass. This is, however, insufficient, because if at least two
variations are run, (some of) the nominal columns might still be in use.

The tests have been updated to reflect the fact that the nominal columns
are not zeroed out.
@hageboeck hageboeck force-pushed the RDF_snapshot_twoVariations branch 2 times, most recently from 545855a to 899088e Compare April 27, 2026 09:51

@vepadulano vepadulano left a comment

Thanks for the fix! This will enable more use cases for the Snapshot with variations 👍 Changes LGTM, I left a couple of comments in the test only

7 comment threads on tree/dataframe/test/dataframe_snapshotWithVariations.cxx (3 outdated)
@hageboeck hageboeck force-pushed the RDF_snapshot_twoVariations branch from 899088e to ce5b8f8 Compare April 27, 2026 14:15
When two variations are run, e.g. x_var and y_var, and the computation
graph uses both x and y, the pairs {x, y_var} and {x_var, y} need to be
visible in the output file. This means that the nominal columns always
need to be written, even if {x, y} would not pass the filter.
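
The column pairs named in this commit message can be enumerated with a short sketch. It assumes each variation replaces exactly one column and uses a `column_variation` naming shorthand; both `columnsPerUniverse` and `nominalColumnsReadByVariations` are hypothetical helpers written for illustration, not part of RDataFrame.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: for every universe, list the columns that a Filter or
// Snapshot on `usedColumns` reads. `variedBy` maps a variation name to the
// single column it replaces (an assumption of this sketch).
std::map<std::string, std::set<std::string>>
columnsPerUniverse(const std::vector<std::string> &usedColumns,
                   const std::map<std::string, std::string> &variedBy)
{
   assert(!usedColumns.empty());
   std::map<std::string, std::set<std::string>> result;
   result["nominal"] = std::set<std::string>(usedColumns.begin(), usedColumns.end());
   for (const auto &entry : variedBy) {
      const std::string &variation = entry.first;
      const std::string &variedColumn = entry.second;
      std::set<std::string> columns;
      for (const auto &col : usedColumns)
         columns.insert(col == variedColumn ? col + "_" + variation : col);
      result[variation] = columns;
   }
   return result;
}

// Nominal columns that at least one varied universe still reads unvaried.
std::set<std::string>
nominalColumnsReadByVariations(const std::vector<std::string> &usedColumns,
                               const std::map<std::string, std::string> &variedBy)
{
   std::set<std::string> needed;
   for (const auto &entry : variedBy)
      for (const auto &col : usedColumns)
         if (col != entry.second)
            needed.insert(col);
   return needed;
}
```

With used columns {x, y} and variations var0 on x and var1 on y, the varied universes read {x_var0, y} and {x, y_var1}, so both nominal columns are still needed even when the nominal pair {x, y} fails the filter.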
@hageboeck hageboeck force-pushed the RDF_snapshot_twoVariations branch from ce5b8f8 to d7414e6 Compare April 27, 2026 14:16
@hageboeck hageboeck merged commit 6ce7930 into root-project:master Apr 28, 2026
51 of 53 checks passed
@hageboeck hageboeck deleted the RDF_snapshot_twoVariations branch April 28, 2026 14:42
@hageboeck

/backport to 6.40

@root-project-bot

Preparing to backport PR #21246 to branch 6.40 requested by hageboeck

@root-project-bot

This PR has been backported to branch 6.40: #22089
