[SPARK-56694][SQL] Fix `DynamicPruningSubquery` canonicalization for `buildKeys` by peter-toth · Pull Request #55644 · apache/spark

peter-toth · 2026-05-01T12:00:59Z

What changes were proposed in this pull request?

DynamicPruningSubquery.canonicalized now normalizes buildKeys relative to buildQuery.output using QueryPlan.normalizeExpressions instead of calling .canonicalized on each key expression independently.

Why are the changes needed?

The previous implementation called buildKeys.map(_.canonicalized), which canonicalized each key expression in isolation and therefore preserved the original ExprId values of attribute references. When two DynamicPruningSubquery instances referenced the same logical build query (e.g. different copies of a CTE branch) but with different ExprIds, their canonical buildKeys differed even though the queries were semantically identical.

QueryPlan.normalizeExpressions(key, buildQuery.output) replaces each attribute reference with ExprId(ordinal) where ordinal is the attribute's position in buildQuery.output. Two copies of the same CTE branch will place the same attribute at the same ordinal, so the canonical buildKeys become identical regardless of the original ExprId values.
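
To make the difference concrete, here is a minimal, self-contained sketch of ordinal-based normalization. It is a toy model, not Spark code: `Expr`, `Attr`, `Add`, `canonicalizeInIsolation`, and `normalize` are hypothetical stand-ins for Catalyst's expression classes and for the behavior described above, and the `"none"` name placeholder is an assumption of the sketch.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
object OrdinalNormalizationSketch {
  sealed trait Expr
  final case class Attr(name: String, exprId: Long) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Canonicalizing a key in isolation: names are erased, but the
  // original ExprId of every attribute is preserved.
  def canonicalizeInIsolation(e: Expr): Expr = e match {
    case Attr(_, id) => Attr("none", id)
    case Add(l, r)   => Add(canonicalizeInIsolation(l), canonicalizeInIsolation(r))
  }

  // Normalizing relative to an output list: each attribute's ExprId is
  // replaced by its position (ordinal) in `output`. Attributes that are
  // not found in `output` keep their original id.
  def normalize(e: Expr, output: Seq[Attr]): Expr = e match {
    case Attr(_, id) =>
      val ordinal = output.indexWhere(_.exprId == id)
      if (ordinal == -1) Attr("none", id) else Attr("none", ordinal.toLong)
    case Add(l, r) => Add(normalize(l, output), normalize(r, output))
  }

  // Two copies of the same build query whose outputs carry fresh ExprIds.
  val outputA: Seq[Attr] = Seq(Attr("k", 10L), Attr("v", 11L))
  val outputB: Seq[Attr] = Seq(Attr("k", 20L), Attr("v", 21L))

  // The same build key expressed against each copy.
  val keyA: Expr = Add(outputA(0), outputA(1))
  val keyB: Expr = Add(outputB(0), outputB(1))
}
```

In this sketch, `canonicalizeInIsolation(keyA)` and `canonicalizeInIsolation(keyB)` differ because the ids 10/11 and 20/21 survive, while `normalize(keyA, outputA)` and `normalize(keyB, outputB)` are equal because both keys reduce to ordinals 0 and 1.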

Does this PR introduce any user-facing change?

No. This is an internal canonicalization fix. It may improve query plans by enabling PlanMerger to deduplicate more DynamicPruningSubquery expressions, but does not change observable query results.

How was this patch tested?

Added a unit test in DynamicPruningSubquerySuite that constructs two DynamicPruningSubquery instances with identical build query structure but fresh (distinct) ExprIds, and asserts that their canonicalized forms are equal.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

dongjoon-hyun

+1, LGTM. Thank you, @peter-toth !

peter-toth · 2026-05-01T16:58:36Z

-                                    :  +- ReusedExchange (114)
-                                    +- ReusedExchange (116)
+            :        :     +- BroadcastExchange (41)
+            :        :        +- * Project (40)


This golden plan change, in which the aggregate calculation moves into a subquery of Project (40), is unnecessary and must be related to some kind of PlanMerger issue.
Since both plans are semantically correct, let me investigate this plan change in a separate ticket and keep this PR as a canonicalization bugfix.

dongjoon-hyun · 2026-05-01T19:10:58Z

Merged to master for Apache Spark 4.2.0 as a kind of improvement.

However, feel free to make a backporting PR if you want to have this in old release branches, @peter-toth .

peter-toth · 2026-05-01T19:14:55Z

> Merged to master for Apache Spark 4.2.0 as a kind of improvement.

Thank you @dongjoon-hyun for the review.

dongjoon-hyun approved these changes May 1, 2026

regenerate golden files

fed2bea

dongjoon-hyun closed this in 9967250 May 1, 2026

peter-toth wants to merge 2 commits into apache:master from peter-toth:SPARK-56694-fix-dynamicpruningsubquery-canonicalization
