[SPARK-56694][SQL] Fix DynamicPruningSubquery canonicalization for buildKeys#55644
Closed
peter-toth wants to merge 2 commits intoapache:masterfrom
Closed
Conversation
…ildKeys ### What changes were proposed in this pull request? `DynamicPruningSubquery.canonicalized` now normalizes `buildKeys` relative to `buildQuery.output` using `QueryPlan.normalizeExpressions` instead of calling `.canonicalized` on each key expression independently. ### Why are the changes needed? The previous implementation called `buildKeys.map(_.canonicalized)`, which canonicalized each key expression in isolation and therefore preserved the original `ExprId` values of attribute references. When two `DynamicPruningSubquery` instances referenced the same logical build query (e.g. different copies of a CTE branch) but with different `ExprId`s, their canonical `buildKeys` differed even though the queries were semantically identical. `QueryPlan.normalizeExpressions(key, buildQuery.output)` replaces each attribute reference with `ExprId(ordinal)` where `ordinal` is the attribute's position in `buildQuery.output`. Two copies of the same CTE branch will place the same attribute at the same ordinal, so the canonical `buildKeys` become identical regardless of the original `ExprId` values. ### Does this PR introduce _any_ user-facing change? No. This is an internal canonicalization fix. It may improve query plans by enabling `PlanMerger` to deduplicate more `DynamicPruningSubquery` expressions, but does not change observable query results. ### How was this patch tested? Added a unit test in `DynamicPruningSubquerySuite` that constructs two `DynamicPruningSubquery` instances with identical build query structure but fresh (distinct) `ExprId`s, and asserts that their `canonicalized` forms are equal. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Sonnet 4.6
dongjoon-hyun
approved these changes
May 1, 2026
Member
dongjoon-hyun
left a comment
There was a problem hiding this comment.
+1, LGTM. Thank you, @peter-toth !
peter-toth
commented
May 1, 2026
| : +- ReusedExchange (114) | ||
| +- ReusedExchange (116) | ||
| : : +- BroadcastExchange (41) | ||
| : : +- * Project (40) |
Contributor
Author
There was a problem hiding this comment.
This change in the golden plan that the aggregate calculation is moved to a subquery of Project (40) is unnecessary and must be related to some kind of PlanMerger issue.
Since both plans are semantically correct, let me investigate this plan change in a separate ticket and keep this PR as a canonicalization bugfix.
Member
|
Merged to master for Apache Spark 4.2.0 as a kind of improvement. However, feel free to make a backporting PR if you want to have this in old release branches, @peter-toth . |
Contributor
Author
Thank you @dongjoon-hyun for the review. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
DynamicPruningSubquery.canonicalizednow normalizesbuildKeysrelative tobuildQuery.outputusingQueryPlan.normalizeExpressionsinstead of calling.canonicalizedon each key expression independently.Why are the changes needed?
The previous implementation called
buildKeys.map(_.canonicalized), which canonicalized each key expression in isolation and therefore preserved the originalExprIdvalues of attribute references. When twoDynamicPruningSubqueryinstances referenced the same logical build query (e.g. different copies of a CTE branch) but with differentExprIds, their canonicalbuildKeysdiffered even though the queries were semantically identical.QueryPlan.normalizeExpressions(key, buildQuery.output)replaces each attribute reference withExprId(ordinal)whereordinalis the attribute's position inbuildQuery.output. Two copies of the same CTE branch will place the same attribute at the same ordinal, so the canonicalbuildKeysbecome identical regardless of the originalExprIdvalues.Does this PR introduce any user-facing change?
No. This is an internal canonicalization fix. It may improve query plans by enabling
PlanMergerto deduplicate moreDynamicPruningSubqueryexpressions, but does not change observable query results.How was this patch tested?
Added a unit test in
DynamicPruningSubquerySuitethat constructs twoDynamicPruningSubqueryinstances with identical build query structure but fresh (distinct)ExprIds, and asserts that theircanonicalizedforms are equal.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.6