[SPARK-57499][SQL] Variant extraction pushdown bypasses column pruning on DSv2 scans #56556
Open
qlong wants to merge 1 commit into
What changes were proposed in this pull request?
Three fixes in `pushVariantExtractions` (called by `V2ScanRelationPushDown.pushDownVariants`):

1. **Guard against double-visit**: Add a `pushedVariants.isEmpty` sentinel check so the inner `ScanBuilderHolder` leaf visit (caused by `transformDown` recursing into the child after returning the plan unchanged) returns immediately. This ensures `builder.pushVariantExtractions` is called exactly once per holder.
2. **Eager column pruning**: While `projectList` and `filters` are in scope, call `builder.pruneColumns(requiredSchema)` for builders implementing `SupportsPushDownRequiredColumns` and trim `sHolder.output` to the required columns. By the time `buildScanWithPushedVariants` calls `build()`, the builder already has the correct pruned schema. This is similar to how `buildScanWithPushedAggregate` works.
3. **Keep join-condition variant columns raw**: Fixing the double-visit exposed a latent crash. The rewrite is local (it only sees the Project/Filter above the scan, not the enclosing Join), so it cannot rewrite a join condition that reads a variant via `variant_get`. Shredding such a column would re-expose it as a `GetStructField` aliased to the original ExprId; `RemoveRedundantAliases` later collapses that alias and breaks the join condition, failing plan validation. Drop join-condition variant columns (referenced directly or via a local alias like `v AS vw`) from the variant mapping so they stay raw; `pruneColumns` still removes unreferenced siblings.

Jira: https://issues.apache.org/jira/browse/SPARK-57499
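The join-condition exclusion in the third fix can be pictured as a small set computation. The following is a minimal sketch only, not the code in this PR: `columnsToShred`, its parameters, and the use of plain strings for column names are all hypothetical stand-ins chosen for illustration.

```scala
object JoinConditionSketch {
  // Hypothetical model: columns are plain names, not Catalyst attributes.
  //   variantColumns    - variant columns that are candidates for shredding
  //   localAliases      - aliases introduced by the local Project (alias -> source column)
  //   joinConditionRefs - names read by the enclosing join condition
  def columnsToShred(
      variantColumns: Set[String],
      localAliases: Map[String, String],
      joinConditionRefs: Set[String]): Set[String] = {
    // Resolve each join-condition reference through a local alias such as `v AS vw`.
    val mustStayRaw = joinConditionRefs.map(ref => localAliases.getOrElse(ref, ref))
    // Anything the join condition reads is dropped from the variant mapping.
    variantColumns -- mustStayRaw
  }
}
```

With `v` aliased as `vw` and the join condition reading `vw`, only the remaining variant columns are left in the mapping, so `v` is scanned raw.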
Why are the changes needed?
Two bugs on the accepted variant pushdown path:
Issue 1: column pruning is skipped. `buildScanWithPushedVariants` calls `builder.build()` and replaces the `ScanBuilderHolder` with a `DataSourceV2ScanRelation`. The subsequent `pruneColumns` rule matches only `ScanBuilderHolder` nodes, so it is a no-op and `builder.pruneColumns()` is never called. The scan reads the full table schema, including unreferenced columns. For unreferenced `VARIANT` columns this is especially costly: each is fully reconstructed from its shredded Parquet tree on every row.
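A toy model may help show why the later rule becomes a no-op. This is a sketch under stated assumptions: `Holder`, `ScanRelation`, `Builder`, `pruneRule`, and `buildScan` are hypothetical simplifications, not Spark's classes, and the column names are made up.

```scala
object PruningBypassSketch {
  // Hypothetical stand-ins for plan nodes; not Spark's real classes.
  sealed trait Plan
  final case class Holder(builder: Builder) extends Plan
  final case class ScanRelation(columns: Seq[String]) extends Plan

  final class Builder(val fullSchema: Seq[String]) {
    var readSchema: Seq[String] = fullSchema
    def pruneColumns(required: Seq[String]): Unit =
      readSchema = fullSchema.filter(c => required.contains(c))
  }

  // Models a pruning rule that matches only Holder nodes.
  def pruneRule(plan: Plan, required: Seq[String]): Plan = plan match {
    case h @ Holder(b) =>
      b.pruneColumns(required)
      h
    case other => other // an already-built ScanRelation is left untouched
  }

  // Models building the scan early; eagerPrune = true mirrors the proposed fix.
  def buildScan(h: Holder, required: Seq[String], eagerPrune: Boolean): Plan = {
    if (eagerPrune) h.builder.pruneColumns(required)
    ScanRelation(h.builder.readSchema)
  }

  // Returns the columns the final scan would read.
  def run(eagerPrune: Boolean): Seq[String] = {
    val holder = Holder(new Builder(Seq("id", "v", "unused_variant")))
    val required = Seq("id", "v")
    val built = buildScan(holder, required, eagerPrune)
    pruneRule(built, required) match {
      case ScanRelation(cols) => cols
      case Holder(b)          => b.readSchema
    }
  }
}
```

In this model, `run(eagerPrune = false)` still reads `unused_variant` because the pruning rule never sees a `Holder`, while `run(eagerPrune = true)` reads only `id` and `v`.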
Issue 2: invalid plan on native Parquet V2. `pushDownVariants` uses `transformDown`, which recurses into the child `ScanBuilderHolder` after returning the plan unchanged. The bare `ScanBuilderHolder` matches `PhysicalOperation` a second time, collecting unreferenced sibling `VARIANT` columns as full-variant requests and pushing them to the builder again. `ParquetScanBuilder` overwrites its state on every call, so the second push clobbers the correct extraction from the first, producing a dangling `ExprId` in the projection. This causes a runtime failure.
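The double visit and the sentinel guard can be reproduced with a toy pre-order traversal. Everything here is an illustrative assumption: `Project`, `Holder`, `ToyBuilder`, the local `transformDown`, and the request strings are hypothetical, and the real rule matches through `PhysicalOperation` rather than these patterns.

```scala
object DoubleVisitSketch {
  // Hypothetical plan nodes; not Catalyst classes.
  sealed trait Node
  final case class Project(child: Node) extends Node
  final class Holder(val builder: ToyBuilder) extends Node {
    var pushedVariants: Seq[String] = Seq.empty
  }

  final class ToyBuilder {
    var pushed: Seq[String] = Seq.empty
    var calls: Int = 0
    // Overwrites earlier state on every call, as described for ParquetScanBuilder.
    def pushVariantExtractions(reqs: Seq[String]): Unit = { pushed = reqs; calls += 1 }
  }

  // Pre-order traversal: apply the rule to a node, then recurse into the result's child.
  def transformDown(n: Node)(rule: PartialFunction[Node, Node]): Node =
    rule.applyOrElse(n, identity[Node]) match {
      case Project(c) => Project(transformDown(c)(rule))
      case leaf       => leaf
    }

  private def push(h: Holder, reqs: Seq[String], guard: Boolean): Unit =
    if (!guard || h.pushedVariants.isEmpty) { // sentinel: skip a holder already handled
      h.builder.pushVariantExtractions(reqs)
      h.pushedVariants = reqs
    }

  def pushRule(guard: Boolean): PartialFunction[Node, Node] = {
    case p @ Project(h: Holder) =>
      push(h, Seq("variant_get(v, '$.a')"), guard)
      p // plan returned unchanged, so the traversal still descends into h
    case h: Holder =>
      // Bare holder: no projection in scope, so the sibling is requested in full.
      push(h, Seq("v2 (full variant)"), guard)
      h
  }

  // Returns (number of builder calls, final pushed state).
  def run(guard: Boolean): (Int, Seq[String]) = {
    val b = new ToyBuilder
    transformDown(Project(new Holder(b)))(pushRule(guard))
    (b.calls, b.pushed)
  }
}
```

Without the guard the builder is called twice and ends up holding only the second, incorrect request; with the guard it is called once and keeps the first extraction.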
Both issues affect DSv2 sources implementing `SupportsPushDownVariantExtractions` and are gated on accepted variant pushdown. When pushdown is declined or disabled, the `ScanBuilderHolder` survives and `pruneColumns` runs normally.

Does this PR introduce any user-facing change?
No
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
Co-authored with Claude code (Sonnet 4.6)