[SPARK-57499][SQL] Variant extraction pushdown bypasses column pruning on DSvs2 scans #56556

Open
qlong wants to merge 1 commit into apache:master from qlong:SPARK-57499-variant-pushdown-column-pruning

qlong (Contributor) commented Jun 17, 2026

What changes were proposed in this pull request?

Three fixes in pushVariantExtractions (called by V2ScanRelationPushDown.pushDownVariants):

  1. Guard against double-visit: Add a pushedVariants.isEmpty
    sentinel check so the inner ScanBuilderHolder leaf visit (caused by
    transformDown recursing into the child after returning the plan
    unchanged) returns immediately. This ensures
    builder.pushVariantExtractions is called exactly once per holder.
  2. Eager column pruning: While projectList and filters are in
    scope, call builder.pruneColumns(requiredSchema) for builders
    implementing SupportsPushDownRequiredColumns and trim
    sHolder.output to the required columns. By the time
    buildScanWithPushedVariants calls build(), the builder already
    has the correct pruned schema. This is similar to how
    buildScanWithPushedAggregate works.
  3. Keep join-condition variant columns raw: Fixing the double-visit
    exposed a latent crash. The rewrite is local (it only sees the
    Project/Filter above the scan, not the enclosing Join), so it cannot
    rewrite a join condition that reads a variant via variant_get.
    Shredding such a column would re-expose it as a GetStructField
    aliased to the original ExprId; RemoveRedundantAliases later
    collapses that alias and breaks the join condition, failing plan
    validation. Drop join-condition variant columns (referenced directly
    or via a local alias like v AS vw) from the variant mapping so they
    stay raw; pruneColumns still removes unreferenced siblings.
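
The third fix can be pictured as a filter over the candidate variant mapping. The following is a minimal sketch, not the code in this PR: `keepJoinColumnsRaw`, the string-keyed maps, and the column names are hypothetical stand-ins for the real attribute-based structures.

```scala
object JoinConditionDemo {
  // Hypothetical model: column names stand in for Spark attributes.
  // variantMapping: variant column -> extraction paths that would be pushed down
  // localAliases:   alias -> underlying column (for example "vw" -> "v")
  // joinRefs:       names referenced by the enclosing join condition
  def keepJoinColumnsRaw(
      variantMapping: Map[String, Seq[String]],
      localAliases: Map[String, String],
      joinRefs: Set[String]): Map[String, Seq[String]] = {
    // Resolve aliases so that a reference to "vw" also protects "v".
    val protectedColumns = joinRefs.map(ref => localAliases.getOrElse(ref, ref))
    // Columns dropped from the mapping are not shredded and stay raw.
    variantMapping.filter { case (column, _) => !protectedColumns.contains(column) }
  }
}
```

With a mapping for `v` and `w`, an alias `vw` for `v`, and a join condition that reads `vw`, only `w` remains eligible for extraction pushdown in this model.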

Jira: https://issues.apache.org/jira/browse/SPARK-57499

Why are the changes needed?

Two bugs on the accepted variant pushdown path:

Issue 1 — column pruning is skipped. buildScanWithPushedVariants calls
builder.build() and replaces the ScanBuilderHolder with a DataSourceV2ScanRelation.
The subsequent pruneColumns rule matches only ScanBuilderHolder nodes, so it is a
no-op and builder.pruneColumns() is never called. The scan reads the full table schema
including unreferenced columns. For unreferenced VARIANT columns this is especially
costly — each is fully reconstructed from its shredded Parquet tree on every row.
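
The ordering problem in Issue 1, and the eager-pruning fix, can be reproduced in a toy model. This is a sketch under stated assumptions: `ToyBuilder`, `Holder`, `ScanRelation`, and both rule functions are hypothetical simplifications, not Spark classes.

```scala
object EagerPruningDemo {
  // Hypothetical builder: reads the full schema unless pruneColumns is called.
  final class ToyBuilder(fullSchema: Seq[String]) {
    private var readSchema: Seq[String] = fullSchema
    def pruneColumns(required: Seq[String]): Unit = {
      readSchema = fullSchema.filter(required.contains)
    }
    def build(): Seq[String] = readSchema
  }

  sealed trait Plan
  final case class Holder(builder: ToyBuilder) extends Plan
  final case class ScanRelation(readSchema: Seq[String]) extends Plan

  // Variant pushdown builds the scan, replacing the holder.
  // With eager = true it prunes first, which models the fix.
  def pushVariants(required: Seq[String], eager: Boolean)(plan: Plan): Plan = plan match {
    case Holder(builder) =>
      if (eager) builder.pruneColumns(required)
      ScanRelation(builder.build())
    case other => other
  }

  // The later pruning rule matches holders only, so it is a no-op
  // once the holder has already been replaced by a scan relation.
  def pruneRule(required: Seq[String])(plan: Plan): Plan = plan match {
    case Holder(builder) =>
      builder.pruneColumns(required)
      ScanRelation(builder.build())
    case other => other
  }
}
```

In this model, running `pushVariants` without eager pruning and then `pruneRule` leaves the full schema in the scan, while the eager variant yields only the required columns.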

Issue 2 — invalid plan on native Parquet V2. pushDownVariants uses transformDown,
which recurses into the child ScanBuilderHolder after returning the plan unchanged. The
bare ScanBuilderHolder matches PhysicalOperation a second time, collecting unreferenced
sibling VARIANT columns as full-variant requests and pushing them to the builder again.
ParquetScanBuilder overwrites its state on every call, so the second push clobbers the
correct extraction from the first, producing a dangling ExprId in the projection:

  !Project [variant_get(v1#57, $.x) ...]    -- stale ExprId, marked invalid
  +- BatchScan parquet [a#66, v1#67, v2#68]
     PushedVariantExtractions: [v2:"$"]     -- sibling variant pushed, not v1:$.x

This causes a runtime failure:

  [INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND] Could not find v1#57 in [a#72,v1#73,v2#74]
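
The double visit can be illustrated with a toy pre-order traversal. This is a minimal sketch assuming a simplified plan tree: `Project`, `Holder`, `ToyBuilder`, and `pushRule` are hypothetical, and the guard checks the builder's own state rather than the real `pushedVariants` field.

```scala
object DoubleVisitDemo {
  sealed trait Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan
  final case class Holder(builder: ToyBuilder) extends Plan

  // Hypothetical builder that overwrites its state on every push.
  final class ToyBuilder {
    var pushed: Seq[String] = Seq.empty
    var pushCalls: Int = 0
    def pushVariantExtractions(extractions: Seq[String]): Unit = {
      pushed = extractions
      pushCalls += 1
    }
  }

  // Pre-order traversal: apply the rule to the node, then recurse into
  // the children of whatever the rule returned.
  def transformDown(plan: Plan)(rule: PartialFunction[Plan, Plan]): Plan =
    rule.applyOrElse(plan, identity[Plan]) match {
      case Project(columns, child) => Project(columns, transformDown(child)(rule))
      case leaf => leaf
    }

  def pushRule(guarded: Boolean): PartialFunction[Plan, Plan] = {
    case p @ Project(columns, Holder(builder)) =>
      if (!guarded || builder.pushed.isEmpty) builder.pushVariantExtractions(columns)
      p // returned unchanged, so the traversal continues into the holder
    case h @ Holder(builder) =>
      // Bare holder: no projection is in scope, so a sibling variant
      // column is requested as a full variant.
      if (!guarded || builder.pushed.isEmpty) builder.pushVariantExtractions(Seq("v2:$"))
      h
  }
}
```

Without the guard, the second push replaces `v1:$.x` with `v2:$`, mirroring the plan shown above; with the guard, the builder is pushed to exactly once.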

Both issues affect DSv2 sources implementing SupportsPushDownVariantExtractions and
are gated on accepted variant pushdown. When pushdown is declined or disabled the
ScanBuilderHolder survives and pruneColumns runs normally.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Added new unit tests
  • Manual testing with spark-sql

Was this patch authored or co-authored using generative AI tooling?

Co-authored with Claude Code (Sonnet 4.6)

@qlong qlong changed the title [SPARK-57499][SQL] Variant extraction pushdown bypasses column pruning on DSvs scans [SPARK-57499][SQL] Variant extraction pushdown bypasses column pruning on DSvs2 scans Jun 17, 2026
…g on DSv2 scans

Three fixes in `pushVariantExtractions` (called by
`V2ScanRelationPushDown.pushDownVariants`):

1. **Guard against double-visit**: Add a `pushedVariants.isEmpty`
   sentinel check so the inner `ScanBuilderHolder` leaf visit (caused by
   `transformDown` recursing into the child after returning the plan
   unchanged) returns immediately. This ensures
   `builder.pushVariantExtractions` is called exactly once per holder.

2. **Eager column pruning**: While `projectList` and `filters` are in
   scope, call `builder.pruneColumns(requiredSchema)` for builders
   implementing `SupportsPushDownRequiredColumns` and trim
   `sHolder.output` to the required columns. By the time
   `buildScanWithPushedVariants` calls `build()`, the builder already
   has the correct pruned schema. This is similar to how
   `buildScanWithPushedAggregate` works.

3. **Keep join-condition variant columns raw**: Fixing the double-visit
   exposed a latent crash. The rewrite is local (it only sees the
   Project/Filter above the scan, not the enclosing Join), so it cannot
   rewrite a join condition that reads a variant via `variant_get`.
   Shredding such a column would re-expose it as a `GetStructField`
   aliased to the original ExprId; `RemoveRedundantAliases` later
   collapses that alias and breaks the join condition, failing plan
   validation. Drop join-condition variant columns (referenced directly
   or via a local alias like `v AS vw`) from the variant mapping so they
   stay raw; `pruneColumns` still removes unreferenced siblings.

Jira: https://issues.apache.org/jira/browse/SPARK-57499
@qlong qlong force-pushed the SPARK-57499-variant-pushdown-column-pruning branch from 5abbc88 to 34f3787 Compare June 17, 2026 16:16