Skip to content

feat: type-keyed extensions map for PartitionedFile#21993

Open
adriangb wants to merge 7 commits intoapache:mainfrom
pydantic:worktree-partitioned-file-extensions-map
Open

feat: type-keyed extensions map for PartitionedFile#21993
adriangb wants to merge 7 commits intoapache:mainfrom
pydantic:worktree-partitioned-file-extensions-map

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb adriangb commented May 2, 2026

Which issue does this PR close?

  • Closes #.

(no tracking issue yet; happy to file one if this direction is wanted)

Rationale for this change

PartitionedFile.extensions is currently a single Option<Arc<dyn Any + Send + Sync>> slot. That means two independent components cannot both attach their own per-file data without colliding — only one of ParquetAccessPlan, a custom index entry, a custom reader payload, etc. can be present at a time. Anyone wanting more than one piece of metadata has to define a wrapper struct and have every consumer agree on it, which is not possible because consumers are hard coded in DataFusion's code.

While exploring this, I noticed DataFusion already has the same HashMap<TypeId, Arc<dyn Any + Send + Sync>> pattern in two other places:

  • datafusion/execution/src/config.rs — backs SessionConfig.extensions, with a clever IdHasher that avoids rehashing TypeIds.
  • datafusion/physical-plan/src/operator_statistics/mod.rs — backs ExtendedStatistics.extensions.

Rather than add a third copy, this PR lifts the pattern into datafusion-common as datafusion_common::extensions::Extensions (preserving the IdHasher optimization) and migrates all three consumers to it.

What changes are included in this PR?

This PR is split into two stacked commits:

Commit 1 — feat: type-keyed extensions map for PartitionedFile

  • New datafusion_common::extensions::Extensions (TypeId-keyed HashMap with the IdHasher lifted from SessionConfig). API: insert / insert_arc / get / get_arc / contains / remove / merge, all generic over T.
  • PartitionedFile.extensions changes from Option<Arc<dyn Any + Send + Sync>> to FileExtensions (re-export of Extensions).
  • New typed builder methods: with_extension<T>(T) and with_extension_arc<T>(Arc<T>) replace with_extensions(Arc<dyn Any + Send + Sync>). New accessor extension::<T>() avoids manual downcasts.
  • Parquet opener migrated: create_initial_plan now takes &FileExtensions and uses get::<ParquetAccessPlan>() directly instead of doing the downcast inline.
  • In-tree call sites migrated: parquet/custom_reader.rs test, parquet/external_access_plan.rs test, and the parquet_advanced_index.rs example.
  • Upgrade guide entry added in docs/source/library-user-guide/upgrading/54.0.0.md.

Commit 2 — refactor: migrate SessionConfig and ExtendedStatistics to shared Extensions

  • SessionConfig.extensions switches from the inline AnyMap / IdHasher to Extensions.
  • ExtendedStatistics.extensions switches from inline HashMap<TypeId, Arc<dyn Any + Send + Sync>> to Extensions.
  • Both keep their existing public APIs (with_extension / set_extension / get_extension / has_extension / merge_extensions) — this commit is a pure refactor with no user-visible change.
  • Deletes the duplicated AnyMap type alias and IdHasher impl from datafusion/execution/src/config.rs.

Are these changes tested?

  • New unit tests for Extensions (insert/get/remove/replace, merge precedence) in datafusion-common.
  • Existing parquet tests that exercised the single-slot API (parquet::custom_reader::route_data_access_ops_to_parquet_file_reader_factory, all 10 parquet::external_access_plan::*) pass after migration.
  • cargo test -p datafusion-execution -p datafusion-physical-plan — all passing (1397 + 63 + 14 + 10 unit/doc tests).
  • Full workspace cargo check --workspace --all-targets clean.
  • ./dev/rust_lint.sh clean.

Are there any user-facing changes?

Yes — breaking public-API change on PartitionedFile (commit 1):

-let pf = PartitionedFile::new(path, size)
-    .with_extensions(Arc::new(access_plan));
+let pf = PartitionedFile::new(path, size)
+    .with_extension(access_plan);
-let plan = pf.extensions
-    .as_ref()
-    .and_then(|ext| ext.downcast_ref::<ParquetAccessPlan>());
+let plan = pf.extension::<ParquetAccessPlan>();

The extensions field's public type also changes (Option<Arc<dyn Any>>FileExtensions). Upgrade guide entry added.

Commit 2 has no user-visible API changes — SessionConfig and ExtendedStatistics keep their existing extension method signatures.

Open questions

  • Naming: Extensions vs AnyMap vs TypeMap. I went with Extensions to match the existing field names; open to bikeshedding.

🤖 Generated with Claude Code

PartitionedFile.extensions previously held a single
Option<Arc<dyn Any + Send + Sync>>, so two independent components could
not both attach data to the same file without colliding. This commit
introduces a shared TypeId-keyed Extensions map in datafusion-common and
uses it for PartitionedFile so each Rust type occupies its own slot.

The shared datafusion_common::extensions::Extensions type also gives us
a single home for this pattern; SessionConfig.extensions and
ExtendedStatistics.extensions can be migrated to it in follow-ups
(both are TypeId-keyed maps duplicating the same shape today).

Public API changes on PartitionedFile:
- with_extensions(Arc<dyn Any + Send + Sync>) → with_extension<T>(T) and
  with_extension_arc<T>(Arc<T>)
- extensions field type Option<Arc<dyn Any + Send + Sync>> → FileExtensions
- new extension::<T>() accessor avoids manual downcasts

Migrated in-tree call sites (parquet opener, custom_reader test,
external_access_plan test, parquet_advanced_index example) and added an
upgrade guide entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added documentation Improvements or additions to documentation core Core DataFusion crate common Related to common crate datasource Changes to the datasource crate labels May 2, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 2, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v53.1.0 (current)
       Built [  81.135s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.034s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  80.601s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.033s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.617s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 164.550s] datafusion
    Building datafusion-common v53.1.0 (current)
       Built [  32.299s] (current)
     Parsing datafusion-common v53.1.0 (current)
      Parsed [   0.056s] (current)
    Building datafusion-common v53.1.0 (baseline)
       Built [  32.053s] (baseline)
     Parsing datafusion-common v53.1.0 (baseline)
      Parsed [   0.056s] (baseline)
    Checking datafusion-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.630s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  66.232s] datafusion-common
    Building datafusion-datasource v53.1.0 (current)
       Built [  35.343s] (current)
     Parsing datafusion-datasource v53.1.0 (current)
      Parsed [   0.028s] (current)
    Building datafusion-datasource v53.1.0 (baseline)
       Built [  35.290s] (baseline)
     Parsing datafusion-datasource v53.1.0 (baseline)
      Parsed [   0.030s] (baseline)
    Checking datafusion-datasource v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.254s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure type_method_marked_deprecated: type method #[deprecated] added ---

Description:
A type method is now #[deprecated]. Downstream crates will get a compiler warning when using this method.
        ref: https://doc.rust-lang.org/reference/attributes/diagnostics.html#the-deprecated-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/type_method_marked_deprecated.ron

Failed in:
  method datafusion_datasource::PartitionedFile::with_extensions in /home/runner/work/datafusion/datafusion/datafusion/datasource/src/mod.rs:295

     Summary semver requires new minor version: 0 major and 1 minor checks failed
    Finished [  72.397s] datafusion-datasource
    Building datafusion-datasource-parquet v53.1.0 (current)
       Built [  41.438s] (current)
     Parsing datafusion-datasource-parquet v53.1.0 (current)
      Parsed [   0.024s] (current)
    Building datafusion-datasource-parquet v53.1.0 (baseline)
       Built [  42.199s] (baseline)
     Parsing datafusion-datasource-parquet v53.1.0 (baseline)
      Parsed [   0.025s] (baseline)
    Checking datafusion-datasource-parquet v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.134s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  86.043s] datafusion-datasource-parquet
    Building datafusion-execution v53.1.0 (current)
       Built [  29.294s] (current)
     Parsing datafusion-execution v53.1.0 (current)
      Parsed [   0.024s] (current)
    Building datafusion-execution v53.1.0 (baseline)
       Built [  29.452s] (baseline)
     Parsing datafusion-execution v53.1.0 (baseline)
      Parsed [   0.025s] (baseline)
    Checking datafusion-execution v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.196s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  60.707s] datafusion-execution
    Building datafusion-physical-plan v53.1.0 (current)
       Built [  32.381s] (current)
     Parsing datafusion-physical-plan v53.1.0 (current)
      Parsed [   0.121s] (current)
    Building datafusion-physical-plan v53.1.0 (baseline)
       Built [  32.748s] (baseline)
     Parsing datafusion-physical-plan v53.1.0 (baseline)
      Parsed [   0.121s] (baseline)
    Checking datafusion-physical-plan v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.578s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  67.499s] datafusion-physical-plan

…nsions

Both SessionConfig.extensions and ExtendedStatistics.extensions were
already TypeId-keyed Arc-of-Any maps, duplicating the type added in the
previous commit for PartitionedFile.extensions. Replace both with the
shared datafusion_common::extensions::Extensions, deleting the inline
AnyMap type alias and IdHasher (the IdHasher optimization is preserved
inside Extensions).

No public API changes:
- SessionConfig::with_extension / set_extension / get_extension keep
  the same signatures.
- ExtendedStatistics::get_extension / set_extension / has_extension /
  merge_extensions keep the same signatures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added execution Related to the execution crate physical-plan Changes to the physical-plan crate labels May 2, 2026
- Collapse Extensions::insert / insert_arc into a single insert<T>(Arc<T>).
  Values are always Arc'd internally (the map needs Clone, dyn Any isn't),
  but the call site signature is now consistent across all consumers
  (PartitionedFile, SessionConfig, ExtendedStatistics) — caller wraps
  in Arc explicitly.
- Drop PartitionedFile::with_extension_arc; the single with_extension<T>
  takes Arc<T> directly.
- Change ExtendedStatistics::set_extension<T>(value: T) to take Arc<T>
  for the same reason.
- Re-add PartitionedFile::with_extensions(Arc<dyn Any + Send + Sync>) as
  a #[deprecated] shim that keys by the value's dynamic TypeId, restoring
  SemVer compatibility for code targeting 53.x. Backed by a new
  Extensions::insert_dyn(Arc<dyn Any + Send + Sync>) helper.
- Update parquet call sites (test, example, docstring) to wrap in Arc.
- Tweak the upgrade guide entry to reflect the deprecation and final API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb marked this pull request as ready for review May 3, 2026 13:47
@adriangb
Copy link
Copy Markdown
Contributor Author

adriangb commented May 3, 2026

@kosiew mind reviewing this change please?

@adriangb adriangb requested a review from kosiew May 3, 2026 13:49
Copy link
Copy Markdown
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adriangb
Thanks for the refactor. I think there is one API compatibility issue that should be addressed before this lands. I also left one test suggestion that would help cover the main end-to-end invariant.

/// Set a custom statistics extension.
pub fn set_extension<T: 'static + Send + Sync>(&mut self, value: T) {
self.extensions.insert(TypeId::of::<T>(), Arc::new(value));
pub fn set_extension<T: 'static + Send + Sync>(&mut self, value: Arc<T>) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExtendedStatistics::set_extension now asks callers to allocate and pass an Arc<T>, while the existing public API accepted T and wrapped it internally. That makes this a source-breaking API change, even though the refactor is mainly about sharing the backing map.

The documented example nearby still shows stats.set_extension(HistogramStats { ... }), so existing users following that pattern will no longer compile. Could we keep the existing method shape, for example set_extension<T>(value: T) { self.extensions.insert(Arc::new(value)); }, and add a separate set_extension_arc only if callers need to avoid the allocation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -71,7 +71,7 @@ async fn route_data_access_ops_to_parquet_file_reader_factory() {
.into_iter()
.map(|meta| {
PartitionedFile::new_from_meta(meta)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to add one integration regression test that attaches both a ParquetAccessPlan and a custom reader payload to the same PartitionedFile. The test could then assert that the custom reader still sees its payload and that the parquet access plan is honored.

The current tests cover the generic map and the two consumers separately, but not quite the end-to-end invariant this PR is trying to protect.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adriangb and others added 2 commits May 4, 2026 08:29
…ariants

Restores `ExtendedStatistics::set_extension<T>(value: T)` (was source-breaking
in this PR's previous revision when it switched to `Arc<T>`) and applies the
same pattern to `PartitionedFile::with_extension`. Callers who already have
an `Arc<T>` and want to skip the extra allocation can use the new
`set_extension_arc` / `with_extension_arc` variants. `Extensions` itself
gains the same split: `insert(T)` + `insert_arc(Arc<T>)`.

A single method that accepts both `T` and `Arc<T>` is not expressible in
stable Rust — both `impl Into<Arc<T>>` and a custom trait with two blanket
impls leave the compiler unable to choose between `T = X` and `T = Arc<X>`
when given an `Arc<X>`, and resolving that requires specialization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Attaches both a custom reader payload (a `String` checked inside
`InMemoryParquetFileReaderFactory`) and a `ParquetAccessPlan` (skip
the first of two row groups) to the same `PartitionedFile`, then
asserts that (a) the factory still sees its payload and (b) only the
second row group's rows come through. Either consumer overwriting the
other's slot would break this end-to-end — the per-consumer tests
elsewhere catch the slots in isolation but not the coexistence
invariant this PR is meant to guarantee.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb
Copy link
Copy Markdown
Contributor Author

adriangb commented May 4, 2026

Thanks @kosiew ! Addressed both points.

- Remove `Extensions::remove`: no callers anywhere in the tree.
- Remove `ExtendedStatistics::set_extension_arc` and
  `PartitionedFile::with_extension_arc`: speculative `_arc` variants the
  reviewer suggested adding only when callers actually need to skip the
  Arc allocation. None do today.
- Mark `Extensions::insert_dyn` as `#[deprecated]`: it exists solely to
  back the deprecated `PartitionedFile::with_extensions(Arc<dyn Any>)`
  shim. Marking it deprecated discourages new callers and lets it leave
  with the shim later.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the auto detected api change Auto detected API change label May 4, 2026
@adriangb adriangb requested a review from kosiew May 4, 2026 14:16
DataFusion's clippy config enables `clippy::allow-attributes`, which
mandates `#[expect(...)]` (so the suppression is automatically reported
when no longer needed) instead of `#[allow(...)]`. Local
`cargo clippy --all-targets --all-features -- -D warnings` did not
catch this; only CI's `./dev/rust_lint.sh` lint pipeline does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rluvaton
Copy link
Copy Markdown
Member

rluvaton commented May 4, 2026

Great to see the automatic label is working, so we wont miss API breaking changes now in the release docs

Copy link
Copy Markdown
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adriangb
👍 The API shape and coexistence regression test both look addressed.

I found one doc example issue that could lead users to accidentally store the access plan under the wrong extension type.

/// // provide the plan as extension to the FileScanConfig
/// let partitioned_file = PartitionedFile::new("my_file.parquet", 1234)
/// .with_extensions(Arc::new(access_plan));
/// .with_extension(Arc::new(access_plan));
Copy link
Copy Markdown
Contributor

@kosiew kosiew May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this example still needs a small adjustment. Passing Arc::new(access_plan) means PartitionedFile::with_extension<T>(value: T) stores the extension as Arc<ParquetAccessPlan>.

Later, create_initial_plan looks up extensions.get::<ParquetAccessPlan>(), so the access plan from this example would be ignored and all row groups may be scanned.

I think we should pass the concrete value instead:

.with_extension(access_plan);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation execution Related to the execution crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants