feat: type-keyed extensions map for PartitionedFile#21993
feat: type-keyed extensions map for PartitionedFile#21993adriangb wants to merge 7 commits intoapache:mainfrom
Conversation
PartitionedFile.extensions previously held a single Option<Arc<dyn Any + Send + Sync>>, so two independent components could not both attach data to the same file without colliding. This commit introduces a shared TypeId-keyed Extensions map in datafusion-common and uses it for PartitionedFile so each Rust type occupies its own slot. The shared datafusion_common::extensions::Extensions type also gives us a single home for this pattern; SessionConfig.extensions and ExtendedStatistics.extensions can be migrated to it in follow-ups (both are TypeId-keyed maps duplicating the same shape today). Public API changes on PartitionedFile: - with_extensions(Arc<dyn Any + Send + Sync>) → with_extension<T>(T) and with_extension_arc<T>(Arc<T>) - extensions field type Option<Arc<dyn Any + Send + Sync>> → FileExtensions - new extension::<T>() accessor avoids manual downcasts Migrated in-tree call sites (parquet opener, custom_reader test, external_access_plan test, parquet_advanced_index example) and added an upgrade guide entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
…nsions Both SessionConfig.extensions and ExtendedStatistics.extensions were already TypeId-keyed Arc-of-Any maps, duplicating the type added in the previous commit for PartitionedFile.extensions. Replace both with the shared datafusion_common::extensions::Extensions, deleting the inline AnyMap type alias and IdHasher (the IdHasher optimization is preserved inside Extensions). No public API changes: - SessionConfig::with_extension / set_extension / get_extension keep the same signatures. - ExtendedStatistics::get_extension / set_extension / has_extension / merge_extensions keep the same signatures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Collapse Extensions::insert / insert_arc into a single insert<T>(Arc<T>). Values are always Arc'd internally (the map needs Clone, dyn Any isn't), but the call site signature is now consistent across all consumers (PartitionedFile, SessionConfig, ExtendedStatistics) — caller wraps in Arc explicitly. - Drop PartitionedFile::with_extension_arc; the single with_extension<T> takes Arc<T> directly. - Change ExtendedStatistics::set_extension<T>(value: T) to take Arc<T> for the same reason. - Re-add PartitionedFile::with_extensions(Arc<dyn Any + Send + Sync>) as a #[deprecated] shim that keys by the value's dynamic TypeId, restoring SemVer compatibility for code targeting 53.x. Backed by a new Extensions::insert_dyn(Arc<dyn Any + Send + Sync>) helper. - Update parquet call sites (test, example, docstring) to wrap in Arc. - Tweak the upgrade guide entry to reflect the deprecation and final API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@kosiew mind reviewing this change please? |
| /// Set a custom statistics extension. | ||
| pub fn set_extension<T: 'static + Send + Sync>(&mut self, value: T) { | ||
| self.extensions.insert(TypeId::of::<T>(), Arc::new(value)); | ||
| pub fn set_extension<T: 'static + Send + Sync>(&mut self, value: Arc<T>) { |
There was a problem hiding this comment.
ExtendedStatistics::set_extension now asks callers to allocate and pass an Arc<T>, while the existing public API accepted T and wrapped it internally. That makes this a source-breaking API change, even though the refactor is mainly about sharing the backing map.
The documented example nearby still shows stats.set_extension(HistogramStats { ... }), so existing users following that pattern will no longer compile. Could we keep the existing method shape, for example set_extension<T>(value: T) { self.extensions.insert(Arc::new(value)); }, and add a separate set_extension_arc only if callers need to avoid the allocation?
| @@ -71,7 +71,7 @@ async fn route_data_access_ops_to_parquet_file_reader_factory() { | |||
| .into_iter() | |||
| .map(|meta| { | |||
| PartitionedFile::new_from_meta(meta) | |||
There was a problem hiding this comment.
It would be great to add one integration regression test that attaches both a ParquetAccessPlan and a custom reader payload to the same PartitionedFile. The test could then assert that the custom reader still sees its payload and that the parquet access plan is honored.
The current tests cover the generic map and the two consumers separately, but not quite the end-to-end invariant this PR is trying to protect.
…ariants Restores `ExtendedStatistics::set_extension<T>(value: T)` (was source-breaking in this PR's previous revision when it switched to `Arc<T>`) and applies the same pattern to `PartitionedFile::with_extension`. Callers who already have an `Arc<T>` and want to skip the extra allocation can use the new `set_extension_arc` / `with_extension_arc` variants. `Extensions` itself gains the same split: `insert(T)` + `insert_arc(Arc<T>)`. A single method that accepts both `T` and `Arc<T>` is not expressible in stable Rust — both `impl Into<Arc<T>>` and a custom trait with two blanket impls leave the compiler unable to choose between `T = X` and `T = Arc<X>` when given an `Arc<X>`, and resolving that requires specialization. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Attaches both a custom reader payload (a `String` checked inside `InMemoryParquetFileReaderFactory`) and a `ParquetAccessPlan` (skip the first of two row groups) to the same `PartitionedFile`, then asserts that (a) the factory still sees its payload and (b) only the second row group's rows come through. Either consumer overwriting the other's slot would break this end-to-end — the per-consumer tests elsewhere catch the slots in isolation but not the coexistence invariant this PR is meant to guarantee. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks @kosiew ! Addressed both points. |
- Remove `Extensions::remove`: no callers anywhere in the tree. - Remove `ExtendedStatistics::set_extension_arc` and `PartitionedFile::with_extension_arc`: speculative `_arc` variants the reviewer suggested adding only when callers actually need to skip the Arc allocation. None do today. - Mark `Extensions::insert_dyn` as `#[deprecated]`: it exists solely to back the deprecated `PartitionedFile::with_extensions(Arc<dyn Any>)` shim. Marking it deprecated discourages new callers and lets it leave with the shim later. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFusion's clippy config enables `clippy::allow-attributes`, which mandates `#[expect(...)]` (so the suppression is automatically reported when no longer needed) instead of `#[allow(...)]`. Local `cargo clippy --all-targets --all-features -- -D warnings` did not catch this; only CI's `./dev/rust_lint.sh` lint pipeline does. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Great to see the automatic label is working, so we wont miss API breaking changes now in the release docs |
| /// // provide the plan as extension to the FileScanConfig | ||
| /// let partitioned_file = PartitionedFile::new("my_file.parquet", 1234) | ||
| /// .with_extensions(Arc::new(access_plan)); | ||
| /// .with_extension(Arc::new(access_plan)); |
There was a problem hiding this comment.
I think this example still needs a small adjustment. Passing Arc::new(access_plan) means PartitionedFile::with_extension<T>(value: T) stores the extension as Arc<ParquetAccessPlan>.
Later, create_initial_plan looks up extensions.get::<ParquetAccessPlan>(), so the access plan from this example would be ignored and all row groups may be scanned.
I think we should pass the concrete value instead:
.with_extension(access_plan);
Which issue does this PR close?
(no tracking issue yet; happy to file one if this direction is wanted)
Rationale for this change
PartitionedFile.extensionsis currently a singleOption<Arc<dyn Any + Send + Sync>>slot. That means two independent components cannot both attach their own per-file data without colliding — only one ofParquetAccessPlan, a custom index entry, a custom reader payload, etc. can be present at a time. Anyone wanting more than one piece of metadata has to define a wrapper struct and have every consumer agree on it, which is not possible because consumers are hard coded in DataFusion's code.While exploring this, I noticed DataFusion already has the same
HashMap<TypeId, Arc<dyn Any + Send + Sync>>pattern in two other places:datafusion/execution/src/config.rs— backsSessionConfig.extensions, with a cleverIdHasherthat avoids rehashingTypeIds.datafusion/physical-plan/src/operator_statistics/mod.rs— backsExtendedStatistics.extensions.Rather than add a third copy, this PR lifts the pattern into
datafusion-commonasdatafusion_common::extensions::Extensions(preserving theIdHasheroptimization) and migrates all three consumers to it.What changes are included in this PR?
This PR is split into two stacked commits:
Commit 1 —
feat: type-keyed extensions map for PartitionedFiledatafusion_common::extensions::Extensions(TypeId-keyedHashMapwith theIdHasherlifted fromSessionConfig). API:insert/insert_arc/get/get_arc/contains/remove/merge, all generic overT.PartitionedFile.extensionschanges fromOption<Arc<dyn Any + Send + Sync>>toFileExtensions(re-export ofExtensions).with_extension<T>(T)andwith_extension_arc<T>(Arc<T>)replacewith_extensions(Arc<dyn Any + Send + Sync>). New accessorextension::<T>()avoids manual downcasts.create_initial_plannow takes&FileExtensionsand usesget::<ParquetAccessPlan>()directly instead of doing the downcast inline.parquet/custom_reader.rstest,parquet/external_access_plan.rstest, and theparquet_advanced_index.rsexample.docs/source/library-user-guide/upgrading/54.0.0.md.Commit 2 —
refactor: migrate SessionConfig and ExtendedStatistics to shared ExtensionsSessionConfig.extensionsswitches from the inlineAnyMap/IdHashertoExtensions.ExtendedStatistics.extensionsswitches from inlineHashMap<TypeId, Arc<dyn Any + Send + Sync>>toExtensions.with_extension/set_extension/get_extension/has_extension/merge_extensions) — this commit is a pure refactor with no user-visible change.AnyMaptype alias andIdHasherimpl fromdatafusion/execution/src/config.rs.Are these changes tested?
Extensions(insert/get/remove/replace, merge precedence) indatafusion-common.parquet::custom_reader::route_data_access_ops_to_parquet_file_reader_factory, all 10parquet::external_access_plan::*) pass after migration.cargo test -p datafusion-execution -p datafusion-physical-plan— all passing (1397 + 63 + 14 + 10 unit/doc tests).cargo check --workspace --all-targetsclean../dev/rust_lint.shclean.Are there any user-facing changes?
Yes — breaking public-API change on
PartitionedFile(commit 1):The
extensionsfield's public type also changes (Option<Arc<dyn Any>>→FileExtensions). Upgrade guide entry added.Commit 2 has no user-visible API changes —
SessionConfigandExtendedStatisticskeep their existing extension method signatures.Open questions
ExtensionsvsAnyMapvsTypeMap. I went withExtensionsto match the existing field names; open to bikeshedding.🤖 Generated with Claude Code