[HWORKS-2802 / -2807] Document partitioned_by parameter on feature group creation#585
Open
jimdowling wants to merge 11 commits into
…tion

https://hopsworks.atlassian.net/browse/HWORKS-2802

Add a section to docs/user_guides/fs/feature_group/create.md describing the storage-engine-native partitioned_by parameter for Delta feature groups. Covers:

- Usage example with create_feature_group / get_or_create_feature_group.
- The CREATE TABLE … USING DELTA … GENERATED ALWAYS AS … contract: the storage layer derives the partition columns; the user's dataframe never carries them.
- Validation rules: mutual exclusion with partition_key, requires event_time.
- Partition pruning table: Delta auto-derives partition predicates from the GENERATED expressions for hierarchical specs (year / year+month / year+month+day / year+month+day+hour), so `fg.read(start_time=..., end_time=...)` and `fg.filter(fg.event_time >= ...)` prune at the partition level. Non-hierarchical specs (e.g. ["month"], ["year","week"]) are valid but skip the auto-derivation; only direct predicates on the grain columns prune. Recommend hierarchical specs.
- Online feature store behavior: derived columns live offline-only by default; online_partition_columns=true opts into online materialization. Until the onlinefs consumer filter ships, the backend rejects partitioned_by + online_enabled=true with the default online_partition_columns=false. Document both workarounds.
- Hudi: partitioned_by + HUDI is rejected at creation; Hudi support is tracked under a separate follow-up ticket.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
https://hopsworks.atlassian.net/browse/HWORKS-2802 The partitioned_by section described Delta GENERATED ALWAYS AS columns and storage-engine-side derivation, which is no longer how it works. Document the real design: the client derives the grain columns from event_time and writes them as real partition columns, pruning works natively on grain filters and via predicate translation on event_time ranges. Correct the online-store note: online-enabled partitioned_by feature groups are rejected entirely until HWORKS-2808, not only with the default online_partition_columns. Signed-off-by: Jim Dowling <jim@logicalclocks.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…io into HWORKS-2802
…io into HWORKS-2802
…note https://hopsworks.atlassian.net/browse/HWORKS-2802 The Hudi follow-up materializes the grain columns server-side and partitions on them directly; the CustomKeyGenerator phrasing described a mechanism the revised design no longer uses. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…io into HWORKS-2802
…io into HWORKS-2802
Contributor
Pull request overview
Adds documentation to the Feature Group creation guide describing the new partitioned_by parameter for time-grain partitioning.
Changes:
- Introduces a new “Time-grain partitioning with `partitioned_by`” section with a Python usage example.
- Documents partition-pruning behavior for hierarchical vs non-hierarchical grain specs.
- Adds notes about online feature store and Hudi behavior (currently conflicting with the PR description).
By using partitioning the system will write the feature data in different subdirectories, thus allowing you to write 10240 files per partition.
##### Time-grain partitioning with `partitioned_by` (Delta only)
When the partition columns are derived from the feature group's `event_time`, hand the backend the desired time grains with `partitioned_by=[...]` and the Python client derives the partition columns for you.
partitioned_by=["year", "month", "day"],
time_travel_format="DELTA",
)
fg.insert(df)  # df does not need year/month/day: the client derives them
Comment on lines +122 to +124
The example above is equivalent to manually decomposing `tx_ts` into three columns and passing `partition_key=["year", "month", "day"]`.
The grain columns are ordinary materialized partition columns: the client computes them from `event_time` on each write and the backend registers them as partition columns through the normal table-creation path.
The source dataframe does not need to carry them.
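As a rough illustration of the manual decomposition that the quoted lines say `partitioned_by` replaces, the sketch below derives grain fields from an event-time value. It uses only the standard library, with plain dictionaries standing in for dataframe rows; `add_grain_columns`, `GRAINS`, and the sample rows are hypothetical names for this example and are not part of the Hopsworks client.

```python
from datetime import datetime

# Assumption: only grains that map directly to datetime attributes are handled here.
GRAINS = ("year", "month", "day", "hour")


def add_grain_columns(rows, event_time, grains):
    """Return copies of rows with one integer field per requested grain."""
    unknown = [g for g in grains if g not in GRAINS]
    if unknown:
        raise ValueError(f"unsupported grains: {unknown}")
    out = []
    for row in rows:
        ts = row[event_time]
        # datetime exposes year, month, day and hour as integer attributes.
        derived = {g: getattr(ts, g) for g in grains}
        out.append({**row, **derived})
    return out


rows = [
    {"tx_id": 1, "tx_ts": datetime(2026, 1, 5, 10, 30)},
    {"tx_id": 2, "tx_ts": datetime(2026, 2, 17, 23, 59)},
]
manual = add_grain_columns(rows, "tx_ts", ["year", "month", "day"])
```

With `partition_key=["year", "month", "day"]` the pipeline would have to carry out a step like this itself; with `partitioned_by`, per the text above, the client does it on each write and the source rows stay unchanged.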
Comment on lines +131 to +133
The grain columns are real partition columns, so a filter on a grain column (for example `year == 2026`) prunes partitions natively.
A filter on an `event_time` range is rewritten into equivalent grain-column predicates by the query layer, so it prunes too on hierarchical specs:
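The rewrite itself is not shown in this excerpt. As a hedged sketch of the idea only, the code below checks whether a spec is a prefix of year, month, day, hour and, if so, enumerates the grain tuples whose partitions overlap a time range. `is_hierarchical` and `partitions_for_range` are hypothetical helpers written for this illustration; the actual query layer may express the predicates quite differently.

```python
from datetime import datetime, timedelta

HIERARCHY = ("year", "month", "day", "hour")


def is_hierarchical(spec):
    """True when spec is a non-empty prefix of year > month > day > hour."""
    return len(spec) > 0 and tuple(spec) == HIERARCHY[: len(spec)]


def partitions_for_range(spec, start, end):
    """List the grain tuples whose partitions overlap [start, end]."""
    if not is_hierarchical(spec):
        raise ValueError("range translation is only sketched for hierarchical specs")
    step = timedelta(hours=1) if "hour" in spec else timedelta(days=1)
    # Truncate the start to a step boundary so every overlapping bucket is visited.
    cursor = start.replace(minute=0, second=0, microsecond=0)
    if "hour" not in spec:
        cursor = cursor.replace(hour=0)
    keys = []
    while cursor <= end:
        key = tuple(getattr(cursor, g) for g in spec)
        # Keys are produced in time order, so comparing with the last one deduplicates.
        if not keys or keys[-1] != key:
            keys.append(key)
        cursor += step
    return keys


months = partitions_for_range(
    ["year", "month"], datetime(2025, 12, 30), datetime(2026, 1, 2)
)
```

Each returned tuple corresponds to one conjunction of grain predicates (for example `year == 2025 and month == 12`), which is the kind of filter the text above says prunes natively.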
Comment on lines +148 to +150
Online-enabled feature groups do not yet support `partitioned_by`.
The online ingestion path does not exclude the offline-only grain columns from the Kafka/Avro schema, nor materialize them for the online write, so the backend rejects `partitioned_by` together with `online_enabled=true` until that work lands (tracked under a separate follow-up ticket).
Keep the feature group offline-only to use `partitioned_by`.
Comment on lines +154 to +156
`partitioned_by` on `time_travel_format="HUDI"` feature groups is not yet supported and the backend rejects it at creation.
Hudi materializes the grain columns server-side in the streaming materialization job, and that work is tracked under a separate follow-up ticket.
Until that lands, use `time_travel_format="DELTA"` to get time-grain partitioning, or partition Hudi groups explicitly via `partition_key=["year"]` with a `year` column the upstream pipeline computes.
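The constraints scattered through this PR (mutual exclusion with `partition_key`, a required `event_time`, enum membership, and the online and format rejections) can be gathered into one pre-flight check. The sketch below is a hypothetical client-side helper, not the backend's validator: `check_partitioned_by` is an invented name, the grain set is assumed from the grains mentioned in this PR, and the supported formats follow the later HWORKS-2807 commit note (DELTA and ICEBERG accepted, HUDI and NONE rejected).

```python
# Assumption: the grain enum contains exactly the grains named in this PR.
ALLOWED_GRAINS = {"year", "month", "week", "day", "hour"}
# Per the HWORKS-2807 commit note; the quoted doc revision above still says Delta only.
SUPPORTED_FORMATS = {"DELTA", "ICEBERG"}


def check_partitioned_by(partitioned_by, *, event_time=None, partition_key=None,
                         online_enabled=False, time_travel_format="DELTA"):
    """Return a list of problems; an empty list means the combination looks valid."""
    problems = []
    if not partitioned_by:
        return problems  # nothing to validate without time-grain partitioning
    unknown = sorted(set(partitioned_by) - ALLOWED_GRAINS)
    if unknown:
        problems.append(f"unknown grains: {unknown}")
    if event_time is None:
        problems.append("partitioned_by requires event_time")
    if partition_key:
        problems.append("partitioned_by and partition_key are mutually exclusive")
    if online_enabled:
        problems.append("partitioned_by is not supported on online-enabled feature groups")
    if time_travel_format not in SUPPORTED_FORMATS:
        problems.append(
            f"partitioned_by is not supported with time_travel_format={time_travel_format!r}"
        )
    return problems


ok = check_partitioned_by(["year", "month", "day"], event_time="tx_ts")
hudi = check_partitioned_by(["year"], event_time="tx_ts", time_travel_format="HUDI")
```

Returning a list rather than raising on the first failure lets a caller report every conflicting argument at once; the backend remains the authority on what is actually rejected.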
https://hopsworks.atlassian.net/browse/HWORKS-2802 Flesh out the partitioned_by section into reference for the shipped feature: the parameter list (partitioned_by + online_partition_columns with their constraints), cross-session persistence and the round-trip through get_feature_group, the on-disk Hive layout, a read/partition-pruning example with the hierarchical-vs-non-hierarchical matrix, a clickstream-by-hour example, and the current online and Hudi limitations (online rejected at create and on enable). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…io into HWORKS-2802
https://hopsworks.atlassian.net/browse/HWORKS-2807 partitioned_by now works on DELTA and ICEBERG; NONE is rejected alongside Hudi. Update the section heading, supported-formats note, and the Hudi fallback guidance. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Summary
User-guide section documenting the new `partitioned_by` parameter on feature group creation. Lives under the existing partitioning area in `docs/user_guides/fs/feature_group/create.md`.

Covers:

- … `create_feature_group` / `get_or_create_feature_group`.
- … `GENERATED ALWAYS AS` handles it server-side.
- … `partition_key`, requires `event_time`, enum membership).
- `fg.read(start_time, end_time)` and `fg.filter(fg.event_time >= ...)` prune at the partition level for hierarchical `partitioned_by`. Non-hierarchical specs (`["month"]`, `["year","week"]`) are valid but skip auto-derivation.
- … `online_partition_columns=true` opts into online materialization.
- … `PartitionedByTransformer` + `CustomKeyGenerator`.

Pairs with:
JIRA: HWORKS-2802. Engineering walkthrough: Confluence page.
Test plan
- `npx markdownlint-cli2 docs/user_guides/fs/feature_group/create.md` clean.
- `uv run mkdocs build -s` clean (run after the SDK PR lands, since the API reference plugin pulls from `hopsworks-api` `main`).
- … `mkdocs serve`.

🤖 Generated with Claude Code