[HWORKS-2802 / -2807] Document partitioned_by parameter on feature group creation #585

Open
jimdowling wants to merge 11 commits into logicalclocks:main from jimdowling:HWORKS-2802

Conversation

@jimdowling jimdowling commented May 21, 2026

Contributor

Summary

User-guide section documenting the new partitioned_by parameter on feature group creation. Lives under the existing partitioning area in docs/user_guides/fs/feature_group/create.md.

Covers:

  • Usage example with create_feature_group / get_or_create_feature_group.
  • The storage-engine-derived contract: the user's dataframe never carries the grain columns; Delta GENERATED ALWAYS AS handles it server-side.
  • Validation rules (mutual exclusion with partition_key, requires event_time, enum membership).
  • Partition-pruning table — Delta auto-derives partition predicates from the GENERATED expressions for hierarchical specs. fg.read(start_time, end_time) and fg.filter(fg.event_time >= ...) prune at the partition level for hierarchical partitioned_by. Non-hierarchical specs (["month"], ["year","week"]) are valid but skip auto-derivation.
  • Online feature store behavior: derived columns live offline-only by default; online_partition_columns=true opts into online materialization.
  • Hudi: rejected before HWORKS-2807; after HWORKS-2807 the same parameter works on Hudi via the server-side PartitionedByTransformer + CustomKeyGenerator.
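
As a rough illustration of the validation rules listed above (mutual exclusion with partition_key, requires event_time, enum membership), the checks could be sketched as follows. This is not the Hopsworks implementation: `validate_partitioned_by`, `ALLOWED_GRAINS`, and the error messages are hypothetical, and the grain set is only assumed from the grains named in this PR.

```python
from typing import Optional, Sequence

# Assumption: these are the grain names mentioned in this PR; the real enum may differ.
ALLOWED_GRAINS = ("year", "month", "week", "day", "hour")


def validate_partitioned_by(
    partitioned_by: Optional[Sequence[str]],
    partition_key: Optional[Sequence[str]] = None,
    event_time: Optional[str] = None,
) -> None:
    """Raise ValueError if the arguments break one of the three listed rules."""
    if not partitioned_by:
        return  # nothing to validate when the parameter is unused
    if partition_key:
        raise ValueError("partitioned_by and partition_key are mutually exclusive")
    if not event_time:
        raise ValueError("partitioned_by requires event_time to be set")
    unknown = [g for g in partitioned_by if g not in ALLOWED_GRAINS]
    if unknown:
        raise ValueError(f"unknown time grains: {unknown}")


# A non-hierarchical spec passes: it is valid, it just prunes less.
validate_partitioned_by(["year", "week"], event_time="tx_ts")
```

The sketch covers only the three rules in the list above; the format and online-store restrictions discussed later in this conversation are deliberately left out.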

Pairs with:

JIRA: HWORKS-2802. Engineering walkthrough: Confluence page.

Test plan

  • npx markdownlint-cli2 docs/user_guides/fs/feature_group/create.md clean.
  • uv run mkdocs build -s clean (run after the SDK PR lands, since the API reference plugin pulls from hopsworks-api main).
  • Visual check of the rendered section via mkdocs serve.

🤖 Generated with Claude Code

…tion

https://hopsworks.atlassian.net/browse/HWORKS-2802

Add a section to docs/user_guides/fs/feature_group/create.md
describing the storage-engine-native partitioned_by parameter for
Delta feature groups. Covers:

- Usage example with create_feature_group / get_or_create_feature_group.
- The CREATE TABLE … USING DELTA … GENERATED ALWAYS AS … contract:
  the storage layer derives the partition columns; the user's
  dataframe never carries them.
- Validation rules: mutual exclusion with partition_key, requires
  event_time.
- Partition pruning table — Delta auto-derives partition predicates
  from the GENERATED expressions for hierarchical specs (year /
  year+month / year+month+day / year+month+day+hour), so
  `fg.read(start_time=..., end_time=...)` and
  `fg.filter(fg.event_time >= ...)` prune at the partition level.
  Non-hierarchical specs (e.g. ["month"], ["year","week"]) are valid
  but skip the auto-derivation — only direct predicates on the
  grain columns prune. Recommend hierarchical specs.
- Online feature store behavior: derived columns live offline-only
  by default; online_partition_columns=true opts into online
  materialization. Until the onlinefs consumer filter ships, the
  backend rejects partitioned_by + online_enabled=true with the
  default online_partition_columns=false. Document both
  workarounds.
- Hudi: partitioned_by + HUDI is rejected at creation; Hudi support
  is tracked under a separate follow-up ticket.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jimdowling jimdowling changed the title [HWORKS-2802] Document partitioned_by parameter on feature group creation [HWORKS-2802 / -2807] Document partitioned_by parameter on feature group creation May 21, 2026
jimdowling and others added 7 commits May 30, 2026 11:43
https://hopsworks.atlassian.net/browse/HWORKS-2802

The partitioned_by section described Delta GENERATED ALWAYS AS columns and
storage-engine-side derivation, which is no longer how it works. Document
the real design: the client derives the grain columns from event_time and
writes them as real partition columns; pruning works natively on grain
filters and via predicate translation on event_time ranges. Correct the
online-store note: online-enabled partitioned_by feature groups are
rejected entirely until HWORKS-2808, not only with the default
online_partition_columns.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…note

https://hopsworks.atlassian.net/browse/HWORKS-2802

The Hudi follow-up materializes the grain columns server-side and
partitions on them directly; the CustomKeyGenerator phrasing described
a mechanism the revised design no longer uses.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jimdowling jimdowling marked this pull request as ready for review June 11, 2026 04:35
@jimdowling jimdowling requested a review from Copilot June 11, 2026 04:35

Copilot AI left a comment

Contributor

Pull request overview

Adds documentation to the Feature Group creation guide describing the new partitioned_by parameter for time-grain partitioning.

Changes:

  • Introduces a new “Time-grain partitioning with partitioned_by” section with a Python usage example.
  • Documents partition-pruning behavior for hierarchical vs non-hierarchical grain specs.
  • Adds notes about online feature store and Hudi behavior (currently conflicting with the PR description).


By using partitioning, the system writes the feature data into different subdirectories, allowing you to write 10240 files per partition.

##### Time-grain partitioning with `partitioned_by` (Delta only)

When the partition columns are derived from the feature group's `event_time`, pass the desired time grains with `partitioned_by=[...]` and the Python client derives the partition columns for you.
partitioned_by=["year", "month", "day"],
time_travel_format="DELTA",
)
fg.insert(df) # df does not need year/month/day — the client derives them
Comment on lines +122 to +124
The example above is equivalent to manually decomposing `tx_ts` into three columns and passing `partition_key=["year", "month", "day"]`.
The grain columns are ordinary materialized partition columns: the client computes them from `event_time` on each write and the backend registers them as partition columns through the normal table-creation path.
The source dataframe does not need to carry them.
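
To make the "manual decomposition" concrete, here is a minimal stdlib-only sketch of deriving grain values from an event timestamp. `derive_grain_columns` and `hive_partition_path` are hypothetical helpers, not part of the Hopsworks client, and the unpadded `key=value` directory naming is an assumption based on the usual Hive convention rather than something this PR states.

```python
from datetime import datetime
from typing import Dict, Sequence

# Grains that map directly onto datetime attributes (assumed subset for this sketch).
SUPPORTED_GRAINS = ("year", "month", "day", "hour")


def derive_grain_columns(event_time: datetime, grains: Sequence[str]) -> Dict[str, int]:
    """Return one integer value per requested grain, in the order requested."""
    unknown = [g for g in grains if g not in SUPPORTED_GRAINS]
    if unknown:
        raise ValueError(f"unsupported grains in this sketch: {unknown}")
    return {g: getattr(event_time, g) for g in grains}


def hive_partition_path(grain_values: Dict[str, int]) -> str:
    """Join grain values into a Hive-style relative directory path."""
    return "/".join(f"{name}={value}" for name, value in grain_values.items())


row = {"tx_ts": datetime(2026, 5, 21, 13, 45), "amount": 12.5}
grains = derive_grain_columns(row["tx_ts"], ["year", "month", "day"])
print(grains)                       # {'year': 2026, 'month': 5, 'day': 21}
print(hive_partition_path(grains))  # year=2026/month=5/day=21
```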
Comment on lines +131 to +133
The grain columns are real partition columns, so a filter on a grain column (for example `year == 2026`) prunes partitions natively.
A filter on an `event_time` range is rewritten into equivalent grain-column predicates by the query layer, so it prunes too on hierarchical specs:
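
One way to picture that rewrite, under stated assumptions, is to enumerate the partition tuples that overlap the requested range. The sketch below is not the Hopsworks query layer: `partitions_for_range` and `HIERARCHY` are hypothetical names, the range is treated as inclusive on both ends, and a real engine would more likely emit range predicates than an explicit list.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple

# A spec is hierarchical when it is a prefix of this ordering.
HIERARCHY = ["year", "month", "day", "hour"]


def partitions_for_range(
    start: datetime, end: datetime, grains: Sequence[str]
) -> List[Tuple[int, ...]]:
    """Return the distinct partition tuples overlapping [start, end], in time order."""
    if not grains or list(grains) != HIERARCHY[: len(grains)]:
        raise ValueError("only hierarchical specs can be derived from an event_time range")
    if end < start:
        raise ValueError("end must not be earlier than start")
    if "hour" in grains:
        step = timedelta(hours=1)
        cursor = start.replace(minute=0, second=0, microsecond=0)
    else:
        # Day steps are fine for coarser grains; duplicates are collapsed below.
        step = timedelta(days=1)
        cursor = start.replace(hour=0, minute=0, second=0, microsecond=0)
    seen = {}  # dict keys keep insertion order and drop repeats
    while cursor <= end:
        seen[tuple(getattr(cursor, g) for g in grains)] = None
        cursor += step
    return list(seen)


print(partitions_for_range(
    datetime(2026, 5, 30, 22), datetime(2026, 6, 1, 3), ["year", "month", "day"]
))  # [(2026, 5, 30), (2026, 5, 31), (2026, 6, 1)]
```

A non-hierarchical spec such as `["month"]` raises here, mirroring the point that an `event_time` range alone does not identify its partitions.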

Comment on lines +148 to +150
Online-enabled feature groups do not yet support `partitioned_by`.
The online ingestion path neither excludes the offline-only grain columns from the Kafka/Avro schema nor materializes them for the online write, so the backend rejects `partitioned_by` together with `online_enabled=true` until that work lands (tracked under a separate follow-up ticket).
Keep the feature group offline-only to use `partitioned_by`.
Comment on lines +154 to +156
`partitioned_by` on `time_travel_format="HUDI"` feature groups is not yet supported and the backend rejects it at creation.
Hudi support requires materializing the grain columns server-side in the streaming materialization job; that work is tracked under a separate follow-up ticket.
Until that lands, use `time_travel_format="DELTA"` to get time-grain partitioning, or partition Hudi groups explicitly via `partition_key=["year"]` with a `year` column the upstream pipeline computes.
jimdowling and others added 3 commits June 11, 2026 06:41
https://hopsworks.atlassian.net/browse/HWORKS-2802

Flesh out the partitioned_by section into a reference for the shipped
feature: the parameter list (partitioned_by + online_partition_columns
with their constraints), cross-session persistence and the round-trip
through get_feature_group, the on-disk Hive layout, a read/partition-
pruning example with the hierarchical-vs-non-hierarchical matrix, a
clickstream-by-hour example, and the current online and Hudi
limitations (online rejected at create and on enable).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
https://hopsworks.atlassian.net/browse/HWORKS-2807

partitioned_by now works on DELTA and ICEBERG; NONE is rejected alongside
Hudi. Update the section heading, supported-formats note, and the Hudi
fallback guidance.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>