[HWORKS-2802 / -2807] Document partitioned_by parameter on feature group creation by jimdowling · Pull Request #585 · logicalclocks/logicalclocks.github.io

jimdowling · 2026-05-21T12:49:02Z

Summary

User-guide section documenting the new partitioned_by parameter on feature group creation. Lives under the existing partitioning area in docs/user_guides/fs/feature_group/create.md.

Covers:

- Usage example with create_feature_group / get_or_create_feature_group.
- The storage-engine-derived contract: the user's dataframe never carries the grain columns; Delta GENERATED ALWAYS AS handles it server-side.
- Validation rules (mutual exclusion with partition_key, requires event_time, enum membership).
- Partition-pruning table: Delta auto-derives partition predicates from the GENERATED expressions for hierarchical specs. fg.read(start_time, end_time) and fg.filter(fg.event_time >= ...) prune at the partition level for hierarchical partitioned_by. Non-hierarchical specs (["month"], ["year","week"]) are valid but skip auto-derivation.
- Online feature store behavior: derived columns live offline-only by default; online_partition_columns=true opts into online materialization.
- Hudi: rejected before HWORKS-2807; after HWORKS-2807 the same parameter works on Hudi via the server-side PartitionedByTransformer + CustomKeyGenerator.
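
The validation rules listed above can be illustrated with a small sketch. It is not the client's actual code: the function name `validate_partitioned_by` is hypothetical, and the grain set is an assumption assembled from the grains named in this PR.

```python
from typing import Optional, Sequence

# Assumption: grain names mentioned in this PR; the real enum may differ.
ALLOWED_GRAINS = ("year", "month", "week", "day", "hour")


def validate_partitioned_by(
    partitioned_by: Optional[Sequence[str]],
    partition_key: Optional[Sequence[str]] = None,
    event_time: Optional[str] = None,
) -> None:
    """Raise ValueError if the arguments break one of the three rules."""
    if not partitioned_by:
        return  # nothing to validate when the parameter is unset
    if partition_key:
        raise ValueError("partitioned_by and partition_key are mutually exclusive")
    if not event_time:
        raise ValueError("partitioned_by requires event_time")
    unknown = [g for g in partitioned_by if g not in ALLOWED_GRAINS]
    if unknown:
        raise ValueError(f"unknown time grains: {unknown}")
```

A call such as `validate_partitioned_by(["year", "month"], event_time="tx_ts")` passes silently, while supplying `partition_key` as well raises.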

Pairs with:

- hopsworks-api#961: Python client side.
- hopsworks-ee#3034: Backend.
- loadtest#859: End-to-end workflow.

JIRA: HWORKS-2802. Engineering walkthrough: Confluence page.

Test plan

- npx markdownlint-cli2 docs/user_guides/fs/feature_group/create.md clean.
- uv run mkdocs build -s clean (run after the SDK PR lands, since the API reference plugin pulls from hopsworks-api main).
- Visual check of the rendered section via mkdocs serve.

🤖 Generated with Claude Code

…tion

https://hopsworks.atlassian.net/browse/HWORKS-2802

Add a section to docs/user_guides/fs/feature_group/create.md describing the storage-engine-native partitioned_by parameter for Delta feature groups. Covers:

- Usage example with create_feature_group / get_or_create_feature_group.
- The CREATE TABLE … USING DELTA … GENERATED ALWAYS AS … contract: the storage layer derives the partition columns; the user's dataframe never carries them.
- Validation rules: mutual exclusion with partition_key, requires event_time.
- Partition pruning table: Delta auto-derives partition predicates from the GENERATED expressions for hierarchical specs (year / year+month / year+month+day / year+month+day+hour), so `fg.read(start_time=..., end_time=...)` and `fg.filter(fg.event_time >= ...)` prune at the partition level. Non-hierarchical specs (e.g. ["month"], ["year","week"]) are valid but skip the auto-derivation; only direct predicates on the grain columns prune. Recommend hierarchical specs.
- Online feature store behavior: derived columns live offline-only by default; online_partition_columns=true opts into online materialization. Until the onlinefs consumer filter ships, the backend rejects partitioned_by + online_enabled=true with the default online_partition_columns=false. Document both workarounds.
- Hudi: partitioned_by + HUDI is rejected at creation; Hudi support is tracked under a separate follow-up ticket.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

https://hopsworks.atlassian.net/browse/HWORKS-2802

The partitioned_by section described Delta GENERATED ALWAYS AS columns and storage-engine-side derivation, which is no longer how it works. Document the real design: the client derives the grain columns from event_time and writes them as real partition columns; pruning works natively on grain filters and via predicate translation on event_time ranges. Correct the online-store note: online-enabled partitioned_by feature groups are rejected entirely until HWORKS-2808, not only with the default online_partition_columns.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…io into HWORKS-2802

…note

https://hopsworks.atlassian.net/browse/HWORKS-2802

The Hudi follow-up materializes the grain columns server-side and partitions on them directly; the CustomKeyGenerator phrasing described a mechanism the revised design no longer uses.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…io into HWORKS-2802

Copilot

Pull request overview

Adds documentation to the Feature Group creation guide describing the new partitioned_by parameter for time-grain partitioning.

Changes:

- Introduces a new “Time-grain partitioning with partitioned_by” section with a Python usage example.
- Documents partition-pruning behavior for hierarchical vs non-hierarchical grain specs.
- Adds notes about online feature store and Hudi behavior (currently conflicting with the PR description).


 By using partitioning the system will write the feature data in different subdirectories, thus allowing you to write 10240 files per partition.


+##### Time-grain partitioning with `partitioned_by` (Delta only)
+
+When the partition columns are derived from the feature group's `event_time`, hand the backend the desired time grains with `partitioned_by=[...]` and the Python client derives the partition columns for you.


+    partitioned_by=["year", "month", "day"],
+    time_travel_format="DELTA",
+)
+fg.insert(df)  # df does not need year/month/day — the client derives them


+The example above is equivalent to manually decomposing `tx_ts` into three columns and passing `partition_key=["year", "month", "day"]`.
+The grain columns are ordinary materialized partition columns: the client computes them from `event_time` on each write and the backend registers them as partition columns through the normal table-creation path.
+The source dataframe does not need to carry them.
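
To make the "manual decomposition" equivalence concrete, here is a minimal standard-library sketch of deriving grain columns from an event-time value. The helper `add_grain_columns` and the sample rows are hypothetical and only illustrate the idea; the client's real implementation operates on dataframes and may differ.

```python
from datetime import datetime

# Map each grain name to the datetime attribute it is derived from.
_GRAIN_EXTRACTORS = {
    "year": lambda ts: ts.year,
    "month": lambda ts: ts.month,
    "day": lambda ts: ts.day,
    "hour": lambda ts: ts.hour,
}


def add_grain_columns(rows, event_time, partitioned_by):
    """Return copies of the rows with one extra column per requested grain."""
    enriched_rows = []
    for row in rows:
        ts = row[event_time]
        derived = {grain: _GRAIN_EXTRACTORS[grain](ts) for grain in partitioned_by}
        enriched_rows.append({**row, **derived})
    return enriched_rows


# Hypothetical input: the source rows carry only the event-time column.
rows = [{"tx_id": 1, "tx_ts": datetime(2026, 5, 21, 12, 49)}]
enriched = add_grain_columns(rows, "tx_ts", ["year", "month", "day"])
```

After the call, each row carries `year`, `month` and `day` next to `tx_ts`, which is what passing `partition_key=["year", "month", "day"]` with hand-computed columns would require.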


+The grain columns are real partition columns, so a filter on a grain column (for example `year == 2026`) prunes partitions natively.
+A filter on an `event_time` range is rewritten into equivalent grain-column predicates by the query layer, so it prunes too on hierarchical specs:
+
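
The distinction between hierarchical and non-hierarchical specs can be sketched in a few lines. This is an illustration under stated assumptions, not the query layer's algorithm: the function names are hypothetical, the hierarchy order follows the year / year+month / year+month+day / year+month+day+hour specs listed in this PR, and the range is treated as inclusive on both ends. It steps through the range hour by hour, which is simple but not efficient.

```python
from datetime import datetime, timedelta

HIERARCHY = ["year", "month", "day", "hour"]


def is_hierarchical(spec):
    """True if spec is a non-empty prefix of year, month, day, hour."""
    return 0 < len(spec) <= len(HIERARCHY) and list(spec) == HIERARCHY[: len(spec)]


def partitions_for_range(start, end, spec):
    """List the partition tuples an event-time range [start, end] touches."""
    if not is_hierarchical(spec):
        raise ValueError("a time range maps to partitions only for hierarchical specs")
    cursor = start.replace(minute=0, second=0, microsecond=0)
    partitions = []
    while cursor <= end:
        key = tuple(getattr(cursor, grain) for grain in spec)
        if not partitions or partitions[-1] != key:
            partitions.append(key)  # keys arrive in order, so compare with the last
        cursor += timedelta(hours=1)
    return partitions


touched = partitions_for_range(
    datetime(2026, 5, 30, 22), datetime(2026, 6, 1, 3), ["year", "month", "day"]
)
```

A spec such as `["month"]` or `["year", "week"]` fails the prefix check, which mirrors why only direct predicates on the grain columns prune for those specs.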


+Online-enabled feature groups do not yet support `partitioned_by`.
+The online ingestion path does not exclude the offline-only grain columns from the Kafka/Avro schema, nor materialize them for the online write, so the backend rejects `partitioned_by` together with `online_enabled=true` until that work lands (tracked under a separate follow-up ticket).
+Keep the feature group offline-only to use `partitioned_by`.


+`partitioned_by` on `time_travel_format="HUDI"` feature groups is not yet supported and the backend rejects it at creation.
+Hudi materializes the grain columns server-side in the streaming materialization job, and that work is tracked under a separate follow-up ticket.
+Until that lands, use `time_travel_format="DELTA"` to get time-grain partitioning, or partition Hudi groups explicitly via `partition_key=["year"]` with a `year` column the upstream pipeline computes.


https://hopsworks.atlassian.net/browse/HWORKS-2802

Flesh out the partitioned_by section into reference for the shipped feature: the parameter list (partitioned_by + online_partition_columns with their constraints), cross-session persistence and the round-trip through get_feature_group, the on-disk Hive layout, a read/partition-pruning example with the hierarchical-vs-non-hierarchical matrix, a clickstream-by-hour example, and the current online and Hudi limitations (online rejected at create and on enable).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…io into HWORKS-2802

https://hopsworks.atlassian.net/browse/HWORKS-2807

partitioned_by now works on DELTA and ICEBERG; NONE is rejected alongside Hudi. Update the section heading, supported-formats note, and the Hudi fallback guidance.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jimdowling changed the title ~~[HWORKS-2802] Document partitioned_by parameter on feature group creation~~ [HWORKS-2802 / -2807] Document partitioned_by parameter on feature group creation May 21, 2026

jimdowling and others added 7 commits May 30, 2026 11:43

Merge remote-tracking branch 'upstream/main' into HWORKS-2802 (e3d5db3)

Merge branch 'main' of github.com:logicalclocks/logicalclocks.github.io into HWORKS-2802 (523c327)

Merge branch 'main' of github.com:logicalclocks/logicalclocks.github.io into HWORKS-2802 (a899acc)

Merge branch 'main' of github.com:logicalclocks/logicalclocks.github.io into HWORKS-2802 (c28568c)

Merge branch 'main' of github.com:logicalclocks/logicalclocks.github.io into HWORKS-2802 (f1376e2)

jimdowling marked this pull request as ready for review June 11, 2026 04:35

jimdowling requested a review from Copilot June 11, 2026 04:35

Copilot started reviewing on behalf of jimdowling June 11, 2026 04:35

Copilot AI reviewed Jun 11, 2026

jimdowling and others added 3 commits June 11, 2026 06:41

Merge branch 'main' of github.com:logicalclocks/logicalclocks.github.io into HWORKS-2802 (fcaf241)

jimdowling wants to merge 11 commits into logicalclocks:main from jimdowling:HWORKS-2802

