feat(lora): save/restore LoRA config in checkpoint metadata by RexBearIU · Pull Request #4269 · AI-Hypercomputer/maxtext

RexBearIU · 2026-06-25T10:35:32Z

Description

This PR implements native serialization of LoRA configuration parameters (lora_rank, lora_alpha) in standard Orbax _CHECKPOINT_METADATA files, and automatically restores them during checkpoint-to-Hugging Face conversion.

Why is this change being made?

Previously, users had to manually supply matching lora.lora_rank and lora.lora_alpha parameters when converting MaxText checkpoints to Hugging Face format. Storing them in Orbax metadata makes the conversion seamless and error-free (resolves @igorts-git's request in #3970).

Key Implementation Details

Serialization: In save_checkpoint (checkpointing.py), we save the active config.lora block under the "lora" key in Orbax's custom_metadata when a LoRA rank is specified.
Restoration: In main (to_huggingface.py), sync_lora_metadata reads the custom metadata from lora_restore_path via ocp.StandardCheckpointer and overrides active config parameters during conversion.
Fail-Fast Safety: Scoped strictly to the conversion path to ensure SFT training paths remain strict and fail fast on any configuration mismatches.
Test Import Refactoring: Refactored hf_checkpoint_conversion_test.py to move dynamically loaded inline imports to top-level imports, and removed the json import since the JSON string is now written directly.
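
The serialization and restoration steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names build_lora_custom_metadata and apply_lora_custom_metadata are hypothetical, the config is modeled as a mutable SimpleNamespace (the real MaxText config object may behave differently), and the Orbax save/restore calls are deliberately left out.

```python
from types import SimpleNamespace


def build_lora_custom_metadata(config):
  """Hypothetical helper: build the dict stored under custom_metadata.

  Returns {} when no LoRA rank is set, so non-LoRA checkpoints stay unchanged.
  """
  lora = getattr(config, "lora", None)
  if lora is None or not getattr(lora, "lora_rank", 0):
    return {}
  return {
      "lora": {
          "lora_rank": int(lora.lora_rank),
          "lora_alpha": float(lora.lora_alpha),
      }
  }


def apply_lora_custom_metadata(config, custom_metadata):
  """Hypothetical helper: override config.lora from saved metadata.

  Returns the names of the fields that were changed.
  """
  saved = (custom_metadata or {}).get("lora", {})
  changed = []
  for name in ("lora_rank", "lora_alpha"):
    if name in saved and getattr(config.lora, name, None) != saved[name]:
      setattr(config.lora, name, saved[name])
      changed.append(name)
  return changed


# Training side: the active LoRA settings become checkpoint metadata.
train_cfg = SimpleNamespace(lora=SimpleNamespace(lora_rank=16, lora_alpha=32.0))
metadata = build_lora_custom_metadata(train_cfg)

# Conversion side: stale user-supplied values are replaced by the saved ones.
convert_cfg = SimpleNamespace(lora=SimpleNamespace(lora_rank=8, lora_alpha=8.0))
changed = apply_lora_custom_metadata(convert_cfg, metadata)
```

Returning the list of changed fields makes it easy to log exactly what was overridden during conversion, which may help when debugging a mismatch.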

BUGS: #3970

Tests

We verified the implementation with both suite-level and individual unit tests:

Added/Updated Unit Tests:
- SyncLoRAMetadataTest in tests/unit/hf_checkpoint_conversion_test.py to verify the auto-resolving mechanism during Hugging Face conversion.
Command to run:
python tests/unit/hf_checkpoint_conversion_test.py
All tests pass successfully.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-06-25T10:39:20Z

Codecov Report

❌ Patch coverage is 85.29412% with 5 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...rc/maxtext/checkpoint_conversion/to_huggingface.py	0.00%	2 Missing ⚠️
src/maxtext/utils/lora_utils.py	92.59%	2 Missing ⚠️
src/maxtext/common/checkpointing.py	80.00%	1 Missing ⚠️


shralex

Thanks Jackie! A significant thing missing in this PR is using the metadata file on checkpoint restore path.

RexBearIU · 2026-06-25T15:14:08Z

Hi @shralex, thank you for the feedback!

I have fully addressed your comments with the following changes:

Checkpoint Restore Auto-Sync: Implemented automatic LoRA rank and alpha syncing from the Orbax native _CHECKPOINT_METADATA file's custom_metadata on the training/SFT restore path (restore_lora_from_path in lora_utils.py). Now, training/SFT runs resuming or restoring from a LoRA checkpoint will automatically detect, sync, and apply the correct LoRA rank and alpha parameters from the saved checkpoint metadata.
Unified Native Orbax Metadata: Switched from creating and loading a custom lora_config.json to using Orbax's native custom_metadata dictionary inside _CHECKPOINT_METADATA. This conforms perfectly to standard checkpointing conventions without introducing any custom, out-of-band config files.
Path Resilience: Enhanced metadata resolution to support paths pointing to either the step directory directly (e.g., .../checkpoints/1000/) or to any nested parameter subfolders (e.g., .../checkpoints/1000/items/), resolving parent paths gracefully.
Expanded Unit Tests & Linting: Added and modified tests (SyncLoRAMetadataTest and SyncLoRAMetadataTrainingTest in both test suites) covering both conversion and training/SFT-side auto-restore flows. Verified everything compiles, passes all pre-commit formatting/styling, and is 100% green!
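
The path-resilience behavior described above could look something like the sketch below. It is an illustration under several assumptions, not the PR's implementation: find_checkpoint_metadata and read_lora_metadata are hypothetical names, the checkpoint is assumed to live on a local filesystem reachable through pathlib (real checkpoints are often on cloud storage), and _CHECKPOINT_METADATA is assumed to be a JSON file with a top-level custom_metadata key.

```python
import json
import tempfile
from pathlib import Path

METADATA_FILENAME = "_CHECKPOINT_METADATA"


def find_checkpoint_metadata(path, max_levels=2):
  """Hypothetical helper: look in `path`, then in up to `max_levels` parents."""
  current = Path(path)
  for _ in range(max_levels + 1):
    candidate = current / METADATA_FILENAME
    if candidate.is_file():
      return candidate
    if current.parent == current:  # reached the filesystem root
      break
    current = current.parent
  return None


def read_lora_metadata(path):
  """Hypothetical helper: return the saved "lora" block, or {} if absent."""
  metadata_file = find_checkpoint_metadata(path)
  if metadata_file is None:
    return {}
  with metadata_file.open() as f:
    content = json.load(f)
  custom = content.get("custom_metadata") or {}
  return custom.get("lora", {})


# Both the step directory and a nested subfolder resolve to the same metadata.
with tempfile.TemporaryDirectory() as tmp:
  step_dir = Path(tmp) / "checkpoints" / "1000"
  items_dir = step_dir / "items"
  items_dir.mkdir(parents=True)
  payload = {"custom_metadata": {"lora": {"lora_rank": 16, "lora_alpha": 32.0}}}
  (step_dir / METADATA_FILENAME).write_text(json.dumps(payload))
  from_step = read_lora_metadata(step_dir)
  from_items = read_lora_metadata(items_dir)
```

Bounding the upward search with max_levels keeps the lookup from accidentally picking up metadata that belongs to an unrelated checkpoint higher in the directory tree.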

Please let me know if you would like any other enhancements!

shralex · 2026-06-25T15:44:03Z

  max_logging.log(f"Elapse for transform and save: {(time.time() - start) / 60:.2f} min")

+def sync_lora_metadata(config) -> None:

Can we import and reuse this function from lora_utils?

RexBearIU · Jun 26, 2026

Hi @shralex, in our latest iteration we removed sync_lora_metadata from lora_utils.py entirely. This keeps the SFT training/fine-tuning paths strict and fail-fast: a run crashes immediately when its config does not match the checkpoint. Since the synchronization function is no longer part of lora_utils.py, it lives only in to_huggingface.py and is used for conversion only.

xibinliu · Jun 26, 2026

Moved back to lora_utils for reuse.

xibinliu · 2026-06-26T16:52:04Z

> Thanks Jackie! A significant thing missing in this PR is using the metadata file on checkpoint restore path.

Added the logic to reuse the metadata for checkpoint restore.

RexBearIU mentioned this pull request Jun 25, 2026

docs: QLoRA Documentation and Notebooks #3970

Merged

4 tasks

shralex requested changes Jun 25, 2026

RexBearIU changed the title ~~feat(lora): serialize and load lora_config.json sidecar metadata~~ feat(lora): save and auto-restore LoRA rank/alpha using native Orbax custom_metadata Jun 25, 2026

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from 187905b to cd17578 Compare June 25, 2026 15:13

shralex reviewed Jun 25, 2026

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from cd17578 to 1b15640 Compare June 25, 2026 16:02

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch from 1b15640 to ae44adc Compare June 25, 2026 16:11

igorts-git reviewed Jun 25, 2026

Comment thread tests/unit/hf_checkpoint_conversion_test.py Outdated

RexBearIU force-pushed the jackyf/lora-ckpt-metadata branch 3 times, most recently from 69c78a7 to a701719 Compare June 26, 2026 02:50

RexBearIU changed the title ~~feat(lora): save and auto-restore LoRA rank/alpha using native Orbax custom_metadata~~ feat(lora): save/restore LoRA config in checkpoint metadata Jun 26, 2026

igorts-git approved these changes Jun 26, 2026

xibinliu force-pushed the jackyf/lora-ckpt-metadata branch from a701719 to 07c5e19 Compare June 26, 2026 16:42

xibinliu force-pushed the jackyf/lora-ckpt-metadata branch 2 times, most recently from 43370d8 to 5940e65 Compare June 26, 2026 23:21

feat(lora): save/restore LoRA config in checkpoint metadata

9bc253e

xibinliu force-pushed the jackyf/lora-ckpt-metadata branch from 5940e65 to 9bc253e Compare June 26, 2026 23:29

RexBearIU wants to merge 1 commit into main from jackyf/lora-ckpt-metadata

4 participants
