
Add MoE load balancing loss to distillation #3679

Open
JamesDeng42 wants to merge 1 commit into main from yujiedeng/load-balance-loss

Conversation

@JamesDeng42 JamesDeng42 (Collaborator) commented Apr 16, 2026

Description

This PR introduces support for Mixture of Experts (MoE) load balancing loss during the distillation workflow.

Key Changes

  1. NNX Intermediate Extraction (maxtext_utils.py & qwen3.py):
    • Replaced legacy Linen self.sow(...) calls with native nnx.Intermediate variables that record load_balance_loss inside the MoE layers.
  2. Distillation Strategy Updates (distillation_utils.py & train_distill.py):
    • Extended DistillationForwardOutput to carry the collected moe_lb_loss.
    • Updated CombinedDistillationStrategy to add moe_lb_loss to total_loss so the optimizer minimizes it.
    • Surfaced "distill/moe_lb_loss" in the metrics dictionary for TensorBoard logging and visibility.
  3. Model Mutability (models.py):
    • Automatically appended "intermediates" to the mutable_collections list during the Transformer's forward pass
      whenever load_balance_loss_weight > 0.0, so NNX variables can successfully write to the state.
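For context, the auxiliary load-balancing loss that MoE layers typically sow is the Switch Transformer-style term: num_experts times the sum over experts of (fraction of tokens routed to the expert) times (mean router probability for the expert). This is a minimal illustrative sketch in plain Python, not the actual MaxText kernel; the function and argument names are hypothetical.

```python
def load_balance_loss(router_probs, expert_assignments, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_e(frac_e * mean_prob_e).

    router_probs: list of per-token probability vectors over experts.
    expert_assignments: the expert index each token was routed to (top-1).
    A perfectly uniform router yields a loss of 1.0, the minimum.
    """
    n_tokens = len(expert_assignments)
    loss = 0.0
    for e in range(num_experts):
        # Fraction of tokens dispatched to expert e.
        frac = sum(1 for a in expert_assignments if a == e) / n_tokens
        # Mean router probability assigned to expert e.
        mean_p = sum(p[e] for p in router_probs) / n_tokens
        loss += frac * mean_p
    return num_experts * loss
```

Minimizing this term pushes the router toward a uniform token distribution across experts, which is why the PR wires it into the loss the optimizer sees.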
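The strategy-update step (change 2) can be sketched as follows. This is a hypothetical simplification of the described flow, not the actual distillation_utils.py code: the dataclass fields, the `load_balance_loss_weight` knob, and the metric names other than "distill/moe_lb_loss" are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class DistillationForwardOutput:
    """Carries the per-step distillation loss plus the collected MoE aux loss."""
    distill_loss: float
    moe_lb_loss: float = 0.0  # 0.0 for dense models / when the collection is disabled

def combine_losses(out: DistillationForwardOutput, load_balance_loss_weight: float):
    """Fold the weighted load-balancing loss into the total so the optimizer minimizes it."""
    total_loss = out.distill_loss
    metrics = {"distill/loss": out.distill_loss}
    if load_balance_loss_weight > 0.0:
        total_loss += load_balance_loss_weight * out.moe_lb_loss
        # Surface the raw (unweighted) aux loss for TensorBoard visibility.
        metrics["distill/moe_lb_loss"] = out.moe_lb_loss
    metrics["distill/total_loss"] = total_loss
    return total_loss, metrics
```

Note the gating on `load_balance_loss_weight > 0.0` mirrors the model-mutability change: the intermediates collection is only made writable, and the metric only emitted, when the aux loss is actually in use.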

Tests

Added "distill/moe_lb_loss" to the expected metric keys in train_distill_test.py to prevent regressions.
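The shape of such a regression check might look like the following. This is an illustrative sketch only, not the actual train_distill_test.py code; the helper name and key set are hypothetical.

```python
# Expected metric keys for a distillation step; "distill/moe_lb_loss" is the
# new key added by this PR (the other key is a stand-in for the real set).
EXPECTED_METRIC_KEYS = {"distill/total_loss", "distill/moe_lb_loss"}

def check_metric_keys(metrics: dict) -> None:
    """Fail if any expected metric key is missing from a step's metrics dict."""
    missing = EXPECTED_METRIC_KEYS - metrics.keys()
    assert not missing, f"missing metric keys: {missing}"
```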

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • [ ] I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • [ ] I have necessary comments in my code, particularly in hard-to-understand areas.
  • [ ] I have run end-to-end tests and provided workload links above if applicable.
  • [ ] I have made or will make corresponding changes to the documentation if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@codecov codecov Bot commented Apr 16, 2026

JamesDeng42 force-pushed the yujiedeng/load-balance-loss branch 2 times, most recently from 69341c6 to 414e2f0 on April 16, 2026 at 00:24
Comment thread src/maxtext/utils/maxtext_utils.py Outdated
Comment thread src/maxtext/models/models.py
Comment thread src/maxtext/trainers/post_train/distillation/distillation_utils.py
Comment thread src/maxtext/utils/maxtext_utils.py Outdated
Comment thread src/maxtext/trainers/post_train/distillation/distillation_utils.py
Comment thread src/maxtext/trainers/post_train/distillation/distillation_utils.py
JamesDeng42 force-pushed the yujiedeng/load-balance-loss branch from 414e2f0 to 9628440 on April 21, 2026 at 18:22
vlad-karp (Collaborator) left a comment:

please address the comment

JamesDeng42 force-pushed the yujiedeng/load-balance-loss branch 3 times, most recently from ae5c4cf to 74012ee on April 21, 2026 at 22:35
Comment thread src/maxtext/trainers/post_train/distillation/distillation_utils.py Outdated
JamesDeng42 force-pushed the yujiedeng/load-balance-loss branch from 74012ee to 4d546e3 on April 21, 2026 at 23:56
Comment thread src/maxtext/models/qwen3.py
Comment thread src/maxtext/trainers/post_train/distillation/train_distill.py Outdated
JamesDeng42 force-pushed the yujiedeng/load-balance-loss branch 3 times, most recently from ccd17dc to 2a27030 on April 22, 2026 at 18:47
JamesDeng42 force-pushed the yujiedeng/load-balance-loss branch from 2a27030 to 2833dab on April 22, 2026 at 23:43
gagika (Collaborator) left a comment:

thanks

4 participants