
LTX2 Performance tuning#385

Draft
prishajain1 wants to merge 20 commits into main from prisha/ltx2_opt

Conversation

Collaborator

@prishajain1 prishajain1 commented Apr 19, 2026

This PR adds features that improve the performance of the LTX-2 model, along with a fix for the LTX-2 Upsampler, which is currently broken on main.

Performance enhancement features:

  • Sharding fix in NNXSimpleFeedForward: Data is sharded along the sequence dimension, so each device holds a subset of tokens but the full feature channels. Because the input data had replicated features while the weights expected features sharded on the same physical axis (context), XLA was forced to insert an All-Gather on the sequence dimension to resolve the layout conflict, wasting significant time on communication. With our fix:

    • Overall % of wasted time in all-gathers went from 52.56% to 38.07%
    • Generation time per video dropped from 20s to 16.7s
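The layout conflict behind this all-gather can be sketched in a few lines of plain Python (the helper and axis names below are illustrative, not from the codebase):

```python
def needs_allgather(x_spec, w_spec):
    """Illustrative check: does y = x @ W force a gather before the matmul?

    x_spec: (seq_axis, feat_axis) mesh axes for the activations (None = replicated)
    w_spec: (in_feat_axis, out_feat_axis) mesh axes for the weight
    A conflict arises when the weight's contracting (in_features) dim is sharded
    on a mesh axis that the activations use for a different, non-contracting
    dimension: no device then holds the full feature slice it needs locally.
    """
    seq_axis, feat_axis = x_spec
    in_feat_axis, _ = w_spec
    return in_feat_axis is not None and in_feat_axis != feat_axis

# Before the fix: weight in_features sharded on "context", which the
# activations use for the sequence dim -> all-gather on the sequence dim.
print(needs_allgather(("context", None), ("context", "embed")))  # True
# After the fix: weight in_features replicated -> purely local matmul.
print(needs_allgather(("context", None), (None, "embed")))  # False
```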
  • QKV Projection sharding fix: Profiling showed that the QKV Projection step triggered an All-Gather of the input data along the sequence dimension. Because the weights were sharded on the dimension that is summed over (features), a single device could not complete the matrix multiplication using only its local shard of the data, so XLA automatically inserted an All-Gather to replicate the sequence dimension across all devices before performing the multiplication. We changed the weight sharding in attention_ltx2.py to remove sharding on the input feature dimension.

    • Overall % of wasted time in all-gathers went from 38.07% to 19.39%
    • Generation time per video dropped from 16.7s to 13.84s
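A single-device sketch of the fix's intent (the mesh, shapes, and specs below are illustrative; the real change lives in attention_ltx2.py): activations stay sharded on the sequence axis, while the kernel's contracting input-feature dimension is left unsharded so each device can multiply its local tokens without gathering.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# One-device mesh for illustration; on a real TPU slice the "context"
# axis would span many devices.
mesh = Mesh(np.array(jax.devices()[:1]), ("context",))

seq, embed, qkv_dim = 8, 4, 6
# Activations: tokens split across "context", feature channels replicated.
x = jax.device_put(jnp.ones((seq, embed)), NamedSharding(mesh, P("context", None)))
# After the fix: the contracting in_features dim is NOT sharded on "context",
# so the QKV matmul needs no sequence-dimension all-gather.
w = jax.device_put(jnp.ones((embed, qkv_dim)), NamedSharding(mesh, P(None, None)))

y = x @ w  # each device computes its token shard locally
print(y.shape)  # (8, 6)
```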
  • Batching in the text encoder: With Classifier-Free Guidance (CFG) enabled, the text encoder previously ran twice, once each for the positive and negative prompts. We now concatenate the positive and negative prompts along the batch dimension and run a single text encoder pass instead of two.

    • Text encoder time reduced from 3.54s to 3.06s
    • Generation time per video dropped from 13.84s to 13.38s
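The batching change amounts to concatenating before the forward pass and splitting after; `encode` below is a hypothetical stand-in for the real text-encoder call:

```python
import numpy as np

def encode(tokens):
    """Placeholder for the text-encoder forward pass (illustrative only)."""
    return tokens * 2.0

pos = np.ones((1, 16))   # positive-prompt tokens
neg = np.zeros((1, 16))  # negative-prompt tokens

# Before: two separate encoder passes
# pos_emb, neg_emb = encode(pos), encode(neg)

# After: concatenate along the batch dim, run one pass, then split the result
both = encode(np.concatenate([pos, neg], axis=0))
pos_emb, neg_emb = both[:1], both[1:]
print(both.shape)  # (2, 16)
```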
  • JITting the diffusion loop: The previous implementation used a Python for loop to iterate over diffusion timesteps. This created a "Python dispatch wall": the TPU sat idle between consecutive forward passes while waiting for the host CPU to dispatch the next step. We refactored the entire denoising loop to use nnx.scan.

    • The total diffusion time across 40 steps dropped from 7.84s to 7.28s
    • Generation time per video dropped from 13.38s to 12.5s
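The loop refactor has the shape below. The PR uses nnx.scan; jax.lax.scan, shown here, has the same carry/xs structure, and `denoise_step` is a hypothetical stand-in for one transformer forward plus scheduler update:

```python
import jax
import jax.numpy as jnp

def denoise_step(latents, t):
    """Illustrative update rule standing in for one denoising step."""
    new_latents = latents - 0.01 * t * latents
    return new_latents, None  # (carry, per-step output)

timesteps = jnp.linspace(1.0, 0.0, 40)
latents = jnp.ones((1, 8))

# Before: a Python for loop dispatched each of the 40 steps separately,
# leaving the accelerator idle between steps.
# After: a single compiled program runs all 40 steps back to back.
final_latents, _ = jax.lax.scan(denoise_step, latents, timesteps)
print(final_latents.shape)  # (1, 8)
```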

LTX2 Upsampler fix:

  • The current LTX2 Upsampler pipeline raises a ValueError: blur_down is the name of the submodule in the PyTorch state dict from the Hugging Face checkpoint, but the MaxDiffusion Flax implementation names that layer blur. Because our weight loader didn't rename it, nnx.update tried to update a non-existent blur_down attribute.
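A minimal sketch of the required remapping (helper name and keys below are illustrative): checkpoint keys containing blur_down are rewritten to blur before the state is handed to nnx.update.

```python
def rename_blur_keys(state_dict):
    """Illustrative key remap: PyTorch checkpoint name -> Flax module name."""
    return {k.replace("blur_down", "blur"): v for k, v in state_dict.items()}

ckpt = {"upsampler.blur_down.weight": 1.0, "upsampler.conv.weight": 2.0}
fixed = rename_blur_keys(ckpt)
print(sorted(fixed))  # ['upsampler.blur.weight', 'upsampler.conv.weight']
```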

Results

| Version | Execution Time | Status |
| --- | --- | --- |
| Current Main | 20.01s | Video Link |
| After Fix | 12.50s | Video Link |

We also tested the WAN I2V pipelines to ensure no regressions were introduced there; no quality regression or increased latency was observed.

@prishajain1 prishajain1 requested a review from entrpn as a code owner April 19, 2026 07:11

@prishajain1 prishajain1 marked this pull request as draft April 19, 2026 09:00
Comment thread src/maxdiffusion/configs/ltx2_video.yml
Comment on lines 365 to 368
```python
# Out kernel: [in_features (heads), out_features (embed)]
out_kernel_init = nnx.with_partitioning(nnx.initializers.lecun_normal(), ("heads", "embed"))
# Out bias: [out_features (embed)]
out_bias_init = nnx.with_partitioning(nnx.initializers.zeros_init(), ("embed",))
```
Collaborator

@Perseus14 Perseus14 Apr 19, 2026


Since qkv_kernel is changed to (None, "heads"), should we do the same for the out_kernel? Maybe change it to ("heads", None) and out_bias to (None,)

Comment on lines +1652 to +1656
```python
final_carry, _ = nnx.scan(
    scan_body,
    in_axes=(nnx.Carry, 0, None),
    out_axes=(nnx.Carry, 0),
)(initial_carry, timesteps_jax, transformer)
```
Collaborator


We have a config param scan_layers (which regulates transformer blocks, not the iterative pipeline loop) as well as this nnx.scan diffusion loop. Could this change confuse what the scan_layers config controls for developers? Perhaps we can add a comment to the scan_layers config that this controls only transformer blocks

