[torch.compile] Bunch of small changes needed for enabling torch.compile by pggPL · Pull Request #3130 · NVIDIA/TransformerEngine

pggPL · 2026-06-15T14:41:43Z

Description

Small standalone fixes extracted from a larger torch.compile branch, going directly from main. Two independent changes: making Userbuffers pybind11 queries compile-friendly, and freeing quantized grad_output early for column-parallel SP. Plus a custom-recipe caching fix, a split-accumulator refactor, and a CI test hook-up.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Userbuffers pybind11 queries under torch.compile

is_fp8_ubuf() / with_cublasmp() are compile-time constants but graph-break when traced. At the nn.Module.forward boundary (where no UB communicator object is in hand yet) they go through get_ub_is_fp8(name, use_fp8), wrapped in torch.compiler.assume_constant_result — only plain (str, bool) args are baked, so guards are well-defined and don't rely on pybind-object identity.
In the hot forward/backward implementation paths the UB communicator is already fetched, so those call ub_obj.is_fp8_ubuf() / ub_obj.with_cublasmp() directly — no wrapper, no string concatenation, no redundant registry lookup. Eager speed is preserved.

Free quantized grad_output early for column-parallel SP

Row-parallel SP already called clear_tensor_data(grad_output) on the gathered tensor. Column-parallel SP quantizes grad_output to a Float8TensorStorage (an internal tensor) but never freed it. Under torch.compile reduce-overhead this left live pool tensors at recording end ("Detected N tensor(s) in the cudagraph pool not tracked as outputs"). The free now covers row-SP and column-SP-FP8 (column-SP non-FP8 is a no-op view, so it's excluded).

Replace fp8_recipe in LinearBwdArgs with pre-resolved split-accumulator booleans

LinearBwdArgs no longer carries the recipe object (which holds process-group references and is compile-unfriendly). dgrad_use_split_accumulator / wgrad_use_split_accumulator are resolved once in Linear.forward (reusing the existing get_fp8_recipe() call) and threaded through as plain booleans.

Custom-recipe quantizer caching fix

CustomRecipeState early-exit was missing an identity check, so quantizers were rebuilt on every forward even when the recipe was unchanged. Added if recipe_state.recipe is recipe: return.

Test hook-up

Added test_torch_compile.py to L0_pytorch_unittest.

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

…stants; fix SP memory leak; test suite hook-up Wrap CommOverlapCore pybind11 methods that return compile-time constants so torch.compile(fullgraph=True) can trace through them without graph breaks: - `is_fp8_ubuf()` → `ub_is_fp8()` / `get_ub_is_fp8()` in base.py; `_ub_is_fp8()` in gemm.py - `with_cublasmp()` → `ub_is_cublasmp()` in base.py All callers in linear.py, layernorm_linear.py, layernorm_mlp.py, base.py, gemm.py, userbuffers_backward_linear.py and userbuffers_forward_linear.py updated. Fix quantized grad_output not being freed early for column-parallel SP backward. Row-parallel SP already called clear_tensor_data(grad_output) to release the gathered tensor; column-parallel SP quantizes grad_output to Float8TensorStorage but never freed it before returning. Under torch.compile reduce-overhead this leaves 3 live pool tensors at recording end and triggers "Detected 3 tensor(s) in the cudagraph pool not tracked as outputs". Extend the existing clear_tensor_data guard to cover both parallel modes. Fix custom-recipe quantizer state being re-initialised on every forward call even when the recipe object has not changed. The existing early-exit for CustomRecipeState was missing an identity check on the recipe object, so any repeated call with the same recipe would bypass the early-return and rebuild quantizers unnecessarily. Add `if recipe_state.recipe is recipe: return` to restore the intended caching behaviour. Add test_torch_compile.py to L0_pytorch_unittest so the autocast and existing compile tests run in CI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

…-accumulator booleans LinearBwdArgs stored the entire FP8 recipe object so the backward could extract fp8_gemm_dgrad.use_split_accumulator and fp8_gemm_wgrad.use_split_accumulator at GEMM time. Recipe objects hold process-group references and are not serialisable as compile-time constants, making them incompatible with torch.compile custom-op paths. Replace fp8_recipe with two plain bool fields: - dgrad_use_split_accumulator (default _2X_ACC_DGRAD) - wgrad_use_split_accumulator (default _2X_ACC_WGRAD) These are resolved once in _linear_setup_ctx and passed into the args struct, so the backward consumes scalars instead of a live recipe object. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-06-15T14:49:03Z

Greptile Summary

This PR bundles five targeted torch.compile-readiness fixes extracted from a larger branch. Each change is independent and non-breaking.

UB pybind11 graph-break fix: get_ub_is_fp8() wraps is_fp8_ubuf() with @torch.compiler.assume_constant_result so the module boundary calls are folded as compile-time constants; destroy_ub() now calls torch.compiler.reset() to invalidate those constants on teardown.
Column-parallel SP FP8 pool-tensor leak: clear_tensor_data(grad_output) is extended to cover the column-parallel SP FP8 path (in addition to row-SP) in both linear.py and layernorm_linear.py — the quantized Float8TensorStorage is freed immediately after the wgrad GEMM, after which it is no longer referenced.
LinearBwdArgs recipe-object removal: fp8_recipe is replaced by two plain booleans (dgrad_use_split_accumulator, wgrad_use_split_accumulator) resolved once at forward time, eliminating a compile-unfriendly process-group reference from the autograd context.
CustomRecipeState rebuild fix: Adds an is identity guard so quantizers are not re-built on every forward when the custom recipe object is unchanged.
CI hook-up: test_torch_compile.py is added to the L0 test suite; the test's qfactory is corrected to dispatch on QuantizerRole.tensor_type (the previous string-keyed dict would have raised KeyError at runtime).

Confidence Score: 5/5

All changes are well-scoped and non-breaking; the column-SP FP8 tensor free is correctly placed after the wgrad GEMM and the freed variable is not accessed again.

Each change is narrowly targeted: the split-accumulator refactor preserves the same defaults for non-FP8 and reproduces the same hasattr guards for FP8 recipes; the grad_output free only triggers after the GEMM that consumed it; the CustomRecipeState identity check uses is which is appropriate for a mutable recipe object; and torch.compiler.reset() in destroy_ub() correctly handles stale cached constants. No correctness regressions were found across the traced code paths.

No files require special attention.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/module/base.py	Adds get_ub_is_fp8() wrapper with @assume_constant_result, adds identity check to CustomRecipeState early-exit, and calls torch.compiler.reset() in destroy_ub() to invalidate cached constants after UB teardown.
transformer_engine/pytorch/module/linear.py	Replaces fp8_recipe in LinearBwdArgs with pre-resolved split-accumulator booleans, extends grad_output early-free to column-parallel SP FP8 (after wgrad GEMM), and switches UB fp8 queries to get_ub_is_fp8().
transformer_engine/pytorch/module/layernorm_linear.py	Applies the same column-parallel SP FP8 grad_output early-free as linear.py, and switches UB fp8 queries to get_ub_is_fp8(). ctx.fp8_recipe remains for split-accumulator (not in scope of this PR).
transformer_engine/pytorch/module/layernorm_mlp.py	Single-line change: switches get_ub(...).is_fp8_ubuf() to get_ub_is_fp8() in the fc2_fprop path for compile-time constant folding.
tests/pytorch/test_torch_compile.py	Adds get_quantizer_roles() to ToyLinear module and fixes qfactory to dispatch on QuantizerRole.tensor_type; previous string keys like 'linear_input' would have caused a KeyError.
qa/L0_pytorch_unittest/test.sh	Adds test_torch_compile.py to the L0 CI test suite.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant M as Linear.forward()
    participant GSM as FP8GlobalStateManager
    participant FA as LinearFwdArgs
    participant BA as LinearBwdArgs
    participant BW as _linear_backward()

    M->>GSM: get_fp8_recipe()
    GSM-->>M: _recipe
    M->>M: resolve dgrad/wgrad split-accumulator bools
    M->>FA: fwd_args(dgrad_use_split_accumulator, wgrad_use_split_accumulator)
    FA->>BA: _linear_setup_ctx copies plain bools
    Note over FA,BA: No recipe object stored

    BW->>BW: "use_split_accumulator = bwd_args.dgrad_use_split_accumulator"
    BW->>BW: dgrad GEMM
    BW->>BW: "use_split_accumulator = bwd_args.wgrad_use_split_accumulator"
    BW->>BW: wgrad GEMM
    BW->>BW: clear_tensor_data(grad_output) [row-SP or col-SP+FP8]

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant M as Linear.forward()
    participant GSM as FP8GlobalStateManager
    participant FA as LinearFwdArgs
    participant BA as LinearBwdArgs
    participant BW as _linear_backward()

    M->>GSM: get_fp8_recipe()
    GSM-->>M: _recipe
    M->>M: resolve dgrad/wgrad split-accumulator bools
    M->>FA: fwd_args(dgrad_use_split_accumulator, wgrad_use_split_accumulator)
    FA->>BA: _linear_setup_ctx copies plain bools
    Note over FA,BA: No recipe object stored

    BW->>BW: "use_split_accumulator = bwd_args.dgrad_use_split_accumulator"
    BW->>BW: dgrad GEMM
    BW->>BW: "use_split_accumulator = bwd_args.wgrad_use_split_accumulator"
    BW->>BW: wgrad GEMM
    BW->>BW: clear_tensor_data(grad_output) [row-SP or col-SP+FP8]

_{Reviews (3): Last reviewed commit: "Provide explicit QuantizerRoles in torch..." | Re-trigger Greptile}

pggPL · 2026-06-16T11:18:31Z

/te-ci pytorch L1

…t_result get_ub_is_fp8 bakes is_fp8_ubuf() as a compile-time constant; without a reset, destroy_ub + re-init with different FP8 settings would read stale values until recompile. Only affects in-memory caches, not disk. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

ToyLinear now overrides get_quantizer_roles so CustomRecipeState doesn't hit the no-roles warning, which graph-breaks under fullgraph=True. qfactory dispatches on role.tensor_type instead of a pre-baked string key. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

IvanYashchuk · 2026-06-22T10:35:04Z

+    # Compiled graphs may have baked is_fp8_ubuf() via assume_constant_result;
+    # reset so re-init with different settings doesn't read stale constants.
+    torch.compiler.reset()


The current helper call sites are all inside @no_torch_dynamo() forwards and the added test_torch_compile.py coverage does not exercise user buffers or it's done implicitly in the test?

Is it possible avoid a process-wide compiler reset on UB teardown, or add a targeted compiled UB test that proves the stale-constant case and justifies this global invalidation?

IvanYashchuk · 2026-06-22T10:38:29Z

 python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_nvfp4.xml $TE_PATH/tests/pytorch/nvfp4 || test_fail "test_nvfp4"
 python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_mxfp8.xml $TE_PATH/tests/pytorch/mxfp8 || test_fail "test_mxfp8"
 python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_quantized_tensor.xml $TE_PATH/tests/pytorch/test_quantized_tensor.py || test_fail "test_quantized_tensor.py"
+python3 -m pytest --tb=auto --junitxml=$XML_LOG_DIR/pytest_test_torch_compile.xml $TE_PATH/tests/pytorch/test_torch_compile.py || test_fail "test_torch_compile.py"


That file only compiles a local ToyLinear helper and torch.nn.Linear under te.autocast. It does not instantiate changed in this PR te.Linear, LayerNormLinear, or LayerNormMLP, and it has no UB, sequence_parallel/parallel_mode.

What tests would fail without changes to layernorm_linear, layernorm_mlp files?

I fix the issue that the test was not connected to the CI.
Currently it tests only if te.autocast() can be traced inside torch.compile.

This is first of series of PRs and I change here only small things to make next PRs cleaner.

pggPL and others added 2 commits June 15, 2026 16:40

pggPL requested a review from ksivaman as a code owner June 15, 2026 14:41

[pre-commit.ci] auto fixes from pre-commit.com hooks

afe364b

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jun 15, 2026

View reviewed changes

Comment thread transformer_engine/pytorch/module/base.py

pggPL added 2 commits June 16, 2026 14:05

pggPL mentioned this pull request Jun 17, 2026

[PyTorch][torch.compile] Decouple amax reduction group from the quantizer #3104

Open

13 tasks

IvanYashchuk reviewed Jun 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[torch.compile] Bunch of small changes needed for enabling torch.compile#3130

[torch.compile] Bunch of small changes needed for enabling torch.compile#3130
pggPL wants to merge 5 commits into
NVIDIA:mainfrom
pggPL:torch_compile_small_fixes

pggPL commented Jun 15, 2026

Uh oh!

greptile-apps Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

pggPL commented Jun 16, 2026

Uh oh!

IvanYashchuk Jun 22, 2026

Uh oh!

IvanYashchuk Jun 22, 2026

Uh oh!

pggPL Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pggPL commented Jun 15, 2026

Description

Type of change

Changes

Checklist:

Uh oh!

greptile-apps Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

pggPL commented Jun 16, 2026

Uh oh!

IvanYashchuk Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

IvanYashchuk Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

pggPL Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps Bot commented Jun 15, 2026 •

edited

Loading