[ExecuTorch][WebGPU] 2D compute dispatch — lift the 65535 per-dim cap (prefill path) by JulianCloudNTH · Pull Request #20583 · pytorch/executorch

JulianCloudNTH · 2026-06-28T16:22:52Z

Stack from ghstack (oldest at bottom):

Lift the 65535 workgroup-per-dim dispatch cap so single-shot SDPA prefill runs at any sequence length.

Problem: The WebGPU backend is 1D-dispatch-only and throws when a kernel's workgroup count exceeds the device per-dim limit (maxComputeWorkgroupsPerDimension, spec floor 65535). SDPA prefill QK exceeds it around S~362 (softmax/AV at S=2048), blocking single-shot / long-context prefill.

Solution: Fold a >limit 1D workgroup count into 2D; the shader reconstructs the linear index from @builtin(num_workgroups).

Before: compute_1d_workgroup_count throws if count > limit; dispatch (count, 1, 1).
After: compute_2d_workgroup_count returns {count, 1} (fast path) or a near-square {x, y} (x = ceil(sqrt(count)) clamped to limit, y = div_up(count, x)); dispatch (x, y, 1). A flat {limit, div_up(count, limit)} split would idle up to ~half the launched workgroups when count just exceeds limit; the near-square split holds the waste to O(sqrt(count)) (e.g. 65536 -> {256, 256}, 0 inactive).
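
For illustration, the fold described under After can be sketched roughly as follows. This is a minimal sketch and not the code from the PR: the layout of `WgCount`, the name `fold_workgroup_count_2d_sketch`, its signature, and the exception type are assumptions. Only the fast path, the near-square rule, the clamp to `limit`, and the throw when a third dimension would be needed come from the description above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the PR's WgCount; field names are assumed.
struct WgCount {
  uint32_t x;
  uint32_t y;
};

// Sketch of a device-free near-square fold (assumed signature).
inline WgCount fold_workgroup_count_2d_sketch(uint32_t count, uint32_t limit) {
  assert(limit > 0);
  if (count <= limit) {
    return {count, 1};  // fast path: same geometry as the old 1D dispatch
  }
  // x = ceil(sqrt(count)), then corrected in integers so that a rounding
  // error in the floating-point sqrt cannot leave x off by one.
  uint64_t x = static_cast<uint64_t>(
      std::ceil(std::sqrt(static_cast<double>(count))));
  while (x * x < count) ++x;
  while (x > 1 && (x - 1) * (x - 1) >= count) --x;
  if (x > limit) x = limit;  // clamp to the per-dimension limit
  const uint64_t y = (count + x - 1) / x;  // div_up(count, x)
  if (y > limit) {
    throw std::runtime_error("workgroup count needs a third dispatch dimension");
  }
  return {static_cast<uint32_t>(x), static_cast<uint32_t>(y)};
}
```

With `limit = 65535`, a count of 65536 folds to `{256, 256}` with no idle workgroups, whereas the flat split would launch `{65535, 2}`, i.e. 131070 workgroups for 65536 workgroups of work.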

Implementation:

WgCount + pure fold_workgroup_count_2d + compute_2d_workgroup_count in WebGPUUtils.h (device-free, unit-testable; queried_max_workgroups factored out of the 1D path)
WebGPUDispatch.workgroup_count_y (default 1, declared last so existing aggregate inits are unchanged); both dispatchWorkgroups calls + the profiling record pass (x, y, 1)
Per-kernel in-shader reconstruction: thread-form idx = gid.x + gid.y*(num_workgroups.x*wg_size) (QK/AV/add); row-form row_idx = wid.x + wid.y*num_workgroups.x (softmax — keeps a valid predicate, not an early return, so workgroupBarrier()s stay uniform)
Sdpa.cpp: QK/softmax/AV counts via the 2D helper; the dynamic-input_pos resize hook recomputes both x and y for QK
Reference: ET-Vulkan dispatches over natural N-D extents (never folds a flat count nor guards the per-dim limit) and MLX get_2d_grid_dims packs whole tensor dims; for our flattened scalar count the near-square split is the correct no-shape-info analog (a pack-to-limit split would reproduce the idle-half waste)
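
To make the thread-form reconstruction concrete, here is a small CPU-side simulation in C++ (not WGSL, and not code from the PR). It assumes a workgroup size of `(wg_size, 1, 1)`, so `gid.x = wid.x * wg_size + lid.x` and `gid.y = wid.y`; the helper names `linear_thread_index` and `simulate_dispatch` are hypothetical. It checks that every element index is visited exactly once and that the `idx >= total` guard absorbs the over-dispatch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the thread-form expression quoted above:
//   idx = gid.x + gid.y * (num_workgroups.x * wg_size)
inline uint32_t linear_thread_index(uint32_t gid_x, uint32_t gid_y,
                                    uint32_t num_wg_x, uint32_t wg_size) {
  return gid_x + gid_y * (num_wg_x * wg_size);
}

// Simulates a (num_wg_x, num_wg_y, 1) dispatch with workgroup size
// (wg_size, 1, 1) and returns how often each index in [0, total) is visited.
inline std::vector<uint32_t> simulate_dispatch(uint32_t num_wg_x,
                                               uint32_t num_wg_y,
                                               uint32_t wg_size,
                                               uint32_t total) {
  assert(wg_size > 0);
  std::vector<uint32_t> hits(total, 0);
  for (uint32_t wid_y = 0; wid_y < num_wg_y; ++wid_y) {
    for (uint32_t wid_x = 0; wid_x < num_wg_x; ++wid_x) {
      for (uint32_t lid_x = 0; lid_x < wg_size; ++lid_x) {
        const uint32_t gid_x = wid_x * wg_size + lid_x;
        const uint32_t gid_y = wid_y;  // workgroup size y is 1
        const uint32_t idx =
            linear_thread_index(gid_x, gid_y, num_wg_x, wg_size);
        if (idx >= total) continue;  // guard for threads past the end
        ++hits[idx];
      }
    }
  }
  return hits;
}
```

For example, 1000 elements at `wg_size = 4` need 250 workgroups; folded to `{16, 16}` that launches 1024 threads, of which the last 24 fall through the guard.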

Constraints:

y=1 fast path keeps every non-folded dispatch byte-identical to the prior 1D path
Scope = prefill path only; rms_norm/embedding/lm_head/update_cache are row/token-indexed and never hit the cap, so they keep the 1D path
Throws if a 3rd dispatch dimension would be needed — unreachable for real prefill (the uint32 element guard fires first at S~11585)

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D109517684

[ghstack-poisoned]

pytorch-bot · 2026-06-28T16:22:56Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20583

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job

As of commit a0b5f02 with merge base 124b677:

NEW FAILURE - The following job has failed:

pull / test-binary-size-linux-gcc / linux-job (gh)
RuntimeError: Command docker exec -t 1a4a0520403b7c463b3f058b666d03c6d510b7adbb6ec44bb5b0efe0dff5298c /exec failed with exit code 1

CANCELLED JOB - The following job was cancelled. Please retry:

pull / unittest-nxp-neutron / linux-job (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-28T16:23:33Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

JulianCloudNTH · 2026-06-29T20:21:29Z

@claude review and check for any areas or opportunities for modularization

claude · 2026-06-29T20:22:20Z

Claude finished @JulianCloudNTH's task in 3m 55s

Review: WebGPU 2D compute dispatch (#20583)

Read the full diff
Review correctness & WGSL shader changes
Assess modularization opportunities
Post review

Solid, well-scoped change. The y=1 fast path keeping non-folded dispatches byte-identical is the right call, the device-free fold_workgroup_count_2d is genuinely unit-testable, and keeping a valid predicate (not an early return) in softmax to preserve workgroupBarrier() uniformity is exactly correct WGSL — that's the subtle bug this avoids. Correctness of the index reconstruction checks out: for the thread-form shaders num_workgroups.x == max_count and lid.y == 0 (workgroup_size y is 1), so idx = gid.x + gid.y*(num_workgroups.x*wg_size) reconstructs the linear thread id, and the idx >= total guard absorbs the over-dispatch from rounding y up. 👍

🔴 Blocking: lint failure (CI red)

backends/webgpu/runtime/ops/add/BinaryOp.cpp:167 — clang-format wants the add_resize lambda capture list reflowed (it fits on fewer lines now). This is the only CI failure. Just run:

lintrunner -a

Modularization opportunities

1. The WebGPUDispatch aggregate-init is the main smell. To avoid touching the ~10 unchanged call sites, workgroup_count_y was declared after kernel_name (WebGPUGraph.h:51), which forces the awkward positional init in BinaryOp.cpp:162:

graph.add_dispatch({pipeline, bind_group, workgroup_count.x, "", workgroup_count.y});
//                                                            ^^ kernel_name placeholder

A field that's logically adjacent to workgroup_count_x now sits two members away purely for init-ordering, and callers must thread an empty-string placeholder. Cleaner options:

Add an add_dispatch overload taking utils::WgCount directly, so the binary-op path reads graph.add_dispatch({pipeline, bind_group, wgc}) with no placeholder; or
Keep x/y adjacent in the struct and update the other call sites (they default y=1 trivially).
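
A rough sketch of the first option, for illustration only. The member order of `WebGPUDispatch` is inferred from the positional init quoted above; the member types (opaque pointers here), the `WgCount` layout, and the helper name `make_dispatch` are assumptions, not the codebase's real definitions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical stand-in for utils::WgCount.
struct WgCount {
  uint32_t x;
  uint32_t y;
};

// Stand-in for WebGPUDispatch; the real struct holds WebGPU handles,
// which are replaced by opaque pointers in this sketch.
struct WebGPUDispatch {
  const void* pipeline;
  const void* bind_group;
  uint32_t workgroup_count_x;
  std::string kernel_name;
  uint32_t workgroup_count_y = 1;  // declared last, defaults to 1
};

// Builds a dispatch from a single geometry value, so callers never have to
// thread an empty kernel_name placeholder just to reach workgroup_count_y.
inline WebGPUDispatch make_dispatch(const void* pipeline,
                                    const void* bind_group,
                                    WgCount wgc,
                                    std::string kernel_name = "") {
  assert(wgc.x > 0 && wgc.y > 0);
  return WebGPUDispatch{pipeline, bind_group, wgc.x, std::move(kernel_name),
                        wgc.y};
}
```

A call site would then read something like `graph.add_dispatch(make_dispatch(pipeline, bind_group, wgc))`, keeping x and y together as one value.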

2. dispatch.x = wgc.x; dispatch.y = wgc.y; is repeated 4× across the resize hooks (BinaryOp.cpp:186-187, Sdpa.cpp:643-644,651-652,661-662). A one-liner setter removes the copy-paste and the risk of updating one without the other:

inline void set_workgroup_count(WebGPUDispatch& d, utils::WgCount c) {
  d.workgroup_count_x = c.x;
  d.workgroup_count_y = c.y;
}

3. build_dispatch in Sdpa.cpp gained a separate workgroup_count_y param (between workgroup_count_x and wg_size). Passing a single utils::WgCount would shrink each of the three call sites (QK/softmax/AV) and the update_cache call (which passes a literal 1) and keep the dispatch geometry as one value.

4. Minor: reuse div_up. compute_2d_workgroup_count (WebGPUUtils.h:104) and fold_workgroup_count_2d:73 open-code (a + b - 1) / b, while compute_1d_workgroup_count:88 calls div_up. Use div_up in both for consistency.

5. Shader-side duplication (noting, not necessarily actionable). The thread-form reconstruction let idx = gid.x + gid.y * (num_workgroups.x * wg_size); is now identical in binary_add.wgsl, sdpa_compute_attn_weights.wgsl, and sdpa_compute_out.wgsl. WGSL has no include mechanism in this codebase, so a shared snippet isn't free — but if more ops adopt the 2D fold, a generated prelude string concatenated at pipeline-build time would centralize it. Fine to leave as-is for this PR's scope.
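
If that route is ever taken, the prelude idea could look roughly like the sketch below. Everything in it is hypothetical: the WGSL helper `fold_linear_idx`, the C++ names `fold_prelude_wgsl` and `with_fold_prelude`, and the assumption that shader sources are available as strings at pipeline-build time.

```cpp
#include <cassert>
#include <string>

// Hypothetical shared WGSL helper holding the thread-form reconstruction.
inline const std::string& fold_prelude_wgsl() {
  static const std::string kPrelude =
      "fn fold_linear_idx(gid: vec3<u32>, num_wg: vec3<u32>, "
      "wg_size: u32) -> u32 {\n"
      "  return gid.x + gid.y * (num_wg.x * wg_size);\n"
      "}\n";
  return kPrelude;
}

// Prepends the shared helper to a shader body before the module is created,
// so each kernel can call fold_linear_idx instead of repeating the formula.
inline std::string with_fold_prelude(const std::string& shader_body) {
  assert(!shader_body.empty());
  return fold_prelude_wgsl() + shader_body;
}
```

Each shader would then call `fold_linear_idx(gid, num_workgroups, wg_size)` rather than open-coding the expression.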

Nits / questions

num_workgroups.x * wg_size overflow. For add, wg_size=256 and num_workgroups.x ≤ 65535 → ~16.7M, well within u32. The PR notes the uint32 element guard fires first at S~11585, so this is safe today — but the product is implicit; if a future op uses a large wg_size with a near-limit x it could overflow silently. A brief comment in the shaders stating the safe envelope would help the next reader.
queried_max_workgroups is called per dispatch (build-time, not hot path), each doing a wgpuDeviceGetLimits. Negligible, just flagging it isn't cached.
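
As a sketch of how the safe envelope from the first nit could be checked on the host side, the helper below tests whether the row stride `num_workgroups.x * wg_size` fits in a `u32`. The name `stride_fits_u32` is hypothetical and nothing like it is claimed to exist in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Returns true if num_workgroups.x * wg_size is representable as a u32,
// i.e. the stride used in the thread-form index cannot wrap in the shader.
inline bool stride_fits_u32(uint32_t num_workgroups_x, uint32_t wg_size) {
  assert(wg_size > 0);
  // Multiply in 64 bits so the check itself cannot overflow.
  const uint64_t stride = static_cast<uint64_t>(num_workgroups_x) * wg_size;
  return stride <= std::numeric_limits<uint32_t>::max();
}
```

For the add kernel's worst case, 65535 × 256 = 16,776,960, which is far below the `u32` maximum of 4,294,967,295.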

None of the modularization items are blockers — only the lint fix is required to turn CI green. Items 1–3 would meaningfully reduce the positional-init fragility if you want to fold them in.

[ghstack-poisoned]

Update

d54c2d2

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 28, 2026 16:22 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed label Jun 28, 2026

Update

0037674

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 29, 2026 22:10 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 29, 2026

Update

3dba1d3

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 30, 2026 02:46 — with GitHub Actions Inactive

Update

f98c501

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 30, 2026 21:10 — with GitHub Actions Inactive

JulianCloudNTH mentioned this pull request Jun 30, 2026

[ExecuTorch][WebGPU] 2D-fold mul + permute dispatch (lift 65535 1D cap) #20651

Merged

Update

8e56149

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence July 2, 2026 23:14 — with GitHub Actions Inactive

JulianCloudNTH mentioned this pull request Jul 2, 2026

[ExecuTorch][WebGPU] Convert remaining native tests to GTest #20706

Merged

psiddh approved these changes Jul 3, 2026

Update

65799d1

[ghstack-poisoned]

JulianCloudNTH had a problem deploying to cadence July 3, 2026 20:28 — with GitHub Actions Error

JulianCloudNTH temporarily deployed to cadence July 3, 2026 20:28 — with GitHub Actions Inactive

Update

4c2cb5e

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence July 3, 2026 20:52 — with GitHub Actions Inactive

JulianCloudNTH had a problem deploying to cadence July 3, 2026 21:21 — with GitHub Actions Error

Update

a0b5f02

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence July 3, 2026 21:37 — with GitHub Actions Inactive

JulianCloudNTH temporarily deployed to cadence July 3, 2026 22:06 — with GitHub Actions Inactive

meta-codesync Bot merged commit 99d36d6 into gh/JulianCloudNTH/75/base Jul 4, 2026
180 of 183 checks passed

meta-codesync Bot deleted the gh/JulianCloudNTH/75/head branch July 4, 2026 17:06

meta-codesync Bot temporarily deployed to cherry-pick-bot July 4, 2026 17:06 Inactive

pytorchbot mentioned this pull request Jul 4, 2026

[ExecuTorch][WebGPU] 2D compute dispatch — lift the 65535 per-dim cap (prefill path) #20722

Merged

meta-codesync[bot] merged 9 commits into gh/JulianCloudNTH/75/base from gh/JulianCloudNTH/75/head
