
multi-GPU VAE Fix for Cosmos 3 #13924

Open
atharvajoshi10 wants to merge 1 commit into huggingface:main from atharvajoshi10:multi-gpu-support


@atharvajoshi10 (Contributor) commented on Jun 11, 2026

multi-GPU VAE fix

What this does

Fixes a cross-device error that surfaced when running Cosmos 3 under sharded (device_map) placement.

Fix VAE latent-norm device mismatch under sharded placement

Under device_map="balanced", vae.encode() runs on the VAE's own device, while the mean/inv_std normalization buffers were moved to x.device. When the input x and the VAE sit on different devices, the normalization mixes tensors from both and raises a cross-device RuntimeError. The fix computes raw_mu first and then moves the normalization buffers to raw_mu's device, so all tensors in the normalization share one device.
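A minimal sketch of the pattern described above, not the actual diff in this PR. The helper name `normalize_latents`, the tensor shapes, and the `(raw_mu - mean) * inv_std` formula are assumptions for illustration. Only the idea of moving `mean` and `inv_std` to `raw_mu`'s device comes from the description.

```python
import torch


def normalize_latents(
    raw_mu: torch.Tensor, mean: torch.Tensor, inv_std: torch.Tensor
) -> torch.Tensor:
    """Normalize encoder output (hypothetical helper, for illustration only)."""
    # Move the buffers to the device of the encoder output, not the device of
    # the original input x, so every operand lives on the same device.
    mean = mean.to(raw_mu.device)
    inv_std = inv_std.to(raw_mu.device)
    # Assumed normalization formula; the real pipeline may differ.
    return (raw_mu - mean) * inv_std


# Stand-in for the output of vae.encode(x); shapes are made up for the example.
raw_mu = torch.ones(1, 4, 2, 8, 8)
mean = torch.full((1, 4, 1, 1, 1), 0.5)
inv_std = torch.full((1, 4, 1, 1, 1), 2.0)

latents = normalize_latents(raw_mu, mean, inv_std)
```

On a single device this behaves like a plain normalization. Under sharded placement, the `.to(raw_mu.device)` calls are what keep the subtraction and multiplication from mixing devices.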

@github-actions (bot) added the documentation, models, pipelines, and size/L (PR with diff > 200 LOC) labels on Jun 11, 2026
@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul (Member) commented

Thanks for your PR!

This needs a bit of library-wise thinking. I'd suggest shipping the VAE fixes in this PR. We can maintain a script in our docs for the time being to show how this can be used.

@github-actions (bot) added the size/S (PR with diff < 50 LOC) label and removed the documentation, models, and size/L (PR with diff > 200 LOC) labels on Jun 12, 2026
@atharvajoshi10 changed the title from "Context parallelism + multi-GPU VAE Fix for Cosmos 3" to "multi-GPU VAE Fix for Cosmos 3" on Jun 12, 2026

Labels: pipelines, size/S (PR with diff < 50 LOC)


3 participants