[Klaud Cold] Update minimaxm2.5-fp8-mi325x-vllm vLLM ROCm image to v0.21.0 by functionstackx · Pull Request #1469 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-18T02:15:54Z

Summary

Update vLLM ROCm image from v0.18.0 (50d old) to v0.21.0

Recipes touched: `minimaxm2.5-fp8-mi325x-vllm`

Test plan

full-sweep-enabled sweep passes.

🤖 Generated with Claude Code

github-actions · 2026-05-18T02:16:02Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-18T02:16:20Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009960602
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26009960602

claude

LGTM — straightforward vLLM ROCm image version bump with corresponding perf-changelog entry.

Overview

This PR bumps the vLLM ROCm image for the minimaxm2.5-fp8-mi325x-vllm recipe from v0.18.0 to v0.21.0 in .github/configs/amd-master.yaml, and adds a matching entry to perf-changelog.yaml with the PR link. Only two files are touched and the diff is two single-line changes plus a new changelog block.

Security risks

None. The change is purely a container image tag bump in benchmarking config; no auth, crypto, permission, or user-input code is involved. The new tag (v0.21.0) is a published vLLM ROCm release and is already in use by sibling recipes in the same file (e.g. kimik2.5-fp4-mi355x-vllm, minimaxm2.5-fp8-mi300x-vllm, minimaxm2.5-fp4-mi355x-vllm).

Level of scrutiny

Low. This is mechanical full-sweep-enabled automation output — a benchmarking recipe image bump, following an established pattern across the file. Test plan is the full sweep itself, which gates correctness post-merge.

Other factors

No bugs were flagged by the hunting system, the changelog entry is well-formed and links back to this PR, and the new image version is already validated by adjacent recipes.

functionstackx · 2026-05-18T02:30:53Z

Diagnosis

All 9 failed jobs show the same enroot container import failure, not an OOM error. During `enroot import` of `vllm/vllm-openai-rocm:v0.21.0`, the "Converting whiteouts" phase emits roughly 30 instances of `enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted` (exit code 34). The subsequent `srun --container-image=...` then fails with `pyxis: No such file or directory: .../vllm_vllm-openai-rocm_v0.21.0.sqsh` because the squash file was never created.

This is not a vLLM or script issue: the same v0.21.0 ROCm image passes on mi300x (#1405, merged) and mi355x (#1406-#1410, merged). Sibling mi325x PRs #1467 and #1468 (also v0.21.0) show the identical enroot failure, which confirms that the problem is specific to the mi325x runner infrastructure (likely a missing CAP_SETFCAP or similar capability on chi-mi325x-pod1-121.ord.vultr.cpe.ice.amd.com).
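The triage described above can be sketched as a small log classifier. This is a minimal illustration, not tooling from this repository: the two signature strings are quoted from the diagnosis, while the function name and the returned labels are hypothetical.

```python
import re

# Signatures quoted from the diagnosis; the labels returned below are hypothetical.
SETCAP_ERROR = "enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted"
MISSING_SQSH = re.compile(r"pyxis: No such file or directory: \S*\.sqsh")


def classify_failure(log_text: str) -> str:
    """Label a job log by which of the two enroot/pyxis signatures it contains."""
    setcap_hits = log_text.count(SETCAP_ERROR)
    missing_sqsh = MISSING_SQSH.search(log_text) is not None
    if setcap_hits and missing_sqsh:
        # Import failed during whiteout conversion, so the .sqsh was never written.
        return "enroot-import-failed"
    if setcap_hits:
        return "enroot-setcap-error"
    if missing_sqsh:
        return "sqsh-missing"
    return "other"
```

A log that contains both signatures points at the runner node rather than the recipe, which matches the conclusion that no script change is needed.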

Failed run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26009962520

No code fix pushed

No script/config change will fix this -- the enroot aufs2ovlfs converter needs CAP_SETFCAP capability on the runner node. This requires runner infra intervention (e.g., updating the enroot config or Slurm SPANK plugin settings on the mi325x nodes). A re-run after the infra fix should pass without any recipe changes.

Recommendation

Check if the mi325x runners recently had an OS/kernel/enroot update that dropped capabilities.
Alternatively, check if ENROOT_ALLOW_SUPERUSER=y or --allow-superuser is set in enroot config on mi300x/mi355x but not mi325x.
Once infra is fixed, re-run the sweep -- the script at benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh needs no changes for this failure mode.
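One way to compare nodes for the suspected capability gap is to inspect the capability masks in `/proc/self/status` from a job step on each node. The sketch below assumes that the effective set (`CapEff`) is the relevant one, which may not hold if enroot runs the converter inside a user namespace; the function name is hypothetical. `CAP_SETFCAP` is capability number 31 in the Linux capability list.

```python
CAP_SETFCAP = 31  # bit index of CAP_SETFCAP in the Linux capability masks


def has_cap_setfcap(status_text: str, field: str = "CapEff") -> bool:
    """Return True if the given capability mask in /proc/<pid>/status has CAP_SETFCAP set.

    `status_text` is the full text of a /proc/<pid>/status file; `field` selects
    which mask to inspect (CapEff by default, an assumption for this sketch).
    """
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            mask = int(line.split(":", 1)[1].strip(), 16)
            return bool(mask & (1 << CAP_SETFCAP))
    raise ValueError(f"{field} not found in status text")
```

Running this against the contents of `/proc/self/status` on a healthy node and on the failing node would show whether the capability differs between them, without changing any recipe.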

Root-caused via the failed sweeps on #1467, #1468, #1469 (all three [Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every failure landed on chi-mi325x-pod1-121 with `enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted` before the .sqsh import even completes; the subsequent pyxis mount then fails with "No such file or directory". The same image works cleanly on every other up node (017/018/019/020/027), which confirms this is not OOM and not a recipe issue.

This matches the existing pattern for mi300x in #1462 (pin salloc away from chronically-bad nodes). For mi325x there is currently only the one node to exclude, so use --exclude rather than --nodelist so we don't have to maintain the allow-list as nodes come and go.

pod1-121 has separately been drained on the controller with a watchdog (per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix the underlying setcap regression.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
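The deny-list approach can be sketched as follows. This is an illustration only and not the change made in #1477: the helper name and the partition value are hypothetical, while `--partition` and `--exclude` are standard Slurm `salloc` options.

```python
from typing import Iterable, List


def salloc_args(partition: str, exclude: Iterable[str] = ()) -> List[str]:
    """Build an salloc argument list that skips known-bad nodes.

    Using --exclude (a deny-list) means newly added nodes are picked up
    automatically, unlike --nodelist, which must enumerate every good node.
    """
    args = ["salloc", f"--partition={partition}"]
    nodes = sorted(set(exclude))  # de-duplicate and keep the output stable
    if nodes:
        args.append("--exclude=" + ",".join(nodes))
    return args
```

For example, `salloc_args("mi325x", ["chi-mi325x-pod1-121"])` yields an argument list ending in `--exclude=chi-mi325x-pod1-121`, where `"mi325x"` is a placeholder partition name. Once ops fix the node, emptying the deny-list restores the default allocation behavior.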

github-actions · 2026-05-18T02:44:22Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009962520
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26009962520

github-actions · 2026-05-18T04:51:15Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26010696506
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26010696506

functionstackx · 2026-05-18T04:57:08Z

/reuse-sweep-run

github-actions · 2026-05-18T04:57:49Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26014367184
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26014367184

Update minimaxm2.5-fp8-mi325x-vllm vLLM ROCm image to v0.21.0

8edfd80

Update vLLM ROCm image from v0.18.0 (50d old) to v0.21.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

functionstackx requested a review from a team May 18, 2026 02:15

functionstackx added the full-sweep-enabled label May 18, 2026

functionstackx requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners May 18, 2026 02:15

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

chore: fill pr-link for #1469

80a0138

claude Bot reviewed May 18, 2026

This was referenced May 18, 2026

[Klaud Cold] Update kimik2.5-int4-mi325x-vllm vLLM ROCm image to v0.21.0 #1468

Merged

[Klaud Cold] runners(mi325x): exclude broken enroot node chi-mi325x-pod1-121 #1477

Merged

Merge remote-tracking branch 'origin/main' into HEAD

9e43e40

Merge branch 'main' into update-minimaxm2.5-fp8-mi325x-vllm-v0.21.0

127dc78

functionstackx merged commit 8a928f6 into main May 18, 2026
3 of 5 checks passed

functionstackx deleted the update-minimaxm2.5-fp8-mi325x-vllm-v0.21.0 branch May 18, 2026 04:57

github-project-automation Bot moved this to Done in InferenceMAX Board May 18, 2026

functionstackx merged 4 commits into main from update-minimaxm2.5-fp8-mi325x-vllm-v0.21.0
