
[Klaud Cold] Update minimaxm2.5-fp8-mi325x-vllm vLLM ROCm image to v0.21.0#1469

Merged
functionstackx merged 4 commits into main from update-minimaxm2.5-fp8-mi325x-vllm-v0.21.0
May 18, 2026

@functionstackx
Collaborator

Summary

Update vLLM ROCm image from v0.18.0 (50d old) to v0.21.0

Recipes touched: `minimaxm2.5-fp8-mi325x-vllm`

Test plan

  • full-sweep-enabled sweep passes.

🤖 Generated with Claude Code

Update vLLM ROCm image from v0.18.0 (50d old) to v0.21.0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM and SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring that they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


Contributor

@claude claude Bot left a comment

LGTM — straightforward vLLM ROCm image version bump with corresponding perf-changelog entry.

Extended reasoning...

Overview

This PR bumps the vLLM ROCm image for the minimaxm2.5-fp8-mi325x-vllm recipe from v0.18.0 to v0.21.0 in .github/configs/amd-master.yaml, and adds a matching entry to perf-changelog.yaml with the PR link. Only two files are touched and the diff is two single-line changes plus a new changelog block.

Security risks

None. The change is purely a container image tag bump in benchmarking config; no auth, crypto, permission, or user-input code is involved. The new tag (v0.21.0) is a published vLLM ROCm release and is already in use by sibling recipes in the same file (e.g. kimik2.5-fp4-mi355x-vllm, minimaxm2.5-fp8-mi300x-vllm, minimaxm2.5-fp4-mi355x-vllm).

Level of scrutiny

Low. This is mechanical full-sweep-enabled automation output — a benchmarking recipe image bump, following an established pattern across the file. Test plan is the full sweep itself, which gates correctness post-merge.

Other factors

No bugs were flagged by the hunting system, the changelog entry is well-formed and links back to this PR, and the new image version is already validated by adjacent recipes.

@functionstackx
Collaborator Author

Diagnosis

All 9 failed jobs show the same enroot container import failure, not an OOM error. During `enroot import` of `vllm/vllm-openai-rocm:v0.21.0`, the "Converting whiteouts" phase emits roughly 30 instances of `enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted` (exit code 34). The subsequent `srun --container-image=...` then fails with `pyxis: No such file or directory: .../vllm_vllm-openai-rocm_v0.21.0.sqsh` because the squash file was never created.
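The triage described above can be sketched as a small log classifier. This is only an illustration: the two enroot/pyxis signatures are copied from the diagnosis, while the function name `classify_failure`, the label strings, and the OOM patterns are assumptions made for this example and are not part of the repository.

```python
import re

# Signatures quoted in the diagnosis above.
SETCAP_ERROR = "enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted"
MISSING_SQSH = re.compile(r"pyxis: No such file or directory: \S*\.sqsh")
# Assumption: rough OOM hints, included only to contrast the two failure modes.
OOM_HINTS = re.compile(r"out of memory|oom-kill", re.IGNORECASE)


def classify_failure(log_text: str) -> str:
    """Return a coarse label for a failed job log (hypothetical helper)."""
    has_setcap_error = SETCAP_ERROR in log_text
    if has_setcap_error and MISSING_SQSH.search(log_text):
        # Import failed, so the squash file that pyxis expects was never written.
        return "enroot-import"
    if has_setcap_error:
        return "enroot-setcap-only"
    if OOM_HINTS.search(log_text):
        return "oom"
    return "unknown"
```

Calling `classify_failure` on the text of each failed job log would make it easy to confirm that all failures share one label before treating the problem as infrastructure rather than recipe related.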

This is not a vLLM or script issue -- the same v0.21.0 ROCm image passes on mi300x (#1405, merged) and mi355x (#1406-#1410, merged). Sibling mi325x PRs #1467 and #1468 (also v0.21.0) show the identical enroot failure, confirming that the problem is specific to the mi325x runner infrastructure (likely a missing `CAP_SETFCAP` or similar on chi-mi325x-pod1-121.ord.vultr.cpe.ice.amd.com).

Failed run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26009962520

No code fix pushed

No script/config change will fix this -- the enroot aufs2ovlfs converter needs the `CAP_SETFCAP` capability on the runner node. This requires runner infra intervention (e.g., updating the enroot config or Slurm SPANK plugin settings on the mi325x nodes). A re-run after the infra fix should pass without any recipe changes.

Recommendation

  1. Check if the mi325x runners recently had an OS/kernel/enroot update that dropped capabilities.
  2. Alternatively, check if ENROOT_ALLOW_SUPERUSER=y or --allow-superuser is set in enroot config on mi300x/mi355x but not mi325x.
  3. Once infra is fixed, re-run the sweep -- the script at benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh needs no changes for this failure mode.
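As one way to start on the first recommendation, the sketch below checks whether a process's effective capability set includes `CAP_SETFCAP` by parsing the `CapEff` line of `/proc/<pid>/status`. The bit index 31 comes from the Linux capability headers; the helper name `has_capability` is hypothetical. Whether the calling process's effective set is what actually governs `enroot-aufs2ovlfs` on these nodes is an assumption, so treat this as one place to look rather than a conclusive test.

```python
CAP_SETFCAP = 31  # bit index of CAP_SETFCAP in <linux/capability.h>


def has_capability(status_text: str, cap_bit: int = CAP_SETFCAP) -> bool:
    """Report whether the CapEff mask in /proc/<pid>/status text has cap_bit set.

    Hypothetical helper: status_text is the full contents of a status file.
    """
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The mask is printed as a hexadecimal string without a 0x prefix.
            mask = int(line.split(":", 1)[1].strip(), 16)
            return bool(mask & (1 << cap_bit))
    raise ValueError("no CapEff line found")
```

On a runner node, passing the contents of `/proc/self/status` to `has_capability` from inside the same job step that runs the import, on both a working node and the failing one, would show whether the capability sets differ.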

functionstackx added a commit that referenced this pull request May 18, 2026
Root-caused via the failed sweeps on #1467, #1468, #1469 (all three
[Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every
failure landed on chi-mi325x-pod1-121 with

  enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted

before the .sqsh import even completes; subsequent pyxis mount then
fails with "No such file or directory". The same image works cleanly
on every other up node (017/018/019/020/027) — confirmed not OOM and
not a recipe issue.

This matches the existing pattern for mi300x in #1462 (pin salloc away
from chronically-bad nodes); for mi325x there's currently only the one
node to exclude, so use --exclude rather than --nodelist so we don't
have to maintain the allow-list as nodes come and go.

pod1-121 has separately been drained on the controller with a watchdog
(per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix
the underlying setcap regression.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
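
The `--exclude` approach described in the commit message can be sketched as follows. This is a minimal illustration, not the launcher code from the repository: the function name `salloc_args` and the partition value are hypothetical, while `--partition` and `--exclude` are standard `salloc` options.

```python
from typing import Iterable, List


def salloc_args(partition: str, bad_nodes: Iterable[str]) -> List[str]:
    """Build an salloc argument list that skips known-bad nodes (hypothetical helper)."""
    args = ["salloc", f"--partition={partition}"]
    nodes = sorted(set(bad_nodes))
    if nodes:
        # A deny-list keeps working as healthy nodes join or leave the pool,
        # whereas a --nodelist allow-list would need updating each time.
        args.append("--exclude=" + ",".join(nodes))
    return args
```

For example, `salloc_args("mi325x", ["chi-mi325x-pod1-121"])` yields an allocation request that can land on any node except the one named in the commit message; the partition name here is a placeholder.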

@functionstackx
Collaborator Author

/reuse-sweep-run

@functionstackx functionstackx merged commit 8a928f6 into main May 18, 2026
3 of 5 checks passed
@functionstackx functionstackx deleted the update-minimaxm2.5-fp8-mi325x-vllm-v0.21.0 branch May 18, 2026 04:57