[Klaud Cold] Update minimaxm2.5-fp8-mi325x-vllm vLLM ROCm image to v0.21.0 #1469
Update vLLM ROCm image from v0.18.0 (50d old) to v0.21.0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If a re-run is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get a PR approval from their respective company's CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009960602 |
LGTM — straightforward vLLM ROCm image version bump with corresponding perf-changelog entry.
Extended reasoning...
Overview
This PR bumps the vLLM ROCm image for the minimaxm2.5-fp8-mi325x-vllm recipe from v0.18.0 to v0.21.0 in .github/configs/amd-master.yaml, and adds a matching entry to perf-changelog.yaml with the PR link. Only two files are touched and the diff is two single-line changes plus a new changelog block.
Security risks
None. The change is purely a container image tag bump in benchmarking config; no auth, crypto, permission, or user-input code is involved. The new tag (v0.21.0) is a published vLLM ROCm release and is already in use by sibling recipes in the same file (e.g. kimik2.5-fp4-mi355x-vllm, minimaxm2.5-fp8-mi300x-vllm, minimaxm2.5-fp4-mi355x-vllm).
Level of scrutiny
Low. This is mechanical full-sweep-enabled automation output — a benchmarking recipe image bump, following an established pattern across the file. Test plan is the full sweep itself, which gates correctness post-merge.
Other factors
No bugs were flagged by the hunting system, the changelog entry is well-formed and links back to this PR, and the new image version is already validated by adjacent recipes.
Diagnosis
All 9 failed jobs show the same enroot container import failure -- not an OOM error. During
This is not a vLLM or script issue -- the same v0.21.0 ROCm image passes on mi300x (#1405, merged) and mi355x (#1406-#1410, merged). Sibling mi325x PRs #1467 and #1468 (also v0.21.0) show the identical enroot failure, confirming this is mi325x runner infrastructure-specific (likely missing
Failed run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26009962520
No code fix pushed
No script/config change will fix this -- the enroot
Recommendation
|
Root-caused via the failed sweeps on #1467, #1468, #1469 (all three [Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every failure landed on chi-mi325x-pod1-121 with enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted before the .sqsh import even completes; subsequent pyxis mount then fails with "No such file or directory". The same image works cleanly on every other up node (017/018/019/020/027) — confirmed not OOM and not a recipe issue. This matches the existing pattern for mi300x in #1462 (pin salloc away from chronically-bad nodes); for mi325x there's currently only the one node to exclude, so use --exclude rather than --nodelist so we don't have to maintain the allow-list as nodes come and go. pod1-121 has separately been drained on the controller with a watchdog (per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix the underlying setcap regression. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
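The reasoning in the commit message above (all failures on one node, then excluding that node instead of maintaining an allow-list) can be sketched roughly as follows. This is only an illustration, not the repository's actual launcher code: `suspect_nodes`, `salloc_args`, the `good-node-*` names, and the `--nodes=1` base argument are all hypothetical. Only the node name `chi-mi325x-pod1-121` and Slurm's `--exclude` option come from the discussion.

```python
from collections import defaultdict


def suspect_nodes(results):
    """Return nodes that have at least one failed job and no passing job.

    `results` is an iterable of (node_name, passed) pairs.
    """
    failed = defaultdict(int)
    passed = defaultdict(int)
    for node, ok in results:
        if ok:
            passed[node] += 1
        else:
            failed[node] += 1
    return sorted(n for n in failed if passed.get(n, 0) == 0)


def salloc_args(base_args, excluded_nodes):
    """Append Slurm's --exclude option when there are nodes to avoid.

    Using a deny-list means newly added nodes are eligible by default,
    whereas a --nodelist allow-list would have to be updated each time.
    """
    args = list(base_args)
    if excluded_nodes:
        args.append("--exclude=" + ",".join(excluded_nodes))
    return args


# Hypothetical job outcomes: every failure lands on the same node.
results = [
    ("chi-mi325x-pod1-121", False),
    ("chi-mi325x-pod1-121", False),
    ("chi-mi325x-pod1-121", False),
    ("good-node-a", True),  # placeholder name, not a real host
    ("good-node-b", True),  # placeholder name, not a real host
]

bad = suspect_nodes(results)
cmd = salloc_args(["salloc", "--nodes=1"], bad)
print(" ".join(cmd))  # salloc --nodes=1 --exclude=chi-mi325x-pod1-121
```

With an empty exclusion list the helper leaves the command unchanged, so the same code path works once ops fix the node and it is removed from the list.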
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26009962520 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26010696506 |
|
/reuse-sweep-run |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26014367184 |
Summary
Update vLLM ROCm image from v0.18.0 (50d old) to v0.21.0
Recipes touched: `minimaxm2.5-fp8-mi325x-vllm`
Test plan
🤖 Generated with Claude Code