Update qwen3.5-bf16-mi300x-sglang SGLang image to v0.5.12-rocm720-mi30x #1426
Conversation
Ref #1154
Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you! PR authors are responsible for ensuring that, after merging, all GitHub Actions jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If a re-run is attempted, PR authors are responsible for ensuring that it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
LGTM, straightforward SGLang image version bump.
Overview
This PR bumps the SGLang image tag for the qwen3.5-bf16-mi300x-sglang config in .github/configs/amd-master.yaml from v0.5.10-rocm720-mi30x to v0.5.12-rocm720-mi30x and adds a corresponding entry to perf-changelog.yaml. It mirrors the pattern of several recent image-bump PRs (e.g. #1402, #1408, #1409) merged to main.
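The tag bump described above amounts to a one-string substitution. A minimal sketch, assuming a config line of the form `image: <registry path>:<tag>`; the registry path `example.invalid/sglang` is a placeholder, since the real path is not shown in this conversation:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Old and new tags as stated in the PR.
OLD_TAG="v0.5.10-rocm720-mi30x"
NEW_TAG="v0.5.12-rocm720-mi30x"

# Hypothetical config line: the registry path is a placeholder,
# not the value used in .github/configs/amd-master.yaml.
LINE="  image: example.invalid/sglang:${OLD_TAG}"

# Replace the first occurrence of the old tag with the new one.
BUMPED="${LINE/${OLD_TAG}/${NEW_TAG}}"
echo "$BUMPED"
```

Only the tag changes; every other character on the line, including the indentation, is preserved.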
Security risks
None. This is a single image tag string change in a benchmark config plus a YAML changelog entry. No code paths, auth, secrets, or permissions are touched.
Level of scrutiny
Low. The change is mechanical, isolated to a single config block (only the image: field is touched, all sweep parameters are unchanged), and follows the established convention in this repo. The image tag follows the same naming scheme as the prior pin and as the sibling qwen3.5-bf16-mi325x-sglang entry, so the only real risk is whether the new SGLang version itself behaves correctly at runtime — which the full-sweep CI will exercise (the full-sweep-enabled label is set).
Other factors
The pr-link: XXX placeholder is consistent with the existing entry at line 2502 of perf-changelog.yaml, so this is the project's convention for auto-generated changelog stubs. No bugs were flagged by the bug-hunting system, no outstanding reviewer comments, and the timeline contains only duplicated recipe-reminder bot messages.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25980021895 |
# Conflicts:
#	perf-changelog.yaml
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25984517234 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25984576195 |
Three of the nine mi300x compute nodes are currently unusable:
- chi-mi300x-033, chi-mi300x-037: down (Not responding)
- chi-mi300x-049: drained for persistent /nvme_home disk-full
(kept down by a watchdog re-applying State=DOWN every 10s)
Without a nodelist filter, salloc sometimes lands a job on a node
that's about to be drained or that has a half-extracted enroot dir,
causing 'pyxis: failed to create container filesystem (No space left
on device)' / 'srun: Node failure' / 'manifest unknown'-style errors
visible in PRs #1426 and #1403.
Add an explicit --nodelist of the 6 healthy nodes (mirroring how
runners/launch_b300-nv.sh:336 pins to the known-good B300 set).
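The pinning described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the contents of the actual runner script: the three unusable node names come from the message above, while `node-a` and `node-b` are hypothetical stand-ins for healthy nodes, whose real names are not listed here. The remaining `salloc` flags are omitted because the message does not show them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Nodes reported unusable in the commit message.
BAD_NODES=(chi-mi300x-033 chi-mi300x-037 chi-mi300x-049)

# Hypothetical candidate pool: node-a and node-b are placeholders,
# not real cluster node names.
ALL_NODES=(chi-mi300x-033 chi-mi300x-037 chi-mi300x-049 node-a node-b)

# Print every candidate that is not in BAD_NODES, comma-separated.
healthy_nodelist() {
  local node bad skip out=()
  for node in "${ALL_NODES[@]}"; do
    skip=0
    for bad in "${BAD_NODES[@]}"; do
      if [[ "$node" == "$bad" ]]; then skip=1; fi
    done
    if [[ "$skip" -eq 0 ]]; then out+=("$node"); fi
  done
  local IFS=,
  echo "${out[*]}"
}

NODELIST="$(healthy_nodelist)"

# Echo the command instead of running it, so the sketch works without Slurm.
echo "salloc --nodelist=${NODELIST}"
```

Keeping the unusable nodes in one array means a node can be returned to service by deleting a single entry once it is healthy again.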
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26008642156 |
|
/reuse-sweep-run |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016677916 |
Updates SGLang image for qwen3.5-bf16-mi300x-sglang from v0.5.10-rocm720-mi30x to v0.5.12-rocm720-mi30x.
Ref #1154
Generated with Claude Code