Update qwen3.5-bf16-mi300x-sglang SGLang image to v0.5.12-rocm720-mi30x by Klaud-Cold · Pull Request #1426 · SemiAnalysisAI/InferenceX

Klaud-Cold · 2026-05-17T03:16:29Z

Updates SGLang image for qwen3.5-bf16-mi300x-sglang from v0.5.10-rocm720-mi30x to v0.5.12-rocm720-mi30x.
Ref #1154

Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>

github-actions · 2026-05-17T03:16:37Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first, before we merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude

LGTM, straightforward SGLang image version bump.

Extended reasoning...

Overview

This PR bumps the SGLang image tag for the qwen3.5-bf16-mi300x-sglang config in .github/configs/amd-master.yaml from v0.5.10-rocm720-mi30x to v0.5.12-rocm720-mi30x and adds a corresponding entry to perf-changelog.yaml. It mirrors the pattern of several recent image-bump PRs (e.g. #1402, #1408, #1409) merged to main.
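To make the shape of such a bump concrete, here is a minimal Python sketch of changing the tag on the `image:` line of one named config block while leaving every other block alone. The YAML layout, the `example/sglang` repository name, the `other-config` block, and the `bump_image_tag` helper are all assumptions for illustration; they are not taken from `.github/configs/amd-master.yaml`.

```python
import re

# Hypothetical config text; the real file layout may differ.
SAMPLE = """\
qwen3.5-bf16-mi300x-sglang:
  image: example/sglang:v0.5.10-rocm720-mi30x
  model: placeholder
other-config:
  image: example/sglang:v0.5.10-rocm720-mi30x
"""


def bump_image_tag(yaml_text: str, config_name: str, old_tag: str, new_tag: str) -> str:
    """Replace old_tag with new_tag on the image: line of one top-level block."""
    lines = yaml_text.splitlines(keepends=True)
    in_block = False
    replaced = 0
    for i, line in enumerate(lines):
        # Treat an unindented line ending in ':' as the start of a new block.
        if line and not line[0].isspace() and line.rstrip().endswith(":"):
            in_block = line.rstrip()[:-1] == config_name
            continue
        # Only touch the indented image: line inside the requested block.
        if in_block and re.match(r"\s+image:\s", line) and old_tag in line:
            lines[i] = line.replace(old_tag, new_tag)
            replaced += 1
    if replaced != 1:
        raise ValueError(f"expected exactly one image line to change, changed {replaced}")
    return "".join(lines)


updated = bump_image_tag(
    SAMPLE,
    "qwen3.5-bf16-mi300x-sglang",
    "v0.5.10-rocm720-mi30x",
    "v0.5.12-rocm720-mi30x",
)
```

Raising when the number of changed lines is not exactly one is a deliberate choice in this sketch: a mechanical bump that silently touches zero or several entries is harder to review than one that fails loudly.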

Security risks

None. This is a single image tag string change in a benchmark config plus a YAML changelog entry. No code paths, auth, secrets, or permissions are touched.

Level of scrutiny

Low. The change is mechanical, isolated to a single config block (only the image: field is touched; all sweep parameters are unchanged), and follows the established convention in this repo. The image tag follows the same naming scheme as the prior pin and as the sibling qwen3.5-bf16-mi325x-sglang entry, so the only real risk is whether the new SGLang version itself behaves correctly at runtime, which the full-sweep CI will exercise (the full-sweep-enabled label is set).

Other factors

The pr-link: XXX placeholder is consistent with the existing entry at line 2502 of perf-changelog.yaml, so this is the project's convention for auto-generated changelog stubs. No bugs were flagged by the bug-hunting system, there are no outstanding reviewer comments, and the timeline contains only duplicated recipe-reminder bot messages.

github-actions · 2026-05-17T03:32:59Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25980021895
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25980021895

# Conflicts: # perf-changelog.yaml

github-actions · 2026-05-17T07:24:12Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25980021895
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25980021895

github-actions · 2026-05-17T07:24:39Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25984517234
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25984517234

github-actions · 2026-05-17T08:56:40Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25984576195
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25984576195

Three of the nine mi300x compute nodes are currently unusable:
- chi-mi300x-033, chi-mi300x-037: down (Not responding)
- chi-mi300x-049: drained for persistent /nvme_home disk-full (kept down by a watchdog re-applying State=DOWN every 10s)

Without a nodelist filter, salloc sometimes lands a job on a node that's about to be drained or that has a half-extracted enroot dir, causing 'pyxis: failed to create container filesystem (No space left on device)' / 'srun: Node failure' / 'manifest unknown'-style errors visible in PRs #1426 and #1403.

Add an explicit --nodelist of the 6 healthy nodes (mirroring how runners/launch_b300-nv.sh:336 pins to the known-good B300 set).
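The filtering described above can be sketched in Python as follows. This is an illustration only: the three unusable node names come from the message above, but the six `placeholder-node-*` names, the `healthy_nodes` and `nodelist_flag` helpers, and the inventory list are hypothetical, and the actual runner script may build its list differently.

```python
from typing import Iterable, List

# Nodes reported as unusable in the message above.
UNUSABLE = ["chi-mi300x-033", "chi-mi300x-037", "chi-mi300x-049"]

# Hypothetical inventory: the real names of the six healthy nodes are not
# given in the source, so placeholders stand in for them here.
INVENTORY = UNUSABLE + [f"placeholder-node-{i}" for i in range(1, 7)]


def healthy_nodes(inventory: Iterable[str], unusable: Iterable[str]) -> List[str]:
    """Return inventory nodes not marked unusable, in order, without duplicates."""
    bad = set(unusable)
    seen = set()
    kept = []
    for node in inventory:
        if node in bad or node in seen:
            continue
        seen.add(node)
        kept.append(node)
    return kept


def nodelist_flag(nodes: List[str]) -> str:
    """Format a comma-separated node list as a --nodelist option string."""
    if not nodes:
        raise ValueError("no healthy nodes left to pin to")
    return "--nodelist=" + ",".join(nodes)


flag = nodelist_flag(healthy_nodes(INVENTORY, UNUSABLE))
```

The sketch only produces the option string. How Slurm combines the listed hosts with the requested node count is outside its scope and should be checked against the salloc documentation.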

github-actions · 2026-05-18T01:27:06Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25984576195
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=25984576195

github-actions · 2026-05-18T06:03:47Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26008642156
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26008642156

functionstackx · 2026-05-18T06:10:29Z

/reuse-sweep-run

github-actions · 2026-05-18T06:11:23Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016677916
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26016677916

Update qwen3.5-bf16-mi300x-sglang SGLang image to v0.5.12-rocm720-mi30x

ff03ac3

Ref #1154

Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>

Klaud-Cold requested a review from a team May 17, 2026 03:16

Klaud-Cold added the full-sweep-enabled label May 17, 2026

Klaud-Cold requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners May 17, 2026 03:16

github-project-automation Bot added this to InferenceMAX Board May 17, 2026

Klaud-Cold mentioned this pull request May 17, 2026

[Auto] Docker Image Updates Available - 2026-04-25 #1154

Open

claude Bot reviewed May 17, 2026

Merge remote-tracking branch 'origin/main' into HEAD

2dedf57

# Conflicts: # perf-changelog.yaml

functionstackx added full-sweep-enabled and removed full-sweep-enabled labels May 17, 2026

functionstackx mentioned this pull request May 18, 2026

[Klaud Cold] runners(mi300x): pin salloc to known-good nodes #1462

Merged

2 tasks

claude-fix-bot added 2 commits May 17, 2026 21:25

Merge main + resolve changelog

4aef6a3

fix(perf-changelog): restore from main + reappend PR entry

2f60886

Merge branch 'main' into claude/issue-1154-qwen3.5-bf16-mi300x-sglang

e04735c

functionstackx merged commit 97ac477 into main May 18, 2026
3 of 5 checks passed

functionstackx deleted the claude/issue-1154-qwen3.5-bf16-mi300x-sglang branch May 18, 2026 06:10

github-project-automation Bot moved this to Done in InferenceMAX Board May 18, 2026

This was referenced May 18, 2026

[Klaud Cold] Update dsr1-fp8-h200-trt (+mtp) TRT-LLM image to v1.3.0rc14 #1487

Open

[Klaud Cold] Update dsr1-fp4-b200-trt (+mtp) TRT-LLM image to v1.3.0rc14 #1489

Open

[Klaud Cold] Update gptoss-fp4-b200-trt TRT-LLM image to v1.3.0rc14 #1490

Open
