Update dsr1-fp4-mi355x-sglang SGLang image to v0.5.12-rocm700-mi35x#1431
Conversation
Ref #1154 Co-authored-by: Klaud Cold <Klaud-Cold@users.noreply.github.com>
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
1 similar comment
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
| - config-keys: | ||
| - dsr1-fp4-mi355x-sglang | ||
| description: | ||
| - "Update SGLang image from v0.5.9-rocm700-mi35x to v0.5.12-rocm700-mi35x" | ||
| pr-link: XXX |
There was a problem hiding this comment.
🟡 The new entry added to perf-changelog.yaml (line 2556) uses pr-link: XXX as a placeholder instead of an actual PR URL. All preceding entries follow the convention https://github.com/SemiAnalysisAI/InferenceX/pull/<num>, so this should be updated to https://github.com/SemiAnalysisAI/InferenceX/pull/1431 before merge.
Extended reasoning...
What's wrong
The diff adds a new entry to perf-changelog.yaml:
- config-keys:
- dsr1-fp4-mi355x-sglang
description:
- "Update SGLang image from v0.5.9-rocm700-mi35x to v0.5.12-rocm700-mi35x"
pr-link: XXXThe pr-link: XXX value is clearly a placeholder that was never filled in.
Why this breaks the file convention
Every preceding entry in perf-changelog.yaml follows a consistent pattern of using a full GitHub PR URL. Looking at the immediately preceding entries:
- Line 2526:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/<num> - Line 2532:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/<num> - Line 2538:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/<num> - Line 2544:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/<num> - Line 2550:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1402(the entry directly above the new one)
The placeholder XXX is not a valid URL and provides no traceability back to the originating PR.
Step-by-step proof
- Open
perf-changelog.yamland scroll to the bottom of the file. - Observe lines 2552-2556 which are newly added by this PR.
- At line 2556, the value of
pr-linkis the literal stringXXX. - Compare against any prior entry (e.g., line 2550): the value is a fully-qualified GitHub PR URL.
- Anyone (or any tool) attempting to follow this link or programmatically resolve PR numbers from the changelog will fail on this entry.
Impact
This is a documentation/changelog correctness issue only — it does not affect runtime behavior of the framework, deployment, or sweeps. However, the changelog is used to trace which PR caused which performance change, so a missing/placeholder link defeats the purpose of the entry.
Fix
Replace pr-link: XXX on line 2556 with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1431 before merging.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25980026044 |
…Tokenizer.all_special_tokens_extended crash) The sglang v0.5.12 container bundles a newer transformers library where LlamaTokenizer.all_special_tokens_extended was removed. vLLM's get_cached_tokenizer (called from vllm.transformers_utils.tokenizer.get_tokenizer) probes that attribute and crashes the benchmark client before any request is sent. Swap the import preference order so backend_request_func.get_tokenizer (this repo's HuggingFace AutoTokenizer wrapper, with its own _fix_tokenizer_for_sglang post-hoc patcher) is tried first. The vLLM import remains as a fallback for environments without backend_request_func.
) Root-caused via the failed sweeps on #1431, #1432, #1440, #1441, #1443 — every failure landed on either: mia1-p01-g09 pyxis: failed to create container filesystem (extended attributes not supported on the destination filesystem; pyxis can't mount the squashfs) mia1-p01-g11 permission denied while trying to connect to docker.sock (cluster-cleanup `docker stop` step fails; cascading into pyxis-init failure) Both are already known-bad per KLAUD_DEBUG.md §5.1 / §5.2, but the launcher wasn't excluding them. This mirrors the existing pattern in runners/launch_mi300x-amds.sh (#1462 — pin to known-good nodes) and runners/launch_mi325x-amds.sh (#1477 — exclude chi-mi325x-pod1-121). Once this lands the 5 affected mi355x PRs can be rebased to pick it up and the failed jobs will land on healthy nodes only. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # perf-changelog.yaml
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26008680533 |
The agentic-coding matrix for dsr1-fp4-mi355x-sglang fans out into 10 matrix jobs (conc-list spans 1..256), which roughly doubles this image bump PR's sweep cost without adding signal — the bump only needs the fixed-seq-len throughput shape to validate. Re-enable after the PR merges; the next agentic cron PR will pick it up.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26046376775 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26047312829 |
|
/reuse-sweep-run |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26073210292 |
Updates SGLang image for
dsr1-fp4-mi355x-sglangfrom v0.5.9-rocm700-mi35x to v0.5.12-rocm700-mi35x.\nRef #1154
Generated with Claude Code