[AMD][ROCM] gptoss-fp4-mi355x-atom: Bump image to rocm/atom:rocm7.2.3_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom20260511#1412
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```yaml
description:
  - "Bump ATOM image from rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post to rocm/atom-dev:nightly_202605111702"
  - "ATOM upstream benchmark shows +8% to +20% throughput improvement vs InferenceX baselines (1 GPU, ISL=1024/8192, OSL=1024)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER
```
🔴 The new perf-changelog entry's pr-link is https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER — the template token was never substituted with the actual PR number (1412). The link 404s as written and breaks any tooling that parses these entries to map config-keys to PRs. Fix by replacing PLACEHOLDER with 1412 before merge.
Extended reasoning
What the bug is
`perf-changelog.yaml:2502` contains:

`pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER`

The literal string `PLACEHOLDER` was clearly meant to be substituted with this PR's number (1412) before opening the PR, but the substitution never happened. Per AGENTS.md, the template token for this field is `XXX` (e.g. `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX`); the author appears to have hand-edited the template to use `PLACEHOLDER` instead, then forgot to swap in the real number.
Why existing code doesn't prevent it
There is no schema validator or pre-merge check on `perf-changelog.yaml` that asserts `pr-link` ends in a numeric path component. All 295 other entries in the file use real PR IDs (e.g. the immediately preceding entry uses `/pull/1271`), but that's convention, not enforcement.
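Such a check would be easy to add. As a hedged sketch (the line-oriented scan and the exact `pr-link:` layout are assumptions about the file, not the repo's actual schema), a minimal pre-merge validator could flag any `pr-link` whose final path component is not an integer:

```python
import re

# Assumption: every changelog entry carries a line of the form
#   pr-link: https://github.com/<org>/<repo>/pull/<number>
# and a valid link must end in an integer PR id.
PR_LINK_RE = re.compile(
    r"^\s*pr-link:\s*https://github\.com/[\w.-]+/[\w.-]+/pull/(\d+)\s*$"
)

def find_bad_links(lines):
    """Return (line_number, text) pairs for pr-link lines whose final
    path component is not an integer, suitable for failing a CI job."""
    return [
        (n, line.strip())
        for n, line in enumerate(lines, start=1)
        if "pr-link:" in line and not PR_LINK_RE.match(line)
    ]

# A leftover template token is flagged; a real PR id passes.
sample = [
    "pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1271",
    "pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER",
]
print(find_bad_links(sample))
```

Wired into CI as a required status check, this would turn the silent template-token mistake into a hard failure before merge.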
Impact
- The URL `https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER` 404s — anyone clicking through from the changelog entry to find the PR/discussion that introduced the `gptoss-fp4-mi355x-atom` image bump won't reach it.
- Any tooling that parses these entries to compute a config-key → PR-number mapping (e.g. for changelog rendering, regression bisection, attribution) will either crash on the non-integer suffix or silently record a bogus mapping.
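To illustrate the crash mode, here is a hypothetical sketch of the kind of mapping tooling the review describes (the entry shape and the field names `config-key`/`pr-link` are assumptions, not the repo's actual schema):

```python
def build_pr_map(entries):
    """Build a config-key -> PR-number map from changelog entries."""
    mapping = {}
    for entry in entries:
        # Take the final path component of the pr-link and require an
        # integer; a leftover template token raises ValueError here.
        suffix = entry["pr-link"].rsplit("/", 1)[-1]
        mapping[entry["config-key"]] = int(suffix)
    return mapping

entries = [
    {"config-key": "some-other-config",
     "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/1271"},
    {"config-key": "gptoss-fp4-mi355x-atom",
     "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER"},
]
try:
    build_pr_map(entries)
except ValueError as e:
    print("mapping failed:", e)
```

A more forgiving parser that skipped the `int()` conversion would instead record the bogus string `"PLACEHOLDER"` as the PR id, which is the "silently record a bogus mapping" failure above.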
How to fix
Replace the literal `PLACEHOLDER` on line 2502 with `1412` (this PR's number):

`pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1412`

Step-by-step proof
1. PR metadata shows this is PR #1412 (`<pr number="1412">` in the PR header).
2. `perf-changelog.yaml:2502` in the diff contains `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER`.
3. Constructing the URL `https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER` — GitHub's pull-request URL parser expects an integer at that path position; `PLACEHOLDER` is not an integer, so the route returns 404.
4. Compare to the immediately preceding entry at lines 2493–2496, which correctly uses `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1271` — that link resolves to the actual PR.
5. The correct value is therefore `/pull/1412`, and the fix is a one-token edit on line 2502.
Summary
- Bump the `gptoss-fp4-mi355x-atom` image from `rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post` to `rocm/atom-dev:nightly_202605111702`.

Performance (ATOM upstream vs InferenceX baseline, TP=1, 1 GPU, fp4, MI355X)
- InferenceX baseline: `rocm/atom:rocm7.1.1-ubuntu24.04-pytorch2.9-atom0.1.1-MI350x` (2026-01-14)
- ATOM upstream run: https://github.com/ROCm/ATOM/actions/runs/25686894636 (2026-05-11)
Test plan
🤖 Generated with Claude Code