[AMD][ROCM] gptoss-fp4-mi355x-atom: Bump image to rocm/atom:rocm7.2.3_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom20260511 #1412

Open

seungrokj wants to merge 5 commits into main from srok/atom_gptoss_fp4_mi355x

Conversation

@seungrokj
Collaborator

Summary

  • Bump ATOM image from rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post to rocm/atom-dev:nightly_202605111702
  • Add perf-changelog entry for gptoss-fp4-mi355x-atom

Performance (ATOM Upstream vs InferenceX baseline, TP=1, 1 GPU, fp4, MI355X)

| ISL  | OSL  | Conc | InferenceX (tok/s) | ATOM Upstream (tok/s) | Diff % |
|------|------|------|--------------------|-----------------------|--------|
| 1024 | 1024 | 16   | 4808.49            | 5757.55               | +19.7% |
| 1024 | 1024 | 32   | 7537.19            | 8869.83               | +17.7% |
| 1024 | 1024 | 64   | 11737.08           | 13566.85              | +15.6% |
| 1024 | 1024 | 128  | 17577.22           | 19234.78              | +9.4%  |
| 8192 | 1024 | 4    | 7563.76            | 8183.21               | +8.2%  |
| 8192 | 1024 | 8    | 12188.56           | 13409.95              | +10.0% |
| 8192 | 1024 | 16   | 18128.47           | 20788.08              | +14.7% |
| 8192 | 1024 | 32   | 25815.80           | 29588.12              | +14.6% |
| 8192 | 1024 | 64   | 35913.41           | 40886.65              | +13.8% |
| 8192 | 1024 | 128  | 44999.29           | 50888.02              | +13.1% |

InferenceX baseline: rocm/atom:rocm7.1.1-ubuntu24.04-pytorch2.9-atom0.1.1-MI350x (2026-01-14)
ATOM upstream run: https://github.com/ROCm/ATOM/actions/runs/25686894636 (2026-05-11)
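
For reference, the Diff % column is just the relative throughput gain. A minimal sketch recomputing it from two of the rows above, assuming Diff % = (ATOM - InferenceX) / InferenceX:

```python
# Recompute Diff % = (atom - baseline) / baseline for two sample rows
# from the table above (values copied verbatim).
rows = [
    # (ISL, OSL, Conc, InferenceX tok/s, ATOM tok/s)
    (1024, 1024, 16, 4808.49, 5757.55),
    (8192, 1024, 128, 44999.29, 50888.02),
]

for isl, osl, conc, baseline, atom in rows:
    diff_pct = (atom - baseline) / baseline * 100
    print(f"ISL={isl} OSL={osl} Conc={conc}: {diff_pct:+.1f}%")

# ISL=1024 OSL=1024 Conc=16: +19.7%
# ISL=8192 OSL=1024 Conc=128: +13.1%
```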

Test plan

  • Verify benchmark runs with new image on mi355x runner
  • Verify throughput improvement vs baseline

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that, after merging, all GitHub Actions jobs fully pass. Often, failures are just flakes, and simply re-running the failed jobs will fix them; if re-running is attempted, PR authors remain responsible for ensuring the jobs pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

3 similar comments

seungrokj changed the title from "[AMD][ROCM] gptoss-fp4-mi355x-atom: bump ATOM image to nightly_202605111702" to "[AMD][ROCM] gptoss-fp4-mi355x-atom: Bump image to rocm/atom:rocm7.2.3_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom20260511" on May 16, 2026
Comment thread on perf-changelog.yaml (Outdated)

    description:
    - "Bump ATOM image from rocm/atom:rocm7.2.2_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.2.post to rocm/atom-dev:nightly_202605111702"
    - "ATOM upstream benchmark shows +8% to +20% throughput improvement vs InferenceX baselines (1 GPU, ISL=1024/8192, OSL=1024)"
    pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER
Contributor


🔴 The new perf-changelog entry's pr-link is https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER — the template token was never substituted with the actual PR number (1412). The link 404s as written and breaks any tooling that parses these entries to map config-keys to PRs. Fix by replacing PLACEHOLDER with 1412 before merge.

Extended reasoning...

What the bug is

perf-changelog.yaml:2502 contains:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER

The literal string PLACEHOLDER was clearly meant to be substituted with this PR's number (1412) before opening the PR, but the substitution never happened. Per AGENTS.md, the template token for this field is XXX (e.g. pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX); the author appears to have hand-edited the template to use PLACEHOLDER instead, then forgot to swap in the real number.

Why existing code doesn't prevent it

There is no schema validator or pre-merge check on perf-changelog.yaml that asserts pr-link ends in a numeric path component. All 295 other entries in the file use real PR IDs (e.g. the immediately preceding entry uses /pull/1271), but that's convention, not enforcement.
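
A minimal sketch of such a check (hypothetical, not existing repo tooling; it assumes perf-changelog.yaml parses as a top-level YAML list of entry mappings, which the real schema may not match):

```python
#!/usr/bin/env python3
"""Hypothetical pre-merge check: every pr-link in perf-changelog.yaml
must end in an integer PR number. Assumes the file is a top-level YAML
list of entry mappings; adjust if the real schema nests entries."""
import re
import sys

import yaml  # PyYAML

PR_LINK = re.compile(r"https://github\.com/SemiAnalysisAI/InferenceX/pull/\d+")


def main(path="perf-changelog.yaml"):
    with open(path) as f:
        entries = yaml.safe_load(f)
    # Collect every entry whose pr-link is missing or non-numeric.
    bad = [e.get("pr-link", "<missing>") for e in entries
           if not PR_LINK.fullmatch(e.get("pr-link", ""))]
    for link in bad:
        print(f"invalid pr-link: {link}", file=sys.stderr)
    return 1 if bad else 0


if __name__ == "__main__":
    sys.exit(main())
```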

Impact

  1. The URL https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER 404s — anyone clicking through from the changelog entry to find the PR/discussion that introduced the gptoss-fp4-mi355x-atom image bump won't reach it.
  2. Any tooling that parses these entries to compute a config-key → PR-number mapping (e.g. for changelog rendering, regression bisection, attribution) will either crash on the non-integer suffix or silently record a bogus mapping.
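
To make the second failure mode concrete, here is a sketch of what a naive config-key to PR-number mapper would do with this entry (pr_number is a hypothetical helper, not actual repo tooling):

```python
from urllib.parse import urlparse


def pr_number(link: str) -> int:
    """Hypothetical helper: extract the PR id from a pr-link URL."""
    return int(urlparse(link).path.rsplit("/", 1)[-1])


print(pr_number("https://github.com/SemiAnalysisAI/InferenceX/pull/1271"))  # 1271
pr_number("https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER")
# ValueError: invalid literal for int() with base 10: 'PLACEHOLDER'
```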

How to fix

Replace the literal PLACEHOLDER on line 2502 with 1412 (this PR's number):

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1412

Step-by-step proof

  1. PR metadata shows this is PR #1412 (<pr number="1412"> in the PR header).
  2. perf-changelog.yaml:2502 in the diff contains pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER.
  3. Constructing the URL: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER — GitHub's pull-request URL parser expects an integer at that path position; PLACEHOLDER is not an integer, so the route returns 404.
  4. Compare to the immediately preceding entry at line 2493–2496 which correctly uses pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1271 — that link resolves to the actual PR.
  5. The correct value is therefore /pull/1412, and the fix is a single-token replacement (PLACEHOLDER → 1412) on line 2502.

@seungrokj seungrokj added the AMD label May 16, 2026