Skip to content

Update dpskv4 GB300 MTP disagg SGLang image to nightly-20260519#1478

Open
yhyang201 wants to merge 2 commits into
mainfrom
dpskv4-gb300-sglang-v0.5.12
Open

Update dpskv4 GB300 MTP disagg SGLang image to nightly-20260519#1478
yhyang201 wants to merge 2 commits into
mainfrom
dpskv4-gb300-sglang-v0.5.12

Conversation

@yhyang201
Copy link
Copy Markdown
Collaborator

Summary

  • Update SGLang container image from nightly-dev-cu13-20260510-2473659e to v0.5.12 for all MTP configs (mid-curve and low-latency)
  • Remove obsolete/redundant environment variables that no longer exist in sglang release/v0.5.12 or whose defaults already match the set values

@github-actions
Copy link
Copy Markdown
Contributor

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 perf-changelog.yaml:2661 — The new perf-changelog.yaml entry sets pr-link to .../pull/1477, but this is PR #1478 (PR #1477 is an unrelated change). Likely a copy/paste typo — please update the link to https://github.com/SemiAnalysisAI/InferenceX/pull/1478 so the changelog correctly attributes this change.

    Extended reasoning...

    What the bug is

    The new entry appended to perf-changelog.yaml (lines 2655–2661 in the post-PR file) sets:

    - config-keys:
        - dpskv4-gb300-sglang
      description:
        - "Update SGLang image to v0.5.12 and remove obsolete env vars for MTP configs"
      pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1477

    However, this PR is #1478, not #1477. PR #1477 is a separate, unrelated change. The pr-link should point to #1478 so that the changelog entry correctly identifies the PR that introduced the dpskv4-gb300-sglang image bump.

    How it manifests / why existing code doesn't prevent it

    perf-changelog.yaml is descriptive metadata used for provenance/audit — there is no schema validator that cross-references pr-link against the PR number in which the entry is added, so the typo passes CI silently. The convention is established by every prior entry though: each pr-link matches the PR that merged it (e.g., the immediately preceding entry on line ~2649 references /pull/1458, which matches commit 3542221 [Klaud Cold] Update qwen3.5-fp8-h200-sglang SGLang image to v0.5.12-cu130 (#1458)).

    Step-by-step proof

    1. PR metadata for this review unambiguously identifies the PR as <pr number='1478'>.
    2. The diff for perf-changelog.yaml adds one new entry at the bottom whose only externally-resolvable identifier is pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1477.
    3. 1477 ≠ 1478, so a reader who clicks the link from the changelog will land on a different PR that does not describe this image bump.
    4. Compare to the previous three entries in the same file:
      • line ~2643 → /pull/1427
      • line ~2649 → /pull/1459
      • line ~2655 → /pull/1458 (corresponds to recent commit 3542221)
        In every case, the pr-link matches the PR that introduced the entry. This entry breaks that convention.

    Impact

    Low — no runtime/CI impact, since process_changelog.py (and consumers of the YAML) don't validate pr-link. The cost is purely traceability: anyone investigating why dpskv4-gb300-sglang moved to v0.5.12 will follow the link to the wrong PR.

    Fix

    Change line 2661 from:

      pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1477
    

    to:

      pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1478
    
  • 🔴 perf-changelog.yaml:2657-2661 — The new perf-changelog entry uses config-key dpskv4-gb300-sglang, but no such key exists in .github/configs/nvidia-master.yaml — the actual MTP config is dsv4-fp4-gb300-dynamo-sglang-mtp (line 8803), and every other dsv4 changelog entry uses the dsv4- prefix. Since utils/process_changelog.py:get_config_keys_from_master does an exact-match lookup for non-wildcard keys and raises ValueError('Config key {key} not found in master configs.'), this entry will hard-fail the changelog validation in the run-sweep workflow on push-to-main and won't drive any benchmark sweep. The key should be dsv4-fp4-gb300-dynamo-sglang-mtp (or wildcarded as dsv4-fp4-gb300-dynamo-sglang*).

    Extended reasoning...

    What the bug is. The perf-changelog entry added at perf-changelog.yaml:2657-2661 uses config-keys: [dpskv4-gb300-sglang]. The string dpskv4 (with a p) appears nowhere else in the repository — not in .github/configs/nvidia-master.yaml and not in any other changelog entry. The actual MTP config for the recipe files touched by this PR is dsv4-fp4-gb300-dynamo-sglang-mtp at line 8803 of .github/configs/nvidia-master.yaml (with the non-MTP sibling dsv4-fp4-gb300-dynamo-sglang at line 8703). Every prior dsv4 changelog entry (e.g. lines 2018, 2272-2273, 2280-2281) uses the dsv4-fp4-gb300-dynamo-sglang / -mtp keys consistently.\n\nWhy it breaks CI. utils/process_changelog.py:get_config_keys_from_master (lines 85-99) iterates the config-keys list. For a key without * it does an exact-match elif key not in master_config: raise ValueError(f"Config key {key} not found in master configs."). There is no fallback or fuzzy match. The .github/workflows/run-sweep.yml workflow runs process_changelog.py against perf-changelog.yaml on changes/push to main, so this entry will hard-fail validation rather than silently noop.\n\nImpact. Two compounding effects: (1) the changelog-driven sweep tooling errors out on this entry, blocking the workflow run that the entry was meant to trigger; (2) even if the failure were tolerated, the entry would not drive benchmarks for the MTP configs it is meant to cover, so the image bump from nightly-dev-cu13-20260510-2473659ev0.5.12 (the substantive change in this PR) would go un-benchmarked.\n\nStep-by-step proof.\n1. process_changelog.py loads perf-changelog.yaml and calls get_config_keys_from_master with config_keys = ['dpskv4-gb300-sglang'] and master_config = the dict loaded from nivdia-master.yaml.\n2. The loop reaches key = 'dpskv4-gb300-sglang'. It contains no *, so the wildcard branch is skipped.\n3. 'dpskv4-gb300-sglang' not in master_config evaluates to True (master_config contains dsv4-fp4-gb300-dynamo-sglang and dsv4-fp4-gb300-dynamo-sglang-mtp, neither of which equals the queried key).\n4. raise ValueError("Config key dpskv4-gb300-sglang not found in master configs.") fires, aborting the workflow step.\n\nFix. Change the config-keys value to the real config name. Since this PR only touches MTP recipes, the precise key is dsv4-fp4-gb300-dynamo-sglang-mtp. If both MTP and non-MTP are intended to share this entry, use the wildcard dsv4-fp4-gb300-dynamo-sglang* (the same form used elsewhere in this file for matching both variants).

model:
path: "deepseek-v4-pro"
container: "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e"
container: "lmsysorg/sglang:v0.5.12"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The PR bumps model.container to lmsysorg/sglang:v0.5.12 in all six MTP recipe YAMLs, but the matching entry in .github/configs/nvidia-master.yaml (dsv4-fp4-gb300-dynamo-sglang-mtp at line 8803) still has image: lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034. AGENTS.md is explicit that multi-node srt-slurm image bumps must edit recipe yaml AND nvidia-master.yaml together, and that model.container must equal image: (the launcher uses the latter as the container-alias key) — otherwise either the bump doesn't take effect or results get mislabeled. Please bump image: on line 8804 to lmsysorg/sglang:v0.5.12 as well.

Extended reasoning...

What the bug is

This PR bumps model.container from lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e to lmsysorg/sglang:v0.5.12 in all six MTP recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/. The corresponding entry in .github/configs/nvidia-master.yaml, however, was not updated. Line 8804 still reads:

dsv4-fp4-gb300-dynamo-sglang-mtp:
  image: lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034

That entry owns all six recipes touched by this PR: lines 8826, 8842, 8857, 8872, 8887, and 8902 contain CONFIG_FILE= references pointing at exactly the six modified recipe files (disagg-low-latency-1p1d-tp4-tp4-mtp.yaml, disagg-low-latency-1p6d-dep4-tp4-mtp.yaml, disagg-mid-curve-1p1d-dep4-dep8-mtp.yaml, disagg-mid-curve-1p1d-dep4-dep16-mtp.yaml, disagg-mid-curve-2p1d-dep4-dep8-mtp.yaml, disagg-mid-curve-4p1d-dep4-dep8-mtp.yaml).

Why this matters (explicit AGENTS.md rule)

AGENTS.md line 113 lays out the invariant directly:

Multi-node srt-slurm changes must edit the recipe yaml AND nvidia-master.yaml together. srtctl reads only the recipe (model.container, resources, prefill/decode workers); the sweep generator (utils/matrix_logic/generate_sweep_configs.py) reads nvidia-master.yaml for frontend labels — its prefill/decode numbers never reach srtctl. Recipe-only edits mislabel results, master-only edits don't take effect. For image bumps, model.container must equal image:, since the launcher uses the latter as the container-alias key.

This PR is a recipe-only edit for an image bump, which is precisely the case AGENTS.md warns against.

Step-by-step proof

  1. Apply this PR. The six recipe YAMLs now declare container: lmsysorg/sglang:v0.5.12.
  2. The CI sweep generator (generate_sweep_configs.py) is invoked with .github/configs/nvidia-master.yaml (per the docs and run-sweep.yml workflow).
  3. It reads the dsv4-fp4-gb300-dynamo-sglang-mtp entry, which still has image: lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034. That string is used as the container-alias key / label.
  4. srtctl separately reads each recipe and launches lmsysorg/sglang:v0.5.12.
  5. Result: the workload actually runs on v0.5.12, but is labeled/recorded under nightly-dev-cu13-20260509-9ee83034 — i.e. mislabeled benchmark results. Conversely, if the alias key is required to match for the launcher path, the bump simply does not take effect for those configs.

Either failure mode is the exact outcome AGENTS.md describes.

How to fix

Update .github/configs/nvidia-master.yaml line 8804 to:

dsv4-fp4-gb300-dynamo-sglang-mtp:
  image: lmsysorg/sglang:v0.5.12

so that image: matches the model.container set in every recipe this entry references. (Note: the master image was ...20260509-9ee83034 while the recipes were on ...20260510-2473659e — these had already drifted, so the fix is to align both on v0.5.12.)

@github-actions
Copy link
Copy Markdown
Contributor

@functionstackx
Copy link
Copy Markdown
Collaborator

@claude rebase & resolve conflicts

@Klaud-Cold
Copy link
Copy Markdown
Collaborator

Klaud-Cold commented May 18, 2026

Claude finished @functionstackx's task in 1m 50s —— View job


Rebase & Resolve Conflicts

  • Fetch latest origin/main
  • Rebase dpskv4-gb300-sglang-v0.5.12 onto origin/main
  • Resolve merge conflict in perf-changelog.yaml
  • Force-push rebased branch

Rebased onto origin/main (4 new commits: #1446, #1415, #1428, #1477). There was one conflict in perf-changelog.yaml where both main and this branch appended entries at the end — resolved by keeping both (main's qwen3.5-bf16-b200-sglang entry, then this PR's dsv4-fp4-gb300-dynamo-sglang-mtp entry).

Branch is now 1 commit ahead, 0 behind main: 03c49caf.
| Branch

@Klaud-Cold Klaud-Cold force-pushed the dpskv4-gb300-sglang-v0.5.12 branch from cc55442 to 03c49ca Compare May 18, 2026 03:53
@functionstackx
Copy link
Copy Markdown
Collaborator

@claude add the image change to nvidia-master.yaml too container: "lmsysorg/sglang:v0.5.12

@Klaud-Cold
Copy link
Copy Markdown
Collaborator

Klaud-Cold commented May 18, 2026

Claude finished @functionstackx's task in 45s —— View job


Update nvidia-master.yaml Image

  • Read current nvidia-master.yaml to find the dsv4-fp4-gb300-dynamo-sglang-mtp entry
  • Update image: from nightly-dev-cu13-20260509-9ee83034 to lmsysorg/sglang:v0.5.12
  • Verified perf-changelog.yaml already has correct config-key and PR link (fixed during earlier rebase)
  • Commit and push

Updated .github/configs/nvidia-master.yaml line 8805 so image: matches the model.container in the recipe YAMLs (lmsysorg/sglang:v0.5.12). Pushed as 39575483.

View job | Branch

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

@yhyang201 yhyang201 force-pushed the dpskv4-gb300-sglang-v0.5.12 branch 2 times, most recently from fb1e9d4 to 4debe31 Compare May 18, 2026 15:22
@github-actions
Copy link
Copy Markdown
Contributor

@yhyang201 yhyang201 force-pushed the dpskv4-gb300-sglang-v0.5.12 branch from 4debe31 to 67291ab Compare May 18, 2026 15:26
@github-actions
Copy link
Copy Markdown
Contributor

@yhyang201 yhyang201 force-pushed the dpskv4-gb300-sglang-v0.5.12 branch from 67291ab to 0c7c630 Compare May 18, 2026 15:41
@github-actions
Copy link
Copy Markdown
Contributor

@yhyang201 yhyang201 force-pushed the dpskv4-gb300-sglang-v0.5.12 branch from 0c7c630 to 8adc687 Compare May 18, 2026 15:59
@github-actions
Copy link
Copy Markdown
Contributor

@yhyang201 yhyang201 changed the title Update dpskv4 GB300 MTP SGLang image to v0.5.12 Update dpskv4 GB300 MTP disagg SGLang image to nightly-20260519 May 19, 2026
@yhyang201 yhyang201 force-pushed the dpskv4-gb300-sglang-v0.5.12 branch from 8adc687 to 77727b2 Compare May 19, 2026 13:24
@github-actions
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants