[Klaud Cold] runners(mi300x): pin salloc to known-good nodes by functionstackx · Pull Request #1462 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-18T01:24:15Z

Summary

Adds an explicit --nodelist=chi-mi300x-[034-036,054,057-058].ord.vultr.cpe.ice.amd.com to the mi300x salloc, mirroring the pattern already used in runners/launch_b300-nv.sh:336.
Three of the nine mi300x nodes are currently unusable:
- chi-mi300x-033, chi-mi300x-037 — down (Not responding)
- chi-mi300x-049 — drained for persistent /nvme_home disk-full (kept down by a watchdog re-applying State=DOWN every 10s)
Symptom this fixes: sweep jobs land on a doomed node and fail at pyxis extraction (No space left on device) or with srun: Node failure. For current examples, see #1426 (Update qwen3.5-bf16-mi300x-sglang SGLang image to v0.5.12-rocm720-mi30x) and #1403 ([Handoff to @Oseltamivir Claude /loop] Update gptoss-fp4-mi300x-vllm vLLM ROCm image to v0.21.0).
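To sanity-check which hosts the bracket expression in the summary names, here is a minimal sketch of a hostlist expander. It assumes a single bracket group with comma-separated, zero-padded ranges, which is all this expression uses. The function name `expand_hostlist` is hypothetical and not part of the repo. On a Slurm login node, `scontrol show hostnames <expr>` should give the same expansion.

```python
import re


def expand_hostlist(expr: str) -> list[str]:
    """Expand a single-bracket hostlist such as 'n[01-03,07].example'."""
    m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if m is None:
        return [expr]  # no bracket group: treat as one hostname
    prefix, body, suffix = m.groups()
    hosts = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo          # a bare number is a one-element range
        width = len(lo)        # preserve zero padding, e.g. 034
        for n in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


# The expression from the PR summary.
NODELIST = "chi-mi300x-[034-036,054,057-058].ord.vultr.cpe.ice.amd.com"
healthy = expand_hostlist(NODELIST)

print(len(healthy))
for host in healthy:
    print(host)
```

The expansion yields six hosts (034, 035, 036, 054, 057, 058) and none of the three unusable nodes (033, 037, 049).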

Test plan

- Subsequent mi300x sweeps land only on the 6 healthy nodes.
- No Node failure / pyxis ... No space left on device errors after this lands.

🤖 Generated with Claude Code

Three of the nine mi300x compute nodes are currently unusable:
- chi-mi300x-033, chi-mi300x-037: down (Not responding)
- chi-mi300x-049: drained for persistent /nvme_home disk-full (kept down by a watchdog re-applying State=DOWN every 10s)

Without a nodelist filter, salloc sometimes lands a job on a node that's about to be drained or that has a half-extracted enroot dir, causing 'pyxis: failed to create container filesystem (No space left on device)' / 'srun: Node failure' / 'manifest unknown'-style errors visible in PRs #1426 and #1403.

Add an explicit --nodelist of the 6 healthy nodes (mirroring how runners/launch_b300-nv.sh:336 pins to the known-good B300 set).

claude

LGTM — small ops fix pinning salloc to known-good mi300x nodes, mirroring the existing pattern in launch_b300-nv.sh.

Extended reasoning...

Overview

This PR adds a --nodelist constraint to the salloc invocation in runners/launch_mi300x-amds.sh, restricting allocations to six known-good mi300x nodes (034-036, 054, 057-058) and excluding three currently-unusable nodes (033, 037 down; 049 drained for disk-full). The pattern mirrors an existing usage in runners/launch_b300-nv.sh.

Security risks

None. This is a runner orchestration script for CI benchmark jobs; the change is a Slurm scheduling constraint with no security-sensitive surface (no auth, crypto, permissions, or user-controllable input affected).

Level of scrutiny

Low. This is a one-line operational hotfix to an internal runner script. The change is mechanical, the rationale is documented inline as a comment, and the impact is bounded: at worst, salloc will fail to allocate if all listed nodes are busy, which is the same failure surface as today when a doomed node is picked.

Other factors

No bugs were reported by the bug hunting system.
The hardcoded nodelist is a known tradeoff for a hotfix — future node health changes will require another PR — but that maintenance pattern is already in use elsewhere in this repo.
No prior reviewer comments to address.

Root-caused via the failed sweeps on #1467, #1468, #1469 (all three [Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every failure landed on chi-mi325x-pod1-121 with enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted before the .sqsh import even completes; the subsequent pyxis mount then fails with "No such file or directory". The same image works cleanly on every other up node (017/018/019/020/027), so this is confirmed not OOM and not a recipe issue.

This matches the existing pattern for mi300x in #1462 (pin salloc away from chronically-bad nodes). For mi325x there's currently only the one node to exclude, so use --exclude rather than --nodelist so we don't have to maintain the allow-list as nodes come and go.

pod1-121 has separately been drained on the controller with a watchdog (per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix the underlying setcap regression.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
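The maintenance argument for a deny-list over an allow-list can be illustrated with a toy set filter. This is only a sketch of the intent described in the message; it does not model how Slurm itself interprets `--nodelist` or `--exclude`. The helper `usable_nodes` and the node label `130` are hypothetical.

```python
def usable_nodes(pool, allow=None, exclude=None):
    """Toy filter: keep nodes in `allow` (if given), then drop those in `exclude`."""
    nodes = set(pool)
    if allow is not None:
        nodes &= set(allow)
    if exclude is not None:
        nodes -= set(exclude)
    return sorted(nodes)


# Short node labels as written in the commit message: five up nodes plus pod1-121.
pool_today = ["017", "018", "019", "020", "027", "121"]
allow_list = ["017", "018", "019", "020", "027"]
deny_list = ["121"]

# Hypothetical: a new healthy node joins the pool later ("130" is made up).
pool_later = pool_today + ["130"]

via_allow = usable_nodes(pool_later, allow=allow_list)
via_deny = usable_nodes(pool_later, exclude=deny_list)

print(via_allow)  # the new node is missing until someone edits the allow-list
print(via_deny)   # the new node is picked up with no launcher change
```

With a single bad node, the deny-list stays one entry long, and new nodes become usable without another PR.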

Root-caused via the failed sweeps on #1431, #1432, #1440, #1441, #1443. Every failure landed on either:
- mia1-p01-g09: pyxis: failed to create container filesystem (extended attributes not supported on the destination filesystem; pyxis can't mount the squashfs)
- mia1-p01-g11: permission denied while trying to connect to docker.sock (cluster-cleanup `docker stop` step fails, cascading into pyxis-init failure)

Both are already known-bad per KLAUD_DEBUG.md §5.1 / §5.2, but the launcher wasn't excluding them. This mirrors the existing pattern in runners/launch_mi300x-amds.sh (#1462: pin to known-good nodes) and runners/launch_mi325x-amds.sh (#1477: exclude chi-mi325x-pod1-121).

Once this lands, the 5 affected mi355x PRs can be rebased to pick it up and the failed jobs will land on healthy nodes only.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…weep race (#1510)

The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE` without any locking, so when multiple sweep jobs landed on the cluster simultaneously they all tried to import the same image into the shared NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. The first one wins; the rest crash with `[ERROR] File already exists: ...sqsh` and `OSError: [Errno 116] Stale file handle` (from the partial sqsh) once sglang/vllm tries to start.

Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs failed, all hitting the same race on h100-dgxc-slurm_0 + _1. The failure rate scales with sweep concurrency; it was masked previously because older H100 recipes had fewer matrix points sharing the cluster.

Switches to the canonical `flock -w 600` + `unsquashfs -l` skip-if-valid + `enroot import` pattern already used in launch_h100-cw.sh, plus the mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
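The lock, skip-if-valid, import sequence described in this message can be sketched in Python. This is an illustration under stated assumptions, not the launcher's actual shell code. `import_once`, `is_valid` and `do_import` are hypothetical names; in the launchers the validity check and the import are `unsquashfs -l` and `enroot import` per the message above. Removing an invalid leftover file before re-importing is an assumption, not something the message states.

```python
import fcntl
import os
import tempfile
import time


def import_once(squash_path, is_valid, do_import, timeout_s=600.0, poll_s=0.5):
    """Serialize imports of one image across processes; return True if we imported."""
    lock_path = squash_path + ".lock"
    deadline = time.monotonic() + timeout_s
    with open(lock_path, "w") as lock:
        # Poll for an exclusive lock, giving up after timeout_s (like `flock -w`).
        while True:
            try:
                fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"could not lock {lock_path}")
                time.sleep(poll_s)
        try:
            if os.path.exists(squash_path) and is_valid(squash_path):
                return False  # another job already produced a good image
            if os.path.exists(squash_path):
                os.remove(squash_path)  # assumption: drop a partial file first
            do_import(squash_path)
            return True
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)


# Stand-ins for the real validity check and import (both hypothetical).
calls = []


def fake_import(path):
    calls.append(path)
    with open(path, "w") as f:
        f.write("sqsh")


def fake_valid(path):
    return os.path.getsize(path) > 0


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "image.sqsh")
    first = import_once(target, fake_valid, fake_import)
    second = import_once(target, fake_valid, fake_import)

print(first, second, len(calls))  # True False 1
```

The second caller finds a valid file under the lock and skips the import, which is the behavior that avoids the `File already exists` crash described above.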

functionstackx requested a review from a team May 18, 2026 01:24

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

functionstackx merged commit 4da367c into main May 18, 2026
4 checks passed

functionstackx deleted the add-mi300x-nodelist-filter branch May 18, 2026 01:24

github-project-automation Bot moved this to Done in InferenceMAX Board May 18, 2026

claude Bot reviewed May 18, 2026


functionstackx mentioned this pull request May 18, 2026

[Klaud Cold] runners(mi325x): exclude broken enroot node chi-mi325x-pod1-121 #1477

Merged

2 tasks

functionstackx mentioned this pull request May 18, 2026

[Klaud Cold] runners(mi355x): exclude broken nodes mia1-p01-g09 + mia1-p01-g11 #1498

Merged

2 tasks

functionstackx mentioned this pull request May 18, 2026

[Klaud Cold] runners(h100-dgxc-slurm): flock the enroot import to fix concurrent-sweep race #1510

Merged

2 tasks

claude Bot mentioned this pull request May 20, 2026

[Klaud Cold] mi300x runner: switch --nodelist pin to --exclude -049 #1532

Merged

1 task

functionstackx merged 1 commit into main from add-mi300x-nodelist-filter
