[Klaud Cold] runners(mi355x): exclude broken nodes mia1-p01-g09 + mia1-p01-g11 #1498
Root-caused via the failed sweeps on #1431, #1432, #1440, #1441, #1443. Every failure landed on either:
- mia1-p01-g09: pyxis: failed to create container filesystem (extended attributes not supported on the destination filesystem; pyxis can't mount the squashfs)
- mia1-p01-g11: permission denied while trying to connect to docker.sock (cluster-cleanup `docker stop` step fails, cascading into pyxis-init failure)
Both are already known-bad per KLAUD_DEBUG.md §5.1 / §5.2, but the launcher wasn't excluding them.
This mirrors the existing pattern in runners/launch_mi300x-amds.sh (#1462: pin to known-good nodes) and runners/launch_mi325x-amds.sh (#1477: exclude chi-mi325x-pod1-121).
Once this lands, the 5 affected mi355x PRs can be rebased to pick it up, and the failed jobs will land on healthy nodes only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LGTM — small, mechanical exclude-list addition mirroring the same pattern used for mi300x/mi325x in #1462/#1477.
Overview
This PR adds --exclude=mia1-p01-g09,mia1-p01-g11 to a single salloc invocation in runners/launch_mi355x-amds.sh. The change is one line of code plus an explanatory comment block referencing the KLAUD_DEBUG sections that root-cause each node's failure mode. No other logic is touched.
Security risks
None. This is a SLURM scheduling hint that narrows the pool of compute nodes a job can land on — no auth, crypto, permissions, or data-handling surface is affected. The change can only result in jobs being scheduled on fewer nodes, not more.
Level of scrutiny
Low. This is operational/runner config in CI-launcher territory, not production-critical inference code. The change is mechanical, follows an established pattern visible in recent commit history (48c3388 is the same change; sibling launchers got similar treatment via #1462 and #1477), and is well-documented inline. The blast radius of a mistake is at worst that jobs get scheduled on broken nodes (i.e. the current state).
Other factors
The bug hunter flagged a nit suggesting the exclude list should also cover mia1-p01-g12 and mia1-p01-g31 per KLAUD_DEBUG §5.2. The reporter explicitly classified this as a follow-up rather than a regression — the PR strictly improves the baseline. It is reasonable to land this change as-is and address the additional nodes in a separate PR if/when they actually start showing up in sweep failures. Approving.
    # Exclude known-bad mi355x compute nodes (KLAUD_DEBUG §5.1 / §5.2):
    # mia1-p01-g09: pyxis broken (persistently fails to create container filesystem)
    # mia1-p01-g11: docker.sock permissions denied (cluster-cleanup step fails)
    # Both have been root-caused via #1431/#1432/#1440/#1441/#1443 sweep failures.
    salloc --partition=$PARTITION --exclude=mia1-p01-g09,mia1-p01-g11 --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=500 --no-shell --job-name="$RUNNER_NAME"
🟡 The new --exclude=mia1-p01-g09,mia1-p01-g11 only covers 1 of the 3 nodes that KLAUD_DEBUG.md §5.2 explicitly groups as sharing the docker.sock-permissions failure (mia1-p01-g11 / g12 / g31). §5.2 also states "Recipe-level workaround: none" — i.e. g12 and g31 are not drained at the SLURM level, so salloc can still land on them and the very next srun ... docker stop $(docker ps -a -q) (line 197) will hit the identical cascade this PR is trying to prevent. Consider extending to --exclude=mia1-p01-g[09,11,12,31] (or comma-separated equivalent).
What the bug is
This PR adds --exclude=mia1-p01-g09,mia1-p01-g11 to the salloc on line 191 of runners/launch_mi355x-amds.sh, citing KLAUD_DEBUG §5.1 / §5.2 as the justification. However, §5.2 of that very file (lines 114-116) explicitly groups three nodes together as sharing the identical failure mode:
5.2 mia1-p01-g11 / g12 / g31: docker socket perms
Symptom: mi355x jobs fail with `permission denied while trying to connect to the docker API at unix:///var/run/docker.sock` during the `docker stop $(docker ps -a -q)` cleanup step, cascading into SLURM job expiration.
Fix: ops needs to fix docker group / socket perms on these nodes. Recipe-level workaround: none.
The PR only excludes g11, leaving g12 and g31 reachable by SLURM with the documented identical defect.
Why existing code does not prevent this
The g19/g37 nodes from §5.1 don't need to be in --exclude because §5.1 says they are kept in State=DRAIN/DOWN by ops, so salloc won't allocate to them anyway. But §5.2 makes no such claim about g12/g31 — it explicitly states "Recipe-level workaround: none", meaning they are not drained at the SLURM level. The very fact that the PR had to add g09 to --exclude despite §5.1 calling it "persistently drained" demonstrates the drain state is unreliable in practice (consistent with §5.6 about DYNAMIC_NORM nodes auto-clearing DRAIN).
Impact / proof
Walk through the failure case:
- The runner script is invoked, e.g. via one of the affected PRs (#1431/#1432/#1440/#1441/#1443).
- `salloc --partition=compute --exclude=mia1-p01-g09,mia1-p01-g11 --gres=gpu:$TP ...` is issued (line 191).
- SLURM picks an available node. Since `g12` and `g31` are not in `--exclude` and are not drained per §5.2, they remain in the allocation pool alongside healthy nodes.
- Suppose SLURM picks `mia1-p01-g12`.
- The next command runs: `srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)"` (line 197).
- On `g12`, `docker ps -a -q` fails with `permission denied while trying to connect to the docker API at unix:///var/run/docker.sock`, the exact symptom from §5.2.
- The cleanup step exits non-zero, cascading into SLURM job expiration: the identical failure the PR is trying to prevent for the `g11` case.
The PR's empirical argument ("every failure landed on g09/g11 across 5 sweep PRs") is sampling luck on a ~12-node pool with several already drained. With only ~5 trials, observing g12/g31 0 times has a non-trivial probability even if the underlying defect is present, and §5.2 explicitly says it is present.
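To put a rough number on the sampling argument, here is a back-of-the-envelope sketch. It assumes each allocation picks uniformly and independently among n eligible nodes, b of which are bad, so the chance that k allocations all miss the bad nodes is (1 - b/n)^k. The values n=12, b=2, k=5 are taken loosely from the figures quoted above and are illustrative only; real SLURM placement is not uniform, and drained nodes shrink the pool.

```shell
#!/usr/bin/env bash
# Illustrative only: probability that k independent allocations all avoid the
# bad nodes, ASSUMING uniform selection among n eligible nodes with b bad ones.
# miss_probability is a hypothetical helper, not part of the launcher.
miss_probability() {
    local n="$1" b="$2" k="$3"
    awk -v n="$n" -v b="$b" -v k="$k" 'BEGIN { printf "%.3f\n", (1 - b / n) ^ k }'
}

# 12-node pool, 2 bad nodes (g12, g31), 5 trials
miss_probability 12 2 5   # prints 0.402
```

Under these assumptions, seeing neither g12 nor g31 in five trials happens about 40% of the time, so their absence from the observed failures says little about their health.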
How to fix
Extend the exclude list to cover all three §5.2 nodes:
`salloc --partition=$PARTITION --exclude=mia1-p01-g[09,11,12,31] ...`
or equivalently `--exclude=mia1-p01-g09,mia1-p01-g11,mia1-p01-g12,mia1-p01-g31`. The inline comment should be updated correspondingly. This is a follow-up improvement rather than a regression: the PR strictly improves the baseline, so it does not need to block on this, but the documented gap is worth closing in the same change.
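One way to keep the list maintainable, sketched under the assumption that the launcher is free to introduce a small array (the names `BAD_NODES`, `join_by_comma`, and `EXCLUDE_LIST` are hypothetical and do not appear in the actual script), is to hold the node names in one place and join them for `--exclude`:

```shell
#!/usr/bin/env bash
# Sketch only: known-bad nodes kept in a single array (hypothetical name),
# one entry per line so each can carry its own comment.
BAD_NODES=(
    mia1-p01-g09   # §5.1: pyxis broken
    mia1-p01-g11   # §5.2: docker.sock permissions
    mia1-p01-g12   # §5.2: docker.sock permissions
    mia1-p01-g31   # §5.2: docker.sock permissions
)

# Join all arguments with commas by temporarily setting IFS inside the function.
join_by_comma() {
    local IFS=,
    echo "$*"
}

EXCLUDE_LIST="$(join_by_comma "${BAD_NODES[@]}")"

# Print the flag that would be passed to salloc.
echo "--exclude=${EXCLUDE_LIST}"
```

The resulting string is the comma-separated form shown above; adding or removing a node then becomes a one-line change next to the comment that justifies it.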
…weep race (#1510)
The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE` without any locking, so when multiple sweep jobs landed on the cluster simultaneously they all tried to import the same image into the shared NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. The first one wins; the rest crash with `[ERROR] File already exists: ...sqsh` and `OSError: [Errno 116] Stale file handle` (from the partial sqsh) once sglang/vllm tries to start.
Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs failed, all hitting the same race on h100-dgxc-slurm_0 + _1. The failure rate scales with sweep concurrency; it was masked previously because older H100 recipes had fewer matrix points sharing the cluster.
Switches to the canonical `flock -w 600` + `unsquashfs -l` skip-if-valid + `enroot import` pattern already used in launch_h100-cw.sh, plus the mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior change.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
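The locking pattern named in that commit message can be sketched roughly as follows. This is not the launcher's actual code: the function name `import_image_locked`, the `.lock` suffix, and the `IMPORT_CMD` override are assumptions made for illustration, and the real launcher wraps the import in `srun`.

```shell
#!/usr/bin/env bash
# Sketch only: serialize the image import across jobs that share one squashfs path.
# import_image_locked, the ".lock" suffix, and IMPORT_CMD are hypothetical names.
import_image_locked() {
    local squash_file="$1" image="$2"
    local lock_file="${squash_file}.lock"

    (
        # Wait up to 600 s for an exclusive lock on file descriptor 9.
        flock -w 600 9 || { echo "timed out waiting for ${lock_file}" >&2; exit 1; }

        # Skip the import when a readable squashfs is already in place.
        if unsquashfs -l "$squash_file" >/dev/null 2>&1; then
            echo "reusing ${squash_file}"
            exit 0
        fi

        # Drop any partial file left behind by an interrupted import, then import.
        rm -f "$squash_file"
        ${IMPORT_CMD:-enroot import} -o "$squash_file" "docker://${image}"
    ) 9>"$lock_file"
}
```

Because the validity check runs while the lock is held, a job that waited behind the first importer finds a complete file and skips the import instead of colliding with it.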
Summary
Adds `--exclude=mia1-p01-g09,mia1-p01-g11` to the salloc in `runners/launch_mi355x-amds.sh`. Mirrors the same pattern as #1462 (mi300x, `--nodelist=`) and #1477 (mi325x, `--exclude=chi-mi325x-pod1-121`).
Root cause (from sweeps on #1431/#1432/#1440/#1441/#1443)
Every failure landed on one of two broken nodes:
- `mia1-p01-g09`: `pyxis: failed to create container filesystem` (extended attributes not supported on the destination filesystem; pyxis can't mount the squashfs)
- `mia1-p01-g11`: `permission denied while trying to connect to docker.sock` (cluster-cleanup `docker stop` step fails, cascading into pyxis init)
Both are already documented as known-bad in `KLAUD_DEBUG.md` §5.1 (g09: "persistently drained - pyxis broken") and §5.2 (g11: "docker socket perms"), but the launcher wasn't excluding them. Failures were 100% deterministic: the same image works cleanly on every other mi355x node (e.g. `g16`, `g08`, etc.).
Other actions taken
After this PR merges, the 5 affected mi355x PRs (#1431/#1432/#1440/#1441/#1443) will be rebased to pick up the launcher fix and re-trigger their sweeps on healthy nodes only.
Test plan
- `bash -n runners/launch_mi355x-amds.sh` syntax-checks.
🤖 Generated with Claude Code