[Klaud Cold] runners(mi325x): exclude broken enroot node chi-mi325x-pod1-121 #1477
Root-caused via the failed sweeps on #1467, #1468, #1469 (all three [Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every failure landed on chi-mi325x-pod1-121 with "enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted" before the .sqsh import even completes; the subsequent pyxis mount then fails with "No such file or directory". The same image works cleanly on every other up node (017/018/019/020/027), so this is confirmed not OOM and not a recipe issue.
This matches the existing pattern for mi300x in #1462 (pin salloc away from chronically-bad nodes); for mi325x there's currently only the one node to exclude, so use --exclude rather than --nodelist so we don't have to maintain the allow-list as nodes come and go.
pod1-121 has separately been drained on the controller with a watchdog (per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix the underlying setcap regression.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
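The --exclude versus --nodelist trade-off described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the contents of runners/launch_mi325x-amds.sh: the helper names and the partition argument are assumptions, and only the excluded hostname is taken from this PR.

```shell
# Illustrative sketch only: helper names and the partition value are hypothetical.

# Node named in this PR as having a broken enroot setcap.
BAD_NODE="chi-mi325x-pod1-121.ord.vultr.cpe.ice.amd.com"

# Deny-list form: print salloc arguments, one per line.
# $1 is a partition name supplied by the caller (an assumption here).
salloc_args_exclude() {
    local partition="$1"
    printf '%s\n' "--partition=${partition}" "--exclude=${BAD_NODE}"
}

# Allow-list form: every healthy node has to be passed in and kept current.
# $1 is the partition; the remaining arguments are node names.
salloc_args_nodelist() {
    local partition="$1"; shift
    local IFS=,                       # join the remaining args with commas
    printf '%s\n' "--partition=${partition}" "--nodelist=$*"
}
```

With the deny-list form the argument list only changes when another node breaks, whereas the allow-list form has to be edited every time a healthy node is added or removed. A caller could collect the output with `mapfile -t args < <(salloc_args_exclude some-partition)` and pass `"${args[@]}"` to salloc.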
LGTM — single-line salloc exclude mirroring the mi300x fix in #1462, well-documented inline.
Extended reasoning...
Overview
This PR adds --exclude=chi-mi325x-pod1-121.ord.vultr.cpe.ice.amd.com to the salloc invocation in runners/launch_mi325x-amds.sh. It is a one-line operational fix that prevents the launcher from scheduling jobs on a node whose enroot setcap is broken, with an explanatory comment block pointing to the diagnosis PRs.
Security risks
None. The change is purely a SLURM node-exclusion flag for an internal benchmarking runner. No auth, secrets, network exposure, or shell-injection surface is altered (the excluded hostname is a static literal).
Level of scrutiny
Very low. This is exactly the same pattern used for the mi300x runner in #1462 and follows the operational playbook in KLAUD_DEBUG.md §5.6. The node has additionally been drained on the SLURM controller with a watchdog, so this exclude is belt-and-suspenders.
Other factors
The author syntax-checked the script (bash -n), the inline comment names the offending node and links to the failing sweeps, and no bugs were flagged by the hunting system. Safe to shadow-approve.
) Root-caused via the failed sweeps on #1431, #1432, #1440, #1441, #1443. Every failure landed on one of two nodes:
mia1-p01-g09: pyxis: failed to create container filesystem (extended attributes not supported on the destination filesystem; pyxis can't mount the squashfs)
mia1-p01-g11: permission denied while trying to connect to docker.sock (cluster-cleanup `docker stop` step fails, cascading into pyxis-init failure)
Both are already known-bad per KLAUD_DEBUG.md §5.1 / §5.2, but the launcher wasn't excluding them. This mirrors the existing pattern in runners/launch_mi300x-amds.sh (#1462, pin to known-good nodes) and runners/launch_mi325x-amds.sh (#1477, exclude chi-mi325x-pod1-121).
Once this lands, the 5 affected mi355x PRs can be rebased to pick it up and the failed jobs will land on healthy nodes only.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…weep race (#1510)
The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE` without any locking, so when multiple sweep jobs landed on the cluster simultaneously they all tried to import the same image into the shared NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. First one wins; the rest crash with `[ERROR] File already exists: ...sqsh` and `OSError: [Errno 116] Stale file handle` (from the partial sqsh) once sglang/vllm tries to start.
Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs failed, all hitting the same race on h100-dgxc-slurm_0 + _1. Failure rate scales with sweep concurrency; it was masked previously because older H100 recipes had fewer matrix points sharing the cluster.
Switches to the canonical `flock -w 600 + unsquashfs -l skip-if-valid + enroot import` pattern already used in launch_h100-cw.sh, plus the mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior change.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
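The lock, validate, then import pattern named in this commit message could look roughly like the sketch below. It is an assumption-laden illustration rather than the code in any of the launchers: the function name, the `.lock` suffix, and the cleanup of a partial file are hypothetical, while `flock -w 600`, `unsquashfs -l`, and `enroot import -o` are the commands the message refers to.

```shell
# Illustrative sketch only: function name and lock-file naming are hypothetical.
# Import a container image to a shared squash file, serialized across jobs.
# $1 = image URI, $2 = destination .sqsh path on the shared filesystem.
import_squash_locked() {
    local image="$1" squash_file="$2"
    local lock_file="${squash_file}.lock"

    (
        # Wait up to 600 s for an exclusive lock on fd 9; give up on timeout.
        flock -w 600 9 || { echo "timed out waiting for ${lock_file}" >&2; exit 1; }

        # Skip the import if a readable squashfs is already in place.
        if [ -f "$squash_file" ] && unsquashfs -l "$squash_file" >/dev/null 2>&1; then
            echo "reusing ${squash_file}"
            exit 0
        fi

        # Remove any partial file left by an interrupted import, then import.
        rm -f "$squash_file"
        enroot import -o "$squash_file" "$image"
    ) 9>"$lock_file"
}
```

Because the validity check runs while the lock is held, a job that waited for the lock sees the finished file and skips the import instead of colliding with it.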
Summary
Adds --exclude=chi-mi325x-pod1-121.ord.vultr.cpe.ice.amd.com to the salloc in runners/launch_mi325x-amds.sh. Mirrors the mi300x infra-pin fix in #1462, but uses --exclude (since only one node is currently broken) so the launcher doesn't have to maintain an allow-list.
Root cause (from #1467 / #1468 / #1469 failed sweeps)
All three open [Klaud Cold] mi325x vllm/vllm-openai-rocm:v0.21.0 bumps failed identically. Every failure landed on chi-mi325x-pod1-121, all hitting "enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted" during enroot import (runners/launch_mi325x-amds.sh:27). The squash file is never created, so the second srun then fails with "No such file or directory". The bench script never even starts, so it's not OOM and not a recipe-level issue.
The same image (vllm/vllm-openai-rocm:v0.21.0) works cleanly on every other up node (017/018/019/020/027), as confirmed in the matching successful matrix runs (e.g. tp=8 ran at --gpu-memory-utilization=0.95 with 223 GiB KV-cache headroom).
PR-comment diagnoses:
Other actions taken
pod1-121 set State=DRAIN on the controller via scontrol, with a 10-second watchdog at /home/gharunner/_audit/drain_pod1-121_watchdog.sh re-applying the drain if SLURM auto-clears it (same pattern as KLAUD_DEBUG.md §5.6 for chi-mi300x-049). Holds until ops fix the underlying setcap regression on the node.
Test plan
bash -n runners/launch_mi325x-amds.sh syntax-checks.
pod1-121.
🤖 Generated with Claude Code
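The drain watchdog mentioned under "Other actions taken" is not shown in this PR. A minimal sketch of how such a loop might work, assuming the node is registered in SLURM under the same hostname used for --exclude, is given below; the function names and the drain reason string are hypothetical, and this is not the contents of drain_pod1-121_watchdog.sh.

```shell
# Illustrative sketch only: not the contents of the real watchdog script.
NODE="chi-mi325x-pod1-121.ord.vultr.cpe.ice.amd.com"  # assumed SLURM node name
REASON="enroot setcap broken"                         # hypothetical reason string

# Re-apply the drain if the node's reported state no longer contains "drain".
ensure_drained() {
    local state
    state="$(sinfo -h -n "$NODE" -o '%T')"
    case "$state" in
        *drain*) return 0 ;;          # covers "drained" and "draining"
    esac
    scontrol update NodeName="$NODE" State=DRAIN Reason="$REASON"
}

# Poll every 10 seconds, matching the interval quoted in the PR summary.
watchdog_loop() {
    while true; do
        ensure_drained || echo "drain re-apply failed for ${NODE}" >&2
        sleep 10
    done
}
```

Checking the state before calling `scontrol update` keeps the loop idempotent: the controller is only touched when the drain has actually been cleared.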