[Klaud Cold] runners(mi355x): exclude broken nodes mia1-p01-g09 + mia1-p01-g11 by functionstackx · Pull Request #1498 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-18T16:27:08Z

Summary

Adds --exclude=mia1-p01-g09,mia1-p01-g11 to the salloc in runners/launch_mi355x-amds.sh. Mirrors the same pattern as #1462 (mi300x, --nodelist=) and #1477 (mi325x, --exclude=chi-mi325x-pod1-121).

Root cause (from sweeps on #1431/#1432/#1440/#1441/#1443)

Every failure landed on one of two broken nodes:

| Node | Error |
| --- | --- |
| `mia1-p01-g09` | `pyxis: failed to create container filesystem` (extended attributes not supported on the destination filesystem; pyxis can't mount the squashfs) |
| `mia1-p01-g11` | `permission denied while trying to connect to docker.sock` (cluster-cleanup `docker stop` step fails, cascading into pyxis init) |

Both are already documented as known-bad in KLAUD_DEBUG.md §5.1 (g09: "persistently drained — pyxis broken") and §5.2 (g11: "docker socket perms"), but the launcher wasn't excluding them. Failures were 100% deterministic — the same image works cleanly on every other mi355x node (e.g. g16, g08, etc.).

Other actions taken

After this PR merges, the 5 affected mi355x PRs (#1431/#1432/#1440/#1441/#1443) will be rebased to pick up the launcher fix and re-trigger their sweeps on healthy nodes only.

Test plan

bash -n runners/launch_mi355x-amds.sh syntax-checks.
After merge: the rebased sweeps for the following PRs complete without landing on g09/g11:
- #1431: Update dsr1-fp4-mi355x-sglang SGLang image to v0.5.12-rocm700-mi35x
- #1432: Update dsr1-fp8-mi355x-sglang SGLang image to v0.5.12-rocm700-mi35x
- #1440: Update glm5-fp8-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517
- #1441: [Handoff to @Oseltamivir Claude /loop] Update glm5.1-fp4-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517
- #1443: Update qwen3.5-bf16-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517
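The post-merge check in the test plan could be scripted roughly as follows. This is a minimal sketch, assuming each sweep job ran on a single node and that its node name is available as a plain comma-separated string (a compressed hostlist such as `g[09,11]` would first need expanding, for example with `scontrol show hostnames`). The helper name `landed_on_excluded` is hypothetical and not part of the launcher.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exits 0 if any node in a comma-separated node list
# also appears in the comma-separated exclude list, 1 otherwise.
landed_on_excluded() {
    local nodelist=$1 excluded=$2 node bad
    local -a nodes bads
    IFS=',' read -r -a nodes <<< "$nodelist"
    IFS=',' read -r -a bads <<< "$excluded"
    for node in "${nodes[@]}"; do
        for bad in "${bads[@]}"; do
            [[ "$node" == "$bad" ]] && return 0
        done
    done
    return 1
}

EXCLUDED="mia1-p01-g09,mia1-p01-g11"

# Literal node name used for illustration; in practice it could come from
# something like: sacct -j "$JOB_ID" -X -n -o NodeList
if landed_on_excluded "mia1-p01-g16" "$EXCLUDED"; then
    echo "FAIL: job landed on an excluded node"
else
    echo "OK: job avoided excluded nodes"
fi
```

Running this per job ID after each rebased sweep would turn the second test-plan item into a mechanical check rather than a manual read of the logs.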

🤖 Generated with Claude Code

Root-caused via the failed sweeps on #1431, #1432, #1440, #1441, #1443. Every failure landed on either:

mia1-p01-g09: pyxis: failed to create container filesystem (extended attributes not supported on the destination filesystem; pyxis can't mount the squashfs)
mia1-p01-g11: permission denied while trying to connect to docker.sock (cluster-cleanup `docker stop` step fails; cascading into pyxis-init failure)

Both are already known-bad per KLAUD_DEBUG.md §5.1 / §5.2, but the launcher wasn't excluding them.

This mirrors the existing pattern in runners/launch_mi300x-amds.sh (#1462: pin to known-good nodes) and runners/launch_mi325x-amds.sh (#1477: exclude chi-mi325x-pod1-121).

Once this lands the 5 affected mi355x PRs can be rebased to pick it up and the failed jobs will land on healthy nodes only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude

LGTM — small, mechanical exclude-list addition mirroring the same pattern used for mi300x/mi325x in #1462/#1477.

Extended reasoning...

Overview

This PR adds --exclude=mia1-p01-g09,mia1-p01-g11 to a single salloc invocation in runners/launch_mi355x-amds.sh. The change is one line of code plus an explanatory comment block referencing the KLAUD_DEBUG sections that root-cause each node's failure mode. No other logic is touched.

Security risks

None. This is a SLURM scheduling hint that narrows the pool of compute nodes a job can land on — no auth, crypto, permissions, or data-handling surface is affected. The change can only result in jobs being scheduled on fewer nodes, not more.

Level of scrutiny

Low. This is operational/runner config in CI-launcher territory, not production-critical inference code. The change is mechanical, follows an established pattern visible in recent commit history (48c3388 is the same change; sibling launchers got similar treatment via #1462 and #1477), and is well-documented inline. The blast radius of a mistake is at worst that jobs get scheduled on broken nodes (i.e. the current state).

Other factors

The bug hunter flagged a nit suggesting the exclude list should also cover mia1-p01-g12 and mia1-p01-g31 per KLAUD_DEBUG §5.2. The reporter explicitly classified this as a follow-up rather than a regression — the PR strictly improves the baseline. It is reasonable to land this change as-is and address the additional nodes in a separate PR if/when they actually start showing up in sweep failures. Approving.

claude · 2026-05-18T16:32:33Z

+    # Exclude known-bad mi355x compute nodes (KLAUD_DEBUG §5.1 / §5.2):
+    #   mia1-p01-g09: pyxis broken (persistently fails to create container filesystem)
+    #   mia1-p01-g11: docker.sock permissions denied (cluster-cleanup step fails)
+    # Both have been root-caused via #1431/#1432/#1440/#1441/#1443 sweep failures.
+    salloc --partition=$PARTITION --exclude=mia1-p01-g09,mia1-p01-g11 --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=500 --no-shell --job-name="$RUNNER_NAME"


🟡 The new --exclude=mia1-p01-g09,mia1-p01-g11 only covers 1 of the 3 nodes that KLAUD_DEBUG.md §5.2 explicitly groups as sharing the docker.sock-permissions failure (mia1-p01-g11 / g12 / g31). §5.2 also states "Recipe-level workaround: none" — i.e. g12 and g31 are not drained at the SLURM level, so salloc can still land on them and the very next srun ... docker stop $(docker ps -a -q) (line 197) will hit the identical cascade this PR is trying to prevent. Consider extending to --exclude=mia1-p01-g[09,11,12,31] (or comma-separated equivalent).

Extended reasoning...

What the bug is

This PR adds --exclude=mia1-p01-g09,mia1-p01-g11 to the salloc on line 191 of runners/launch_mi355x-amds.sh, citing KLAUD_DEBUG §5.1 / §5.2 as the justification. However, §5.2 of that very file (lines 114-116) explicitly groups three nodes together as sharing the identical failure mode:

5.2 mia1-p01-g11 / g12 / g31 — docker socket perms

Symptom: mi355x jobs fail with permission denied while trying to connect to the docker API at unix:///var/run/docker.sock during the docker stop $(docker ps -a -q) cleanup step, cascading into SLURM job expiration.
Fix: ops needs to fix docker group / socket perms on these nodes. Recipe-level workaround: none.

The PR only excludes g11, leaving g12 and g31 reachable by SLURM with the documented identical defect.

Why existing code does not prevent this

The g19/g37 nodes from §5.1 don't need to be in --exclude because §5.1 says they are kept in State=DRAIN/DOWN by ops, so salloc won't allocate to them anyway. But §5.2 makes no such claim about g12/g31 — it explicitly states "Recipe-level workaround: none", meaning they are not drained at the SLURM level. The very fact that the PR had to add g09 to --exclude despite §5.1 calling it "persistently drained" demonstrates the drain state is unreliable in practice (consistent with §5.6 about DYNAMIC_NORM nodes auto-clearing DRAIN).
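Whether a node is actually drained at a given moment can be checked rather than assumed. The sketch below is a minimal illustration, assuming the output of `sinfo -h -N -o '%N %T'` (one `node state` pair per line) is piped in. The function name `schedulable_nodes` is hypothetical, and the sample input uses made-up placeholder node names and states, not data captured from the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads "node state" lines on stdin and prints the
# nodes whose state does not start with "drain" or "down".
schedulable_nodes() {
    awk 'tolower($2) !~ /^(drain|down)/ { print $1 }' | sort -u
}

# Made-up sample input for illustration only.
printf '%s\n' \
    'node-a idle' \
    'node-b drained' \
    'node-c down*' \
    'node-d mixed' | schedulable_nodes
```

On the real cluster the input would be `sinfo -h -N -o '%N %T' --partition="$PARTITION"`. Any §5.1 or §5.2 node that appears in the output is one `salloc` can still pick.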

Impact / proof

Walk through the failure case:

1. The runner script is invoked, e.g. via one of the affected PRs (#1431/#1432/#1440/#1441/#1443).

2. `salloc --partition=compute --exclude=mia1-p01-g09,mia1-p01-g11 --gres=gpu:$TP ...` is issued (line 191).

3. SLURM picks an available node. Since g12 and g31 are not in --exclude and are not drained per §5.2, they remain in the allocation pool alongside healthy nodes.

4. Suppose SLURM picks mia1-p01-g12.

5. The next command runs: `srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)"` (line 197).

6. On g12, `docker ps -a -q` fails with permission denied while trying to connect to the docker API at unix:///var/run/docker.sock, the exact symptom from §5.2.

7. The cleanup step exits non-zero, cascading into SLURM job expiration: the identical failure the PR is trying to prevent for the g11 case.

The PR's empirical argument ("every failure landed on g09/g11 across 5 sweep PRs") is sampling luck on a ~12-node pool with several already drained. With only ~5 trials, observing g12/g31 0 times has a non-trivial probability even if the underlying defect is present, and §5.2 explicitly says it is present.
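To put a rough number on the sampling argument: if each allocation independently picked uniformly among N eligible nodes, b of which are bad, the chance that k allocations all miss the bad nodes would be ((N - b) / N)^k. The values below (N=8, b=2, k=5) are illustrative assumptions, not measurements. Real SLURM placement is neither uniform nor independent, so this is only an order-of-magnitude sketch, and the function name `miss_probability` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: probability that k uniform, independent allocations
# over n eligible nodes never land on any of the b bad nodes.
miss_probability() {
    awk -v n="$1" -v b="$2" -v k="$3" \
        'BEGIN { printf "%.4f\n", ((n - b) / n) ^ k }'
}

# Assumed values: 8 eligible nodes, 2 of them bad (g12, g31), 5 trials.
miss_probability 8 2 5   # prints 0.2373
```

Under these assumptions there is roughly a one-in-four chance of never observing g12 or g31 across five allocations, so their absence from the sweep failures says little about whether they are healthy.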

How to fix

Extend the exclude list to cover all three §5.2 nodes:

salloc --partition=$PARTITION --exclude=mia1-p01-g[09,11,12,31] ...

or equivalently --exclude=mia1-p01-g09,mia1-p01-g11,mia1-p01-g12,mia1-p01-g31. The inline comment should be updated correspondingly. This is a follow-up improvement rather than a regression — the PR strictly improves the baseline, so it does not need to block on this, but the documented gap is worth closing in the same change.
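One way to keep the exclude list easy to extend is to hold the node names in an array and join them with commas. This is a sketch, assuming bash. The names `BAD_NODES`, `join_by_comma`, and `EXCLUDE_LIST` are hypothetical and do not appear in the launcher, and the `salloc` line is only echoed here rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical list of known-bad nodes, one per line so that adding or
# removing a node is a one-line diff.
BAD_NODES=(
    mia1-p01-g09   # KLAUD_DEBUG 5.1: pyxis broken
    mia1-p01-g11   # KLAUD_DEBUG 5.2: docker socket perms
    mia1-p01-g12   # KLAUD_DEBUG 5.2
    mia1-p01-g31   # KLAUD_DEBUG 5.2
)

# Join all arguments with commas.
join_by_comma() { local IFS=','; echo "$*"; }

EXCLUDE_LIST=$(join_by_comma "${BAD_NODES[@]}")

# Echoed for illustration; the remaining salloc flags are unchanged.
echo "salloc --partition=\$PARTITION --exclude=$EXCLUDE_LIST ..."
```

The per-node comments then double as the inline documentation the review asks to keep in sync with the list.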

…weep race (#1510)

The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE` without any locking, so when multiple sweep jobs landed on the cluster simultaneously they all tried to import the same image into the shared NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. First one wins; the rest crash with `[ERROR] File already exists: ...sqsh` and `OSError: [Errno 116] Stale file handle` (from the partial sqsh) once sglang/vllm tries to start.

Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs failed, all hitting the same race on h100-dgxc-slurm_0 + _1. Failure rate scales with sweep concurrency; it was masked previously because older H100 recipes had fewer matrix points sharing the cluster.

Switches to the canonical `flock -w 600 + unsquashfs -l skip-if-valid + enroot import` pattern already used in launch_h100-cw.sh, plus the mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
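The lock-then-validate-then-import pattern named in this commit message could look roughly like the following. This is a sketch under stated assumptions, not the launcher's actual code: the function names `is_valid_squash`, `do_import`, and `import_once` are hypothetical, the lock file path is chosen by the caller, and the lock is assumed to live on a filesystem where `flock` is honoured across the nodes involved.

```shell
#!/usr/bin/env bash
# Thin wrappers around the real tools, kept as functions so they can be
# replaced when experimenting without enroot or squashfs-tools installed.
is_valid_squash() { unsquashfs -l "$1" > /dev/null 2>&1; }
do_import()       { enroot import -o "$1" "$2"; }

# Usage: import_once SQUASH_FILE IMAGE_URI LOCK_FILE
import_once() {
    local squash_file=$1 image_uri=$2 lock_file=$3
    (
        # Wait up to 600 s for an exclusive lock held on fd 9.
        flock -w 600 9 || { echo "timed out waiting for $lock_file" >&2; exit 1; }
        if is_valid_squash "$squash_file"; then
            # Another job already produced a readable image.
            echo "skipped"
        else
            # Remove any partial file left by an interrupted import.
            rm -f "$squash_file"
            do_import "$squash_file" "$image_uri" && echo "imported"
        fi
    ) 9> "$lock_file"
}
```

Because the validity check runs while the lock is held, a job that waited on the lock sees the finished image and skips the import instead of colliding with the existing file.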

functionstackx requested a review from a team May 18, 2026 16:27

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

functionstackx merged commit 3331249 into main May 18, 2026

functionstackx deleted the fix-mi355x-exclude-broken-nodes branch May 18, 2026 16:27

github-project-automation Bot moved this to Done in InferenceMAX Board May 18, 2026

claude Bot reviewed May 18, 2026


functionstackx mentioned this pull request May 18, 2026

[Klaud Cold] runners(h100-dgxc-slurm): flock the enroot import to fix concurrent-sweep race #1510

Merged

2 tasks
