
[Klaud Cold] runners(mi355x): exclude broken nodes mia1-p01-g09 + mia1-p01-g11 #1498

Merged
functionstackx merged 1 commit into main from fix-mi355x-exclude-broken-nodes
May 18, 2026
Conversation

@functionstackx
Collaborator

Summary

Adds --exclude=mia1-p01-g09,mia1-p01-g11 to the salloc in runners/launch_mi355x-amds.sh. Mirrors the same pattern as #1462 (mi300x, --nodelist=) and #1477 (mi325x, --exclude=chi-mi325x-pod1-121).

Root cause (from sweeps on #1431/#1432/#1440/#1441/#1443)

Every failure landed on one of two broken nodes:

| Node | Error |
| --- | --- |
| mia1-p01-g09 | pyxis: failed to create container filesystem (extended attributes not supported on the destination filesystem; pyxis can't mount the squashfs) |
| mia1-p01-g11 | permission denied while trying to connect to docker.sock (cluster-cleanup docker stop step fails, cascading into pyxis init) |

Both are already documented as known-bad in KLAUD_DEBUG.md §5.1 (g09: "persistently drained — pyxis broken") and §5.2 (g11: "docker socket perms"), but the launcher wasn't excluding them. Failures were 100% deterministic — the same image works cleanly on every other mi355x node (e.g. g16, g08, etc.).

Other actions taken

After this PR merges, the 5 affected mi355x PRs (#1431/#1432/#1440/#1441/#1443) will be rebased to pick up the launcher fix and re-trigger their sweeps on healthy nodes only.

Test plan

🤖 Generated with Claude Code

Root-caused via the failed sweeps on #1431, #1432, #1440, #1441,
#1443 — every failure landed on either:

  mia1-p01-g09  pyxis: failed to create container filesystem
                (extended attributes not supported on the destination
                filesystem; pyxis can't mount the squashfs)
  mia1-p01-g11  permission denied while trying to connect to docker.sock
                (cluster-cleanup `docker stop` step fails; cascading
                into pyxis-init failure)

Both are already known-bad per KLAUD_DEBUG.md §5.1 / §5.2, but the
launcher wasn't excluding them. This mirrors the existing pattern in
runners/launch_mi300x-amds.sh (#1462 — pin to known-good nodes) and
runners/launch_mi325x-amds.sh (#1477 — exclude chi-mi325x-pod1-121).

Once this lands the 5 affected mi355x PRs can be rebased to pick it up
and the failed jobs will land on healthy nodes only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx requested a review from a team May 18, 2026 16:27
@functionstackx functionstackx merged commit 3331249 into main May 18, 2026
@functionstackx functionstackx deleted the fix-mi355x-exclude-broken-nodes branch May 18, 2026 16:27
Contributor

@claude claude Bot left a comment


LGTM — small, mechanical exclude-list addition mirroring the same pattern used for mi300x/mi325x in #1462/#1477.

Extended reasoning...

Overview

This PR adds --exclude=mia1-p01-g09,mia1-p01-g11 to a single salloc invocation in runners/launch_mi355x-amds.sh. The change is one line of code plus an explanatory comment block referencing the KLAUD_DEBUG sections that root-cause each node's failure mode. No other logic is touched.

Security risks

None. This is a SLURM scheduling hint that narrows the pool of compute nodes a job can land on — no auth, crypto, permissions, or data-handling surface is affected. The change can only result in jobs being scheduled on fewer nodes, not more.

Level of scrutiny

Low. This is operational/runner config in CI-launcher territory, not production-critical inference code. The change is mechanical, follows an established pattern visible in recent commit history (48c3388 is the same change; sibling launchers got similar treatment via #1462 and #1477), and is well-documented inline. The blast radius of a mistake is at worst that jobs get scheduled on broken nodes (i.e. the current state).

Other factors

The bug hunter flagged a nit suggesting the exclude list should also cover mia1-p01-g12 and mia1-p01-g31 per KLAUD_DEBUG §5.2. The reporter explicitly classified this as a follow-up rather than a regression — the PR strictly improves the baseline. It is reasonable to land this change as-is and address the additional nodes in a separate PR if/when they actually start showing up in sweep failures. Approving.

Comment on lines +190 to +194
# Exclude known-bad mi355x compute nodes (KLAUD_DEBUG §5.1 / §5.2):
# mia1-p01-g09: pyxis broken (persistently fails to create container filesystem)
# mia1-p01-g11: docker.sock permissions denied (cluster-cleanup step fails)
# Both have been root-caused via #1431/#1432/#1440/#1441/#1443 sweep failures.
salloc --partition=$PARTITION --exclude=mia1-p01-g09,mia1-p01-g11 --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=500 --no-shell --job-name="$RUNNER_NAME"
Contributor


🟡 The new --exclude=mia1-p01-g09,mia1-p01-g11 only covers 1 of the 3 nodes that KLAUD_DEBUG.md §5.2 explicitly groups as sharing the docker.sock-permissions failure (mia1-p01-g11 / g12 / g31). §5.2 also states "Recipe-level workaround: none" — i.e. g12 and g31 are not drained at the SLURM level, so salloc can still land on them and the very next srun ... docker stop $(docker ps -a -q) (line 197) will hit the identical cascade this PR is trying to prevent. Consider extending to --exclude=mia1-p01-g[09,11,12,31] (or comma-separated equivalent).

Extended reasoning...

What the bug is

This PR adds --exclude=mia1-p01-g09,mia1-p01-g11 to the salloc on line 191 of runners/launch_mi355x-amds.sh, citing KLAUD_DEBUG §5.1 / §5.2 as the justification. However, §5.2 of that very file (lines 114-116) explicitly groups three nodes together as sharing the identical failure mode:

5.2 mia1-p01-g11 / g12 / g31 — docker socket perms

Symptom: mi355x jobs fail with permission denied while trying to connect to the docker API at unix:///var/run/docker.sock during the docker stop $(docker ps -a -q) cleanup step, cascading into SLURM job expiration.
Fix: ops needs to fix docker group / socket perms on these nodes. Recipe-level workaround: none.

The PR only excludes g11, leaving g12 and g31 reachable by SLURM with the documented identical defect.

Why existing code does not prevent this

The g19/g37 nodes from §5.1 don't need to be in --exclude because §5.1 says they are kept in State=DRAIN/DOWN by ops, so salloc won't allocate to them anyway. But §5.2 makes no such claim about g12/g31 — it explicitly states "Recipe-level workaround: none", meaning they are not drained at the SLURM level. The very fact that the PR had to add g09 to --exclude despite §5.1 calling it "persistently drained" demonstrates the drain state is unreliable in practice (consistent with §5.6 about DYNAMIC_NORM nodes auto-clearing DRAIN).

Impact / proof

Walk through the failure case:

  1. The runner script is invoked, e.g. via one of the affected PRs: #1431 (Update dsr1-fp4-mi355x-sglang SGLang image to v0.5.12-rocm700-mi35x), #1432 (Update dsr1-fp8-mi355x-sglang SGLang image to v0.5.12-rocm700-mi35x), #1440 (Update glm5-fp8-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517), #1441 ([Handoff to @Oseltamivir Claude /loop] Update glm5.1-fp4-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517), or #1443 (Update qwen3.5-bf16-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517).
  2. salloc --partition=compute --exclude=mia1-p01-g09,mia1-p01-g11 --gres=gpu:$TP ... is issued (line 191).
  3. SLURM picks an available node. Since g12 and g31 are not in --exclude and are not drained per §5.2, they remain in the allocation pool alongside healthy nodes.
  4. Suppose SLURM picks mia1-p01-g12.
  5. The next command runs: srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)" (line 197).
  6. On g12, docker ps -a -q fails with permission denied while trying to connect to the docker API at unix:///var/run/docker.sock — the exact symptom from §5.2.
  7. The cleanup step exits non-zero, cascading into SLURM job expiration — the identical failure the PR is trying to prevent for the g11 case.

The PR's empirical argument ("every failure landed on g09/g11 across 5 sweep PRs") is sampling luck on a ~12-node pool with several already drained. With only ~5 trials, observing g12/g31 0 times has a non-trivial probability even if the underlying defect is present, and §5.2 explicitly says it is present.
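The sampling-luck argument above can be made concrete with a rough calculation. This is only a sketch under strong assumptions that the thread does not state: each allocation picks one node uniformly and independently from the pool, and the full ~12-node pool is eligible (the comment notes several nodes are drained, which would shrink the pool and lower the result). The function name `miss_probability` is hypothetical.

```shell
#!/usr/bin/env bash
# Probability that k independent, uniform allocations from a pool of n nodes
# never land on any of `bad` specific nodes: ((n - bad) / n) ^ k.
# Assumption: uniform, independent picks; real SLURM scheduling is not uniform.
miss_probability() {
    awk -v n="$1" -v bad="$2" -v k="$3" \
        'BEGIN { printf "%.3f\n", ((n - bad) / n) ^ k }'
}

# ~12-node pool, 2 unexcluded suspect nodes (g12, g31), ~5 trials.
miss_probability 12 2 5   # prints 0.402
```

Under those assumptions, roughly two runs in five would show no g12/g31 failures even if both nodes were broken, which is consistent with the reviewer's point that five sweeps are weak evidence.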

How to fix

Extend the exclude list to cover all three §5.2 nodes:

salloc --partition=$PARTITION --exclude=mia1-p01-g[09,11,12,31] ...

or equivalently --exclude=mia1-p01-g09,mia1-p01-g11,mia1-p01-g12,mia1-p01-g31. The inline comment should be updated correspondingly. This is a follow-up improvement rather than a regression — the PR strictly improves the baseline, so it does not need to block on this, but the documented gap is worth closing in the same change.
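One way to keep the exclude list and its inline comment in sync is to hold the node names in an array and join them when building the `salloc` arguments. The following is a minimal sketch, not the launcher's actual code: the names `BAD_NODES`, `join_by_comma`, `EXCLUDE_LIST`, and `SALLOC_ARGS` are hypothetical, and the fallback values for `PARTITION`, `TP`, and `RUNNER_NAME` are placeholders so the snippet runs on its own. It only prints the command instead of calling `salloc`.

```shell
#!/usr/bin/env bash
# Known-bad mi355x nodes, as listed in the review comment
# (g09 per KLAUD_DEBUG §5.1; g11, g12, g31 per §5.2).
BAD_NODES=(mia1-p01-g09 mia1-p01-g11 mia1-p01-g12 mia1-p01-g31)

# Join all arguments with commas (hypothetical helper).
join_by_comma() {
    local IFS=,
    echo "$*"
}

EXCLUDE_LIST=$(join_by_comma "${BAD_NODES[@]}")

# Placeholder defaults are assumptions for this example only.
SALLOC_ARGS=(
    --partition="${PARTITION:-compute}"
    --exclude="$EXCLUDE_LIST"
    --gres="gpu:${TP:-8}"
    --exclusive --cpus-per-task=128 --time=500 --no-shell
    --job-name="${RUNNER_NAME:-example-runner}"
)

# Dry run: show the command that would be issued.
echo "salloc ${SALLOC_ARGS[*]}"
```

Adding or removing a node then becomes a one-element change to the array, and the comma-separated form avoids relying on hostlist bracket expansion.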

functionstackx added a commit that referenced this pull request May 18, 2026
…weep race (#1510)

The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE`
without any locking, so when multiple sweep jobs landed on the cluster
simultaneously they all tried to import the same image into the shared
NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. First one wins; the
rest crash with `[ERROR] File already exists: ...sqsh` and
`OSError: [Errno 116] Stale file handle` (from the partial sqsh) once
sglang/vllm tries to start.

Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs
failed, all hitting the same race on h100-dgxc-slurm_0 + _1. Failure
rate scales with sweep concurrency — was masked previously because
older H100 recipes had fewer matrix points sharing the cluster.

Switches to the canonical `flock -w 600 + unsquashfs -l skip-if-valid +
enroot import` pattern already used in launch_h100-cw.sh, plus the
mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior
change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
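The lock-then-skip-if-valid pattern named in the commit message above might look roughly like the sketch below. It is not the code from any of the launchers: the function name `import_image_once`, the `.lock` file next to the squashfs file, the `rm -f` of a partial file, and the `VALIDATE_CMD` / `IMPORT_CMD` override hooks are all assumptions made for illustration. The defaults use `flock -w 600`, `unsquashfs -l`, and `enroot import -o`, the three commands the message mentions.

```shell
#!/usr/bin/env bash
# Import a container image into a shared squashfs path at most once,
# even when several sweep jobs start at the same time.
# $1 = target .sqsh path, $2 = image reference (format is an assumption).
import_image_once() {
    local squash_file=$1 image=$2
    local lock_file="${squash_file}.lock"
    (
        # Wait up to 600 s for an exclusive lock held on file descriptor 9.
        flock -w 600 9 || { echo "timed out waiting for ${lock_file}" >&2; exit 1; }

        # Skip the import if an earlier job already produced a readable image.
        if ${VALIDATE_CMD:-unsquashfs -l} "$squash_file" >/dev/null 2>&1; then
            exit 0
        fi

        # Drop any partial file so the import does not hit "File already exists".
        rm -f "$squash_file"
        ${IMPORT_CMD:-enroot import -o} "$squash_file" "docker://${image}"
    ) 9>"$lock_file"
}
```

Because the validity check runs while the lock is held, the first job imports the image and every later job finds a complete file and returns without touching it. Whether file locks behave reliably on the shared NFS/Lustre path depends on how that filesystem is mounted, which the thread does not say.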