[Klaud Cold] runners(mi300x): pin salloc to known-good nodes#1462

Merged: functionstackx merged 1 commit into main from add-mi300x-nodelist-filter on May 18, 2026

Conversation

@functionstackx
Collaborator

Summary

Test plan

  • Subsequent mi300x sweeps land only on the 6 healthy nodes.
  • No `srun: Node failure` or `pyxis: ... No space left on device` errors after this lands.

🤖 Generated with Claude Code

Three of the nine mi300x compute nodes are currently unusable:
  - chi-mi300x-033, chi-mi300x-037: down (Not responding)
  - chi-mi300x-049: drained for persistent /nvme_home disk-full
    (kept down by a watchdog re-applying State=DOWN every 10s)

Without a nodelist filter, salloc sometimes lands a job on a node
that's about to be drained or that has a half-extracted enroot dir,
causing 'pyxis: failed to create container filesystem (No space left
on device)' / 'srun: Node failure' / 'manifest unknown'-style errors
visible in PRs #1426 and #1403.

Add an explicit --nodelist of the 6 healthy nodes (mirroring how
runners/launch_b300-nv.sh:336 pins to the known-good B300 set).
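
As an illustration of the pinning described above, here is a minimal sketch of how a launcher might assemble the `salloc` call. The PR diff is not reproduced on this page, so the variable name, the helper function, and the bracketed hostlist spelling are assumptions; the node numbers come from the review summary (034-036, 054, 057-058). The helper prints the command instead of running it, so it is safe to try outside the cluster.

```bash
#!/usr/bin/env bash
# Sketch only: MI300X_GOOD_NODES and mi300x_salloc_cmd are hypothetical names,
# not taken from runners/launch_mi300x-amds.sh.

# Known-good nodes, written as a Slurm hostlist expression (assumed spelling).
MI300X_GOOD_NODES="chi-mi300x-[034-036,054,057-058]"

# Print the salloc command line with the nodelist filter applied.
# Any extra arguments stand in for the options the launcher already passes.
mi300x_salloc_cmd() {
  local cmd=(salloc "--nodelist=${MI300X_GOOD_NODES}" "$@")
  echo "${cmd[*]}"
}
```

For example, `mi300x_salloc_cmd --nodes=1` prints the command with the filter in front of the remaining options. The trade-off noted in the review applies: the allow-list has to be edited by hand whenever node health changes.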
functionstackx requested a review from a team May 18, 2026 01:24
functionstackx merged commit 4da367c into main May 18, 2026
4 checks passed
functionstackx deleted the add-mi300x-nodelist-filter branch May 18, 2026 01:24
Contributor

@claude claude Bot left a comment

LGTM — small ops fix pinning salloc to known-good mi300x nodes, mirroring the existing pattern in launch_b300-nv.sh.

Extended reasoning...

Overview

This PR adds a --nodelist constraint to the salloc invocation in runners/launch_mi300x-amds.sh, restricting allocations to six known-good mi300x nodes (034-036, 054, 057-058) and excluding three currently-unusable nodes (033, 037 down; 049 drained for disk-full). The pattern mirrors an existing usage in runners/launch_b300-nv.sh.

Security risks

None. This is a runner orchestration script for CI benchmark jobs; the change is a Slurm scheduling constraint with no security-sensitive surface (no auth, crypto, permissions, or user-controllable input affected).

Level of scrutiny

Low. This is a one-line operational hotfix to an internal runner script. The change is mechanical, the rationale is documented inline as a comment, and the impact is bounded: at worst, salloc will fail to allocate if all listed nodes are busy, which is the same failure surface as today when a doomed node is picked.

Other factors

  • No bugs were reported by the bug hunting system.
  • The hardcoded nodelist is a known tradeoff for a hotfix — future node health changes will require another PR — but that maintenance pattern is already in use elsewhere in this repo.
  • No prior reviewer comments to address.

functionstackx added a commit that referenced this pull request May 18, 2026
Root-caused via the failed sweeps on #1467, #1468, #1469 (all three
[Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every
failure landed on chi-mi325x-pod1-121 with

  enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted

before the .sqsh import even completes; subsequent pyxis mount then
fails with "No such file or directory". The same image works cleanly
on every other up node (017/018/019/020/027) — confirmed not OOM and
not a recipe issue.

This matches the existing pattern for mi300x in #1462 (pin salloc away
from chronically-bad nodes); for mi325x there's currently only the one
node to exclude, so use --exclude rather than --nodelist so we don't
have to maintain the allow-list as nodes come and go.

pod1-121 has separately been drained on the controller with a watchdog
(per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix
the underlying setcap regression.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
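
The deny-list approach in the commit above can be sketched as follows. This is an illustrative assumption about how a launcher could build the flag, not the actual contents of runners/launch_mi325x-amds.sh; the array and function names are hypothetical, and only the node name and the `--exclude` option come from the commit message.

```bash
#!/usr/bin/env bash
# Sketch only: MI325X_BAD_NODES, join_by_comma, and mi325x_exclude_flag are
# hypothetical names used for illustration.

# Nodes to keep jobs away from; append here if another node goes bad.
MI325X_BAD_NODES=(chi-mi325x-pod1-121)

# Join all arguments with commas, the separator salloc expects in node lists.
join_by_comma() {
  local IFS=,
  echo "$*"
}

# Print the --exclude flag for the current deny-list.
mi325x_exclude_flag() {
  echo "--exclude=$(join_by_comma "${MI325X_BAD_NODES[@]}")"
}
```

A launcher could then pass `"$(mi325x_exclude_flag)"` to `salloc` alongside its existing options. Unlike an allow-list, this leaves newly added nodes eligible without a further change.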
functionstackx added a commit that referenced this pull request May 18, 2026
)

Root-caused via the failed sweeps on #1431, #1432, #1440, #1441,
#1443 — every failure landed on either:

  mia1-p01-g09  pyxis: failed to create container filesystem
                (extended attributes not supported on the destination
                filesystem; pyxis can't mount the squashfs)
  mia1-p01-g11  permission denied while trying to connect to docker.sock
                (cluster-cleanup `docker stop` step fails; cascading
                into pyxis-init failure)

Both are already known-bad per KLAUD_DEBUG.md §5.1 / §5.2, but the
launcher wasn't excluding them. This mirrors the existing pattern in
runners/launch_mi300x-amds.sh (#1462 — pin to known-good nodes) and
runners/launch_mi325x-amds.sh (#1477 — exclude chi-mi325x-pod1-121).

Once this lands the 5 affected mi355x PRs can be rebased to pick it up
and the failed jobs will land on healthy nodes only.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
functionstackx added a commit that referenced this pull request May 18, 2026
…weep race (#1510)

The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE`
without any locking, so when multiple sweep jobs landed on the cluster
simultaneously they all tried to import the same image into the shared
NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. First one wins; the
rest crash with `[ERROR] File already exists: ...sqsh` and
`OSError: [Errno 116] Stale file handle` (from the partial sqsh) once
sglang/vllm tries to start.

Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs
failed, all hitting the same race on h100-dgxc-slurm_0 + _1. Failure
rate scales with sweep concurrency — was masked previously because
older H100 recipes had fewer matrix points sharing the cluster.

Switches to the canonical `flock -w 600` + `unsquashfs -l` skip-if-valid +
`enroot import` pattern already used in launch_h100-cw.sh, plus the
mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior
change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
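
The lock, validate, import sequence named in the commit above might look roughly like the sketch below. It is not copied from launch_h100-cw.sh: the function name, the lock-file location, and the log messages are assumptions, and the `srun` wrapper from the original launcher is left out for brevity. Whether `flock` is honored across nodes also depends on how the shared filesystem is mounted, which the source does not describe.

```bash
#!/usr/bin/env bash
# Sketch only: import_squash_once and the ".lock" suffix are hypothetical.

# import_squash_once IMAGE_URI SQUASH_FILE
import_squash_once() {
  local image="$1" squash_file="$2"
  local lock_file="${squash_file}.lock"   # assumed lock path next to the image

  (
    # Wait up to 600 s for an exclusive lock on fd 9; give up if it never frees.
    flock -w 600 9 || { echo "timed out waiting for ${lock_file}" >&2; exit 1; }

    # Skip the import when an earlier job already left a listable squashfs.
    if unsquashfs -l "$squash_file" >/dev/null 2>&1; then
      echo "reusing ${squash_file}" >&2
    else
      rm -f "$squash_file"                # drop any partial file from a failed import
      enroot import -o "$squash_file" "$image"
    fi
  ) 9>"$lock_file"
}
```

With this shape, the first job to take the lock performs the import and later jobs find a valid file and reuse it, which avoids both the `File already exists` error and readers picking up a half-written `.sqsh`.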