[Klaud Cold] mi300x runner: switch --nodelist pin to --exclude -049 #1532
The previous --nodelist pin only allowed allocations on a hand-picked subset (034-036, 054, 057-058) and blocked nodes that have since recovered (033, 035, 037). Switch to --exclude=chi-mi300x-049 so Slurm can pick any healthy node; -049 stays banned (persistent /nvme_home disk-full).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LGTM — straightforward one-line Slurm allocation tweak that swaps a node pin for an exclude.
Overview
This PR makes a single-line change to runners/launch_mi300x-amds.sh: it replaces --nodelist=chi-mi300x-[034-036,054,057-058]... with --exclude=chi-mi300x-049... in the salloc invocation, and updates the adjacent comment to match. The motivation is operational: only -049 is permanently bad (/nvme_home disk-full, held State=DOWN by a watchdog), while -033/-035/-037 are transient and should be allowed back into the pool when they recover.
Security risks
None. This is a Slurm allocation flag in a runner launch script — no auth, crypto, permissions, or user-input handling is touched. The change widens the candidate node set within an already-trusted internal cluster partition.
Level of scrutiny
Very low. This is a runner/ops config tweak, not a logic or data-path change. The exclude semantics are well-documented in Slurm and the rest of the script (squash import, srun, scancel) is untouched. If the swap is wrong, the failure mode is loud (salloc fails or a job lands on a bad node) and trivially revertable.
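As a rough illustration of the exclude semantics mentioned above, the sketch below builds the `--exclude` value from a list of banned node suffixes, so that banning a second node later would be a one-word change. The helper name `build_exclude_flag` and the `NODE_PREFIX`/`NODE_DOMAIN` variables are hypothetical and not part of this PR; only the hostname shape is taken from the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): build a Slurm --exclude flag from
# short node suffixes. The prefix and domain mirror the hostname in the diff.
NODE_PREFIX="chi-mi300x-"
NODE_DOMAIN=".ord.vultr.cpe.ice.amd.com"

build_exclude_flag() {
  # With no banned nodes, emit nothing so the caller adds no flag at all.
  [ "$#" -eq 0 ] && return 0
  local hosts=() suffix
  for suffix in "$@"; do
    hosts+=("${NODE_PREFIX}${suffix}${NODE_DOMAIN}")
  done
  # Join with commas, the separator Slurm uses in node lists.
  local IFS=,
  printf -- '--exclude=%s\n' "${hosts[*]}"
}

# Prints the flag used in this PR for the single banned node.
build_exclude_flag 049
```

The script in the PR hard-codes the single hostname, which is arguably simpler while only one node is banned; a helper like this would only start to pay off if the list grows.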
Other factors
The only finding from the bug hunting pass is a documentation-drift nit in KLAUD_DEBUG.md §5.3, which is already attached as an inline comment for the author to consider. That's an internal playbook update, not a runtime defect, so it doesn't block approval of the underlying behavior change.
# Exclude known-bad nodes; let Slurm pick from anything else:
# chi-mi300x-049: drained (persistent /nvme_home disk-full)
JOB_ID=$(salloc --partition=$PARTITION --exclude=chi-mi300x-049.ord.vultr.cpe.ice.amd.com --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell --job-name="$RUNNER_NAME" 2>&1 | tee /dev/stderr | grep -oP 'Granted job allocation \K[0-9]+')
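For readers unfamiliar with the `grep -oP '... \K[0-9]+'` idiom in the salloc line, here is a minimal sketch of the job ID extraction run against a sample message instead of a live allocation. The function name `extract_job_id` and the sample job ID are made up for illustration; like the original line, it assumes a grep with PCRE support (`-P`).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the same grep used in the runner script.
# \K discards everything matched so far, so -o prints only the digits.
extract_job_id() {
  grep -oP 'Granted job allocation \K[0-9]+'
}

# Sample line in the shape salloc prints on success; the ID is invented.
sample='salloc: Granted job allocation 1234567'
JOB_ID=$(printf '%s\n' "$sample" | extract_job_id)
echo "$JOB_ID"

# If salloc fails, the pattern never matches and JOB_ID is empty,
# so a caller may want to check for that before using it.
if [ -z "$JOB_ID" ]; then
  echo "no allocation granted" >&2
fi
```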
🟡 Documentation drift: KLAUD_DEBUG.md §5.3 (line 121) still describes runners/launch_mi300x-amds.sh as pinning salloc to chi-mi300x-[034-036,054,057-058] via --nodelist, but after this PR the script uses --exclude=chi-mi300x-049 only. Consider updating that paragraph in the same PR so the operator playbook matches the new exclude-only strategy.
What the drift is
KLAUD_DEBUG.md §5.3 ("chi-mi300x-049 — /nvme_home disk-full") at line 121 currently reads:
Fix already landed: runners/launch_mi300x-amds.sh now pins salloc to only known-good mi300x nodes (chi-mi300x-[034-036,054,057-058]) — see PR #1462.
This PR replaces the --nodelist=chi-mi300x-[034-036,054,057-058]... pin with --exclude=chi-mi300x-049.... Two specific claims in §5.3 become factually wrong on merge:
- The script no longer "pins salloc to only known-good mi300x nodes"; it allows Slurm to pick any node except -049.
- The enumerated set chi-mi300x-[034-036,054,057-058] is no longer the allowed set; per the PR description, -033/-035/-037 may rejoin at any time and would be eligible.
Why this matters
The file's own intro (lines 3–5) says it is a "running playbook" that should be read first when debugging, with an explicit instruction: "When you fix something not yet listed, add it here so the next session doesn't re-learn it." So §5.3 is a maintained operator-facing statement of current state, not historical context — the present-tense "now pins" framing is a live claim. The PR is itself a one-line behavior change motivated by exactly the -049 issue documented in §5.3, so updating that paragraph in this same PR keeps the playbook coherent.
Step-by-step proof
- Read KLAUD_DEBUG.md line 121: it says launch_mi300x-amds.sh "now pins salloc to only known-good mi300x nodes (chi-mi300x-[034-036,054,057-058])".
- Read the post-PR contents of runners/launch_mi300x-amds.sh line 14: the salloc invocation uses --exclude=chi-mi300x-049.ord.vultr.cpe.ice.amd.com and no --nodelist flag at all.
- Therefore, after merge, the playbook's claim about pinning to that specific set is false: the script does not pin to any set, and -033/-035/-037 (currently down*) would be allocated automatically if they come back up, which the PR description explicitly anticipates.
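The proof above could in principle be automated. The following is only a sketch of such a check, assuming the playbook keeps using the phrase "pins salloc" for the old behavior; the function name `check_pin_drift` is hypothetical, and the real file paths would be passed in by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical drift check (not part of the PR).
# Returns 0 when script and doc agree, 1 when the doc still claims a pin
# that the script no longer has.
check_pin_drift() {
  local script=$1 doc=$2
  # Script still pins nodes: a "pins salloc" claim in the doc is accurate.
  if grep -q -- '--nodelist=' "$script"; then
    return 0
  fi
  # Script no longer pins, but the doc still says it does: drift.
  if grep -q 'pins salloc' "$doc"; then
    echo "drift: $doc still describes a --nodelist pin" >&2
    return 1
  fi
  return 0
}
```

It might be invoked as `check_pin_drift runners/launch_mi300x-amds.sh KLAUD_DEBUG.md`. Matching on a fixed phrase is brittle, so this is better read as an illustration of the three steps than as a proposed CI gate.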
How to fix
A one-line edit in the same PR. Suggested rewording for §5.3:
Fix already landed: runners/launch_mi300x-amds.sh now excludes chi-mi300x-049 from salloc and lets Slurm pick any other healthy mi300x node; see PR #1462 (original pin) and PR #1532 (switch to exclude-only). chi-mi300x-049 is held in State=DOWN by a watchdog on the controller (/home/gharunner/_audit/drain_049_watchdog.sh)…
The watchdog/drain explanation in the rest of the paragraph remains accurate and can stay as-is.
Severity
nit — internal-debug-doc drift, not a runtime defect. Non-blocking, but a near-zero-cost fix that keeps the playbook honest about the very script this PR is changing.
Summary
- --nodelist=chi-mi300x-[034-036,054,057-058]... pin in runners/launch_mi300x-amds.sh and replaces it with --exclude=chi-mi300x-049.ord.vultr.cpe.ice.amd.com.
- down* but can come back at any time). Only -049 is permanently banned (persistent /nvme_home disk-full).
Test plan
- salloc succeeds on a node outside the old pin set.
🤖 Generated with Claude Code
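One way the test-plan item could be checked is sketched below: given the hostname a job landed on, decide whether it falls outside the old pin set chi-mi300x-[034-036,054,057-058]. The function name `is_outside_old_pin` is hypothetical, and how the hostname is obtained is an assumption (for example `squeue -j "$JOB_ID" -h -o %N` on a live allocation).

```shell
#!/usr/bin/env bash
# Hypothetical check for the test plan (not part of the PR).
# Returns 0 if the node is NOT in the old --nodelist pin set.
is_outside_old_pin() {
  local node=${1%%.*}       # drop the domain suffix, if present
  local suffix=${node##*-}  # keep the trailing number, e.g. 033
  case "$suffix" in
    034|035|036|054|057|058) return 1 ;;  # member of the old pin set
    *) return 0 ;;
  esac
}

# Example with a node name taken from the PR description.
if is_outside_old_pin "chi-mi300x-033.ord.vultr.cpe.ice.amd.com"; then
  echo "outside old pin set"
fi
```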