[Klaud Cold] mi300x runner: switch --nodelist pin to --exclude -049 #1532

Merged
functionstackx merged 1 commit into main from klaud-cold/mi300x-runner-exclude-only
May 20, 2026

Conversation

@functionstackx
Collaborator

Summary

  • Removes the hardcoded --nodelist=chi-mi300x-[034-036,054,057-058]... pin in runners/launch_mi300x-amds.sh and replaces it with --exclude=chi-mi300x-049.ord.vultr.cpe.ice.amd.com.
  • Lets Slurm allocate on any healthy mi300x node (-033/-035/-037 are currently down* but can come back at any time). Only -049 is permanently banned (persistent /nvme_home disk-full).
  • No infra changes required; effective on next sweep run.
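
A minimal sketch of the exclude-only pattern described above, assuming the banned nodes are kept in a hypothetical `BANNED_NODES` array and joined by a hypothetical `build_exclude_flag` helper. Neither name exists in runners/launch_mi300x-amds.sh, which passes the flag inline. Slurm's `--exclude` accepts a comma-separated node list, so banning another node would mean adding one array entry.

```shell
#!/usr/bin/env bash
# Hypothetical helper names; the real runner script hardcodes the flag inline.
BANNED_NODES=(
  "chi-mi300x-049.ord.vultr.cpe.ice.amd.com"  # persistent /nvme_home disk-full
)

# Join all arguments with commas into a single --exclude=... flag.
# Prints nothing when no nodes are given, so the flag can be omitted entirely.
build_exclude_flag() {
  if [ "$#" -eq 0 ]; then
    return 0
  fi
  local IFS=,
  # "$*" joins the arguments with the first character of IFS (a comma).
  printf -- '--exclude=%s' "$*"
}

EXCLUDE_FLAG=$(build_exclude_flag "${BANNED_NODES[@]}")
echo "$EXCLUDE_FLAG"
```

The resulting string would be passed to `salloc` in place of the old `--nodelist=...` pin; every node not named in it stays eligible as soon as Slurm considers it healthy.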

Test plan

🤖 Generated with Claude Code

The previous --nodelist pin only allowed allocations on a hand-picked subset
(034-036, 054, 057-058) and blocked nodes that have since recovered (033,
035, 037). Switch to --exclude=chi-mi300x-049 so Slurm can pick any healthy
node; -049 stays banned (persistent /nvme_home disk-full).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx requested a review from a team May 20, 2026 19:39
@functionstackx functionstackx enabled auto-merge (squash) May 20, 2026 19:39
@functionstackx functionstackx merged commit 1c4c6bd into main May 20, 2026
4 checks passed
@functionstackx functionstackx deleted the klaud-cold/mi300x-runner-exclude-only branch May 20, 2026 19:39
Contributor

@claude claude Bot left a comment

LGTM — straightforward one-line Slurm allocation tweak that swaps a node pin for an exclude.

Extended reasoning...

Overview

This PR makes a single-line change to runners/launch_mi300x-amds.sh: it replaces --nodelist=chi-mi300x-[034-036,054,057-058]... with --exclude=chi-mi300x-049... in the salloc invocation, and updates the adjacent comment to match. The motivation is operational: only -049 is permanently bad (/nvme_home disk-full, held State=DOWN by a watchdog), while -033/-035/-037 are transient and should be allowed back into the pool when they recover.

Security risks

None. This is a Slurm allocation flag in a runner launch script — no auth, crypto, permissions, or user-input handling is touched. The change widens the candidate node set within an already-trusted internal cluster partition.

Level of scrutiny

Very low. This is a runner/ops config tweak, not a logic or data-path change. The exclude semantics are well-documented in Slurm and the rest of the script (squash import, srun, scancel) is untouched. If the swap is wrong, the failure mode is loud (salloc fails or a job lands on a bad node) and trivially revertable.
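
To make the "salloc fails" case explicit rather than relying on the log output, a launch script could check that a job ID was actually parsed. The sketch below is an assumption about how such a guard might look, not something the PR adds. `extract_job_id` is a hypothetical helper, and the sample strings stand in for real `salloc` output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read salloc output on stdin and print the job ID from
# the "Granted job allocation <id>" line. Returns 1 if no such line appears.
extract_job_id() {
  local line
  while IFS= read -r line; do
    if [[ $line =~ Granted\ job\ allocation\ ([0-9]+) ]]; then
      printf '%s\n' "${BASH_REMATCH[1]}"
      return 0
    fi
  done
  return 1
}

# Sample text standing in for real salloc output (illustrative only).
SAMPLE_OK='salloc: Pending job allocation 4242
salloc: Granted job allocation 4242'
SAMPLE_FAIL='salloc: error: Job submit/allocate failed: Requested node configuration is not available'

JOB_ID=$(extract_job_id <<<"$SAMPLE_OK") || JOB_ID=""
FAILED_ID=$(extract_job_id <<<"$SAMPLE_FAIL") || FAILED_ID=""

if [ -z "$FAILED_ID" ]; then
  # A real launch script would exit non-zero here instead of continuing to srun.
  echo "no allocation granted" >&2
fi
echo "parsed job id: $JOB_ID"
```

An empty job ID is the signal to stop before `srun` or `scancel` run with a blank argument.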

Other factors

The only finding from the bug hunting pass is a documentation-drift nit in KLAUD_DEBUG.md §5.3, which is already attached as an inline comment for the author to consider. That's an internal playbook update, not a runtime defect, so it doesn't block approval of the underlying behavior change.

Comment on lines +12 to +14
# Exclude known-bad nodes; let Slurm pick from anything else:
# chi-mi300x-049: drained (persistent /nvme_home disk-full)
JOB_ID=$(salloc --partition=$PARTITION --exclude=chi-mi300x-049.ord.vultr.cpe.ice.amd.com --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell --job-name="$RUNNER_NAME" 2>&1 | tee /dev/stderr | grep -oP 'Granted job allocation \K[0-9]+')
Contributor

🟡 Documentation drift: KLAUD_DEBUG.md §5.3 (line 121) still describes runners/launch_mi300x-amds.sh as pinning salloc to chi-mi300x-[034-036,054,057-058] via --nodelist, but after this PR the script uses --exclude=chi-mi300x-049 only. Consider updating that paragraph in the same PR so the operator playbook matches the new exclude-only strategy.

Extended reasoning...

What the drift is

KLAUD_DEBUG.md §5.3 ("chi-mi300x-049 /nvme_home disk-full") at line 121 currently reads:

Fix already landed: runners/launch_mi300x-amds.sh now pins salloc to only known-good mi300x nodes (chi-mi300x-[034-036,054,057-058]) — see PR #1462.

This PR replaces the --nodelist=chi-mi300x-[034-036,054,057-058]... pin with --exclude=chi-mi300x-049.... Two specific claims in §5.3 become factually wrong on merge:

  1. The script no longer "pins salloc to only known-good mi300x nodes" — it allows Slurm to pick any node except -049.
  2. The enumerated set chi-mi300x-[034-036,054,057-058] is no longer the allowed set; per the PR description, -033/-035/-037 may rejoin at any time and would be eligible.

Why this matters

The file's own intro (lines 3–5) says it is a "running playbook" that should be read first when debugging, with an explicit instruction: "When you fix something not yet listed, add it here so the next session doesn't re-learn it." So §5.3 is a maintained operator-facing statement of current state, not historical context — the present-tense "now pins" framing is a live claim. The PR is itself a one-line behavior change motivated by exactly the -049 issue documented in §5.3, so updating that paragraph in this same PR keeps the playbook coherent.

Step-by-step proof

  1. Read KLAUD_DEBUG.md line 121: it says launch_mi300x-amds.sh "now pins salloc to only known-good mi300x nodes (chi-mi300x-[034-036,054,057-058])".
  2. Read the post-PR contents of runners/launch_mi300x-amds.sh line 14: the salloc invocation uses --exclude=chi-mi300x-049.ord.vultr.cpe.ice.amd.com and no --nodelist flag at all.
  3. Therefore, after merge, the playbook's claim about pinning to that specific set is false: the script does not pin to any set, and -033/-035/-037 (currently down*) would be allocated automatically if they come back up — which the PR description explicitly anticipates.
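
The comparison in steps 1 and 2 can be sketched as a small check that reports which node-selection flag a launch script passes to `salloc`. `node_selection_mode` is a hypothetical helper, and the temporary file holds an abbreviated stand-in for the real `salloc` line; nothing here is part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical check: classify a script by the node-selection flag it contains.
node_selection_mode() {
  local script=$1
  if grep -q -- '--nodelist=' "$script"; then
    echo "nodelist"
  elif grep -q -- '--exclude=' "$script"; then
    echo "exclude"
  else
    echo "none"
  fi
}

# Stand-in file with an abbreviated salloc line; the real script is
# runners/launch_mi300x-amds.sh.
TMP_SCRIPT=$(mktemp)
printf '%s\n' 'JOB_ID=$(salloc --partition=$PARTITION --exclude=chi-mi300x-049.ord.vultr.cpe.ice.amd.com --no-shell)' > "$TMP_SCRIPT"

MODE=$(node_selection_mode "$TMP_SCRIPT")
rm -f "$TMP_SCRIPT"
echo "$MODE"
```

A result of `exclude` for the post-PR script contradicts the playbook's "now pins salloc" wording, which is the drift this comment flags.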

How to fix

A one-line edit in the same PR. Suggested rewording for §5.3:

Fix already landed: runners/launch_mi300x-amds.sh now excludes chi-mi300x-049 from salloc and lets Slurm pick any other healthy mi300x node — see PR #1462 (original pin) and PR #1532 (switch to exclude-only). chi-mi300x-049 is held in State=DOWN by a watchdog on the controller (/home/gharunner/_audit/drain_049_watchdog.sh)…

The watchdog/drain explanation in the rest of the paragraph remains accurate and can stay as-is.

Severity

nit — internal-debug-doc drift, not a runtime defect. Non-blocking, but a near-zero-cost fix that keeps the playbook honest about the very script this PR is changing.
