[Klaud Cold] runners(h100-dgxc-slurm): flock the enroot import to fix concurrent-sweep race #1510

Merged
functionstackx merged 1 commit into main from fix-h100-dgxc-slurm-enroot-race
May 18, 2026
Conversation

@functionstackx
Collaborator

Summary

Adds the canonical flock + unsquashfs -l skip-if-valid + enroot import pattern to runners/launch_h100-dgxc-slurm.sh. This matches what launch_h100-cw.sh and the mi3xx launchers (#1462/#1477/#1498) already do.

Why

PR #1509 (qwen3.5-fp8-h100-sglang new recipe) failed 13/30 matrix jobs on the dgxc-slurm runners. All 13 hit:

[ERROR] File already exists: /mnt/nfs/lustre/containers/lmsysorg_sglang_v0.5.12-cu130.sqsh
srun: error: hpc-gpu-1-12: task 0: Exited with exit code 1

…then sglang followed with OSError: [Errno 116] Stale file handle from the partial sqsh once the next jobs tried to mount it.

Root cause: concurrent matrix jobs each run srun enroot import -o <shared NFS path> without a lock. The first one wins; the rest crash. Because every job writes to the same shared NFS directory (/mnt/nfs/lustre/containers/), the race becomes all but certain at high sweep concurrency. The other h100 launcher variant (launch_h100-cw.sh) already had this fix; the dgxc-slurm variant did not.
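The effect of serializing the writers can be reproduced without enroot or Slurm. The following is a minimal sketch, not the launcher's code: `WORK_DIR`, `ARTIFACT`, `BUILD_LOG`, and the `worker` function are hypothetical stand-ins, a local temp directory replaces the shared NFS path, and a plain non-empty-file check replaces the `unsquashfs -l` validity check.

```shell
#!/usr/bin/env bash
# Sketch: five concurrent workers race to create one shared artifact.
# With flock, exactly one builds it; the others wait, then skip.

WORK_DIR="$(mktemp -d)"            # hypothetical stand-in for the shared NFS directory
ARTIFACT="$WORK_DIR/image.sqsh"    # hypothetical stand-in for $SQUASH_FILE
LOCK_FILE="$ARTIFACT.lock"         # hypothetical lock path
BUILD_LOG="$WORK_DIR/builds.log"   # records which worker performed the build

worker() {
    (
        # Open the lock file on fd 9; the lock is released when the subshell exits.
        exec 9>"$LOCK_FILE"
        flock -w 60 9 || { echo "worker $1: lock timeout" >&2; exit 1; }
        if [ -s "$ARTIFACT" ]; then
            echo "worker $1: artifact present, skipping"
        else
            sleep 0.2                          # simulate a slow import
            echo "payload" > "$ARTIFACT"
            echo "built by worker $1" >> "$BUILD_LOG"
        fi
    )
}

for i in 1 2 3 4 5; do
    worker "$i" &
done
wait

BUILD_COUNT="$(wc -l < "$BUILD_LOG" | tr -d ' ')"
echo "builds performed: $BUILD_COUNT"
```

Removing the `exec 9>` and `flock` lines lets several workers pass the existence check before any of them finishes writing, which is the same shape of failure the matrix jobs hit.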

Diff

Wraps the import in:

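# Open the lock file on fd 9 and wait up to 600 s for an exclusive lock
exec 9>"$LOCK_FILE"
flock -w 600 9 || { echo "Failed to acquire lock for $SQUASH_FILE"; exit 1; }
# Listing the archive succeeds only if it is a readable squashfs image
if unsquashfs -l "$SQUASH_FILE" > /dev/null 2>&1; then
    echo 'Squash file already exists and is valid, skipping import'
else
    # Remove any partial or corrupt file before importing again
    rm -f "$SQUASH_FILE"
    enroot import -o "$SQUASH_FILE" "docker://$IMAGE"
fi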

No other behavior change. The --exclude=$SLURM_EXCLUDED_NODELIST, salloc args, and downstream srun stay the same.
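The snippet above assumes `SQUASH_FILE` and `LOCK_FILE` are already set. As a rough sketch of one way they could be derived: the image reference, the `tr`-based sanitizer, the `SQUASH_DIR` and `SQUASH_NAME` names, and the `.lock` suffix below are all assumptions for illustration, chosen only so that the result matches the path seen in the error log. The launcher's actual derivation may differ.

```shell
# Hypothetical inputs; the real launcher sets these elsewhere.
IMAGE="lmsysorg/sglang:v0.5.12-cu130"      # assumed image reference
SQUASH_DIR="/mnt/nfs/lustre/containers"    # shared directory from the error log

# Assumed sanitizer: replace '/' and ':' with '_' so the image name is a valid file name.
SQUASH_NAME="$(printf '%s' "$IMAGE" | tr '/:' '__')"
SQUASH_FILE="$SQUASH_DIR/$SQUASH_NAME.sqsh"

# Assumed convention: the lock file sits next to the squash file.
LOCK_FILE="$SQUASH_FILE.lock"

echo "$SQUASH_FILE"
echo "$LOCK_FILE"
```

Keeping the lock next to the squash file means every job that targets the same image also contends on the same lock, while jobs importing different images do not block each other.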

Test plan

🤖 Generated with Claude Code

…weep race

The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE`
without any locking, so when multiple sweep jobs landed on the cluster
simultaneously they all tried to import the same image into the shared
NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. First one wins; the
rest crash with `[ERROR] File already exists: ...sqsh` and
`OSError: [Errno 116] Stale file handle` (from the partial sqsh) once
sglang/vllm tries to start.

Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs
failed, all hitting the same race on h100-dgxc-slurm_0 + _1. Failure
rate scales with sweep concurrency — was masked previously because
older H100 recipes had fewer matrix points sharing the cluster.

Switches to the canonical `flock -w 600 + unsquashfs -l skip-if-valid +
enroot import` pattern already used in launch_h100-cw.sh, plus the
mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior
change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx requested a review from a team May 18, 2026 19:19
@functionstackx functionstackx merged commit c08baba into main May 18, 2026
@functionstackx functionstackx deleted the fix-h100-dgxc-slurm-enroot-race branch May 18, 2026 19:19
Contributor

@claude claude Bot left a comment

LGTM — straightforward port of the existing flock-serialize-import pattern from launch_h100-cw.sh to fix a concurrent sweep race on shared NFS.

Extended reasoning...

Overview

This PR modifies a single CI/runner shell script (runners/launch_h100-dgxc-slurm.sh) to wrap an enroot import in a flock + unsquashfs -l skip-if-valid check, with a LOCK_FILE companion to the existing SQUASH_FILE path. The downstream srun, salloc args, and node exclusion logic are unchanged.

Security risks

None. This is a CI/infrastructure runner script — not production inference code, not auth/crypto, not user-facing. The lock file path is derived from $IMAGE via the same sanitizer already used for $SQUASH_FILE, so there's no new injection surface. The flock fd 9 redirection and the 600s timeout are standard patterns.
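For reference, the timeout behavior of `flock -w` can be checked in isolation. This is a small sketch with short, made-up durations (3 s hold, 1 s and 10 s waits) rather than the 600 s used in the PR; the `HOLDER_PID`, `SHORT_WAIT_STATUS`, and `LONG_WAIT_STATUS` names are hypothetical.

```shell
LOCK_FILE="$(mktemp)"   # hypothetical lock path for the demo

# Holder: take the lock on fd 9 and keep it for about 3 seconds.
( exec 9>"$LOCK_FILE"; flock 9; sleep 3 ) &
HOLDER_PID=$!
sleep 1                 # give the holder time to acquire the lock

# Contender with a 1 s timeout: gives up while the holder still has the lock.
( exec 9>"$LOCK_FILE"; flock -w 1 9 )
SHORT_WAIT_STATUS=$?

# Contender with a 10 s timeout: acquires the lock once the holder exits.
( exec 9>"$LOCK_FILE"; flock -w 10 9 )
LONG_WAIT_STATUS=$?

wait "$HOLDER_PID"
rm -f "$LOCK_FILE"
echo "short wait: $SHORT_WAIT_STATUS, long wait: $LONG_WAIT_STATUS"
```

A non-zero status from the short wait is what the `|| { ...; exit 1; }` branch in the launcher reacts to, so a job that cannot get the lock within the timeout fails loudly instead of proceeding with a possibly partial file.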

Level of scrutiny

Low. This is a mechanical port of an established pattern: launch_h100-cw.sh already runs the identical block (verified at lines 17-27 of that file), and the PR description cites multiple prior PRs (#1462/#1477/#1498) applying the same fix to the mi3xx launchers. The motivation (PR #1509 hitting 13/30 race failures with explicit error logs) is well-documented.

Other factors

The change is fully contained to the else branch (non-multinode path) of one runner script. bash -n syntax check is included in the test plan. No bugs were found by the bug hunting system. The only minor nit would be that the lock file is never cleaned up, but that's harmless (it's a zero-byte file on shared NFS and matches the existing pattern in the canonical launcher).
