[Klaud Cold] runners(h100-dgxc-slurm): flock the enroot import to fix concurrent-sweep race #1510
…weep race

The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE` without any locking, so when multiple sweep jobs landed on the cluster simultaneously they all tried to import the same image into the shared NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. The first one wins; the rest crash with `[ERROR] File already exists: ...sqsh`, followed by `OSError: [Errno 116] Stale file handle` (from the partial sqsh) once sglang/vllm tries to start.

Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs failed, all hitting the same race on h100-dgxc-slurm_0 + _1. The failure rate scales with sweep concurrency; it was masked previously because older H100 recipes had fewer matrix points sharing the cluster.

Switches to the canonical `flock -w 600` + `unsquashfs -l` skip-if-valid + `enroot import` pattern already used in launch_h100-cw.sh and in the mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
LGTM — straightforward port of the existing flock-serialize-import pattern from launch_h100-cw.sh to fix a concurrent sweep race on shared NFS.
Extended reasoning...
Overview
This PR modifies a single CI/runner shell script (runners/launch_h100-dgxc-slurm.sh) to wrap an enroot import in a flock + unsquashfs -l skip-if-valid check, with a LOCK_FILE companion to the existing SQUASH_FILE path. The downstream srun, salloc args, and node exclusion logic are unchanged.
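For readers unfamiliar with the pattern described above, a minimal sketch follows. It is not the launcher's actual block: `import_once`, its argument order, and the `rm -f` of a stale partial file are assumptions made for illustration. Only `flock -w 600`, fd 9, and the `unsquashfs -l` validity check come from the PR text.

```bash
#!/usr/bin/env bash
# Sketch only: serialize a shared-image import behind an exclusive lock.
# import_once and its arguments are hypothetical names, not the launcher's code.
import_once() {
    local squash_file=$1 lock_file=$2
    shift 2    # remaining arguments: the import command to run under the lock
    (
        # Wait up to 600 s for an exclusive lock on fd 9 (opened on the lock file below).
        flock -w 600 9 || { echo "timed out waiting for $lock_file" >&2; exit 1; }
        # Skip the import if an earlier lock holder already produced a listable image.
        if unsquashfs -l "$squash_file" >/dev/null 2>&1; then
            echo "reusing $squash_file"
        else
            # Assumption: remove any partial file so the import does not hit
            # "File already exists".
            rm -f "$squash_file"
            "$@"
        fi
    ) 9>"$lock_file"
}

# Hypothetical usage (the exact srun flags and image URI are assumptions):
#   import_once "$SQUASH_FILE" "$LOCK_FILE" \
#       srun enroot import -o "$SQUASH_FILE" "docker://$IMAGE"
```

Because the validity check runs only after the lock is acquired, jobs that were waiting re-check the image and reuse it instead of importing again.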
Security risks
None. This is a CI/infrastructure runner script — not production inference code, not auth/crypto, not user-facing. The lock file path is derived from $IMAGE via the same sanitizer already used for $SQUASH_FILE, so there's no new injection surface. The flock fd 9 redirection and the 600s timeout are standard patterns.
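To illustrate why a lock path built from the same sanitized name adds no new injection surface, here is a hedged sketch. The helper names `sanitize_image` and `image_paths`, the allowed character set, and the `.lock` suffix are all assumptions; the real launcher's sanitizer may differ.

```bash
#!/usr/bin/env bash
# Illustrative only: sanitize_image and image_paths are hypothetical helpers.
sanitize_image() {
    # Replace every character outside [A-Za-z0-9._-] with an underscore.
    printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
}

image_paths() {
    local image=$1 dir=$2 name
    name=$(sanitize_image "$image")
    # Both paths share one sanitized name, so the lock file introduces no
    # characters that the squash file path did not already allow.
    SQUASH_FILE="$dir/$name.sqsh"
    LOCK_FILE="$SQUASH_FILE.lock"   # assumed naming; the PR only says "LOCK_FILE companion"
}
```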
Level of scrutiny
Low. This is a mechanical port of an established pattern: launch_h100-cw.sh already runs the identical block (verified at lines 17-27 of that file), and the PR description cites multiple prior PRs (#1462/#1477/#1498) applying the same fix to the mi3xx launchers. The motivation (PR #1509 hitting 13/30 race failures with explicit error logs) is well-documented.
Other factors
The change is fully contained to the else branch (non-multinode path) of one runner script. bash -n syntax check is included in the test plan. No bugs were found by the bug hunting system. The only minor nit would be that the lock file is never cleaned up, but that's harmless (it's a zero-byte file on shared NFS and matches the existing pattern in the canonical launcher).
Summary
Adds the canonical `flock` + `unsquashfs -l` skip-if-valid + `enroot import` pattern to `runners/launch_h100-dgxc-slurm.sh`. Matches what `launch_h100-cw.sh` and the mi3xx launchers (#1462/#1477/#1498) already do.
Why
PR #1509 (qwen3.5-fp8-h100-sglang new recipe) failed 13/30 matrix jobs on the dgxc-slurm runners. All 13 hit `[ERROR] File already exists: ...sqsh`, then sglang followed with `OSError: [Errno 116] Stale file handle` from the partial sqsh once the next jobs tried to mount it.
Root cause: concurrent matrix jobs all ran `srun enroot import -o <shared NFS path>` without a lock. The first one wins; the rest crash. The shared NFS storage (`/mnt/nfs/lustre/containers/`) makes the race deterministic at high sweep concurrency. The other h100 launcher variant (`launch_h100-cw.sh`) had this fix already; the dgxc-slurm variant didn't.
Diff
Wraps the import in a `flock -w 600` block with an `unsquashfs -l` skip-if-valid check. No other behavior change. The `--exclude=$SLURM_EXCLUDED_NODELIST`, salloc args, and downstream srun stay the same.
Test plan
`bash -n runners/launch_h100-dgxc-slurm.sh` syntax-checks.
🤖 Generated with Claude Code