[Klaud Cold] runners(h100-dgxc-slurm): flock the enroot import to fix concurrent-sweep race by functionstackx · Pull Request #1510 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-18T19:19:53Z

Summary

Adds the canonical flock + unsquashfs -l skip-if-valid + enroot import pattern to runners/launch_h100-dgxc-slurm.sh. This matches what launch_h100-cw.sh and the mi3xx launchers (#1462/#1477/#1498) already do.

Why

PR #1509 (qwen3.5-fp8-h100-sglang new recipe) failed 13/30 matrix jobs on the dgxc-slurm runners. All 13 hit:

[ERROR] File already exists: /mnt/nfs/lustre/containers/lmsysorg_sglang_v0.5.12-cu130.sqsh
srun: error: hpc-gpu-1-12: task 0: Exited with exit code 1

…then sglang followed with OSError: [Errno 116] Stale file handle from the partial sqsh once the next jobs tried to mount it.

Root cause: concurrent matrix jobs all srun enroot import -o <shared NFS path> without a lock. First one wins; the rest crash. The shared NFS storage (/mnt/nfs/lustre/containers/) makes the race deterministic at high sweep concurrency. The other h100 launcher variant (launch_h100-cw.sh) had this fix already; the dgxc-slurm variant didn't.
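The "first one wins; the rest crash" behavior can be reproduced locally without a cluster. The sketch below is an illustration only, not the launcher code: it assumes a POSIX shell, uses `set -o noclobber` as a stand-in for enroot refusing to overwrite an existing output file, and all names (`unlocked_import`, `target`, `results`) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the unlocked race: five workers create the same path with no lock.
# noclobber makes `>` fail when the file already exists, standing in for the
# "File already exists" error. All names here are hypothetical.
set -u

workdir=$(mktemp -d)
target="$workdir/image.sqsh"
results="$workdir/results.log"

unlocked_import() {
    # Subshell so the noclobber option does not leak into the caller.
    if ( set -o noclobber; echo "payload" > "$target" ) 2>/dev/null; then
        echo "win" >> "$results"
    else
        echo "lose" >> "$results"
    fi
}

for i in 1 2 3 4 5; do
    unlocked_import &
done
wait

winners=$(grep -c '^win$' "$results")
losers=$(grep -c '^lose$' "$results")
echo "winners=$winners losers=$losers"
```

Exactly one worker creates the file and the other four fail, which mirrors the failure shape above. The sketch does not model the partial-file read that produced the stale file handle error.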

Diff

Wraps the import in:

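# Hold fd 9 on the lock file; flock waits up to 600 s for other importers.
exec 9>"$LOCK_FILE"
flock -w 600 9 || { echo "Failed to acquire lock for $SQUASH_FILE" >&2; exit 1; }
# Skip the import when the existing squash file lists cleanly.
if unsquashfs -l "$SQUASH_FILE" > /dev/null 2>&1; then
    echo 'Squash file already exists and is valid, skipping import'
else
    # Remove any partial file left by an interrupted import, then re-import.
    rm -f "$SQUASH_FILE"
    enroot import -o "$SQUASH_FILE" "docker://$IMAGE"
fi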

No other behavior change. The --exclude=$SLURM_EXCLUDED_NODELIST, salloc args, and downstream srun stay the same.
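The lock-then-validate pattern can be exercised without enroot or squashfs-tools. The following is a minimal sketch under stated assumptions, not the launcher itself: `fake_import` and `is_valid` are hypothetical stand-ins for `enroot import` and `unsquashfs -l`, the paths live in a temporary directory, and `flock` from util-linux is assumed to be installed.

```shell
#!/usr/bin/env bash
# Sketch: serialize a shared-artifact import with flock, skipping when valid.
# fake_import and is_valid are hypothetical stand-ins for `enroot import`
# and `unsquashfs -l`; they are not part of the launcher.
set -u

workdir=$(mktemp -d)
artifact="$workdir/image.sqsh"
lock_file="$workdir/image.sqsh.lock"
import_log="$workdir/imports.log"

# Stand-in validity check: a non-empty file counts as valid.
is_valid() { [ -s "$1" ]; }

# Stand-in import: record that it ran, take a moment, then write the file.
fake_import() {
    echo "import" >> "$import_log"
    sleep 0.2
    echo "payload" > "$1"
}

import_once() {
    (
        # fd 9 is opened on the lock file by the redirection below.
        flock -w 60 9 || { echo "Failed to acquire lock for $artifact" >&2; exit 1; }
        if is_valid "$artifact"; then
            echo "already valid, skipping import"
        else
            rm -f "$artifact"
            fake_import "$artifact"
        fi
    ) 9>"$lock_file"
}

# Five concurrent workers, as in a small matrix sweep.
for i in 1 2 3 4 5; do
    import_once > /dev/null &
done
wait

import_count=$(wc -l < "$import_log" | tr -d ' ')
echo "imports performed: $import_count"
```

Only the first worker to take the lock runs the import; the others block, then see a valid file and skip. Wrapping the critical section in a subshell releases the lock when the subshell exits, whereas the `exec 9>` form keeps it for the lifetime of the calling shell.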

Test plan

bash -n runners/launch_h100-dgxc-slurm.sh syntax-checks.
After merge: the rebased sweep for #1509 ([Klaud Cold] Add qwen3.5-fp8-h100-sglang (off + mtp) recipes) re-runs without the import race.
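For context on the first item, `bash -n` parses a script and reports syntax errors without executing any command. A small sketch of what it does and does not catch, using hypothetical throwaway files rather than the launcher:

```shell
#!/usr/bin/env bash
# Sketch: bash -n parses without executing. File names are hypothetical.
set -u

workdir=$(mktemp -d)

# Well-formed script: the parser accepts it and the echo never runs.
cat > "$workdir/good.sh" <<'EOF'
if true; then
    echo "ok"
fi
EOF

# Malformed script: the closing `fi` is missing.
cat > "$workdir/bad.sh" <<'EOF'
if true; then
    echo "missing fi"
EOF

bash -n "$workdir/good.sh"; good_status=$?
bash -n "$workdir/bad.sh" 2>/dev/null; bad_status=$?
echo "good=$good_status bad=$bad_status"
```

Because nothing is executed, the syntax check says nothing about whether the lock is actually taken at runtime, which is what the post-merge sweep re-run covers.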

🤖 Generated with Claude Code

…weep race

The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE` without any locking, so when multiple sweep jobs landed on the cluster simultaneously they all tried to import the same image into the shared NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. First one wins; the rest crash with `[ERROR] File already exists: ...sqsh` and `OSError: [Errno 116] Stale file handle` (from the partial sqsh) once sglang/vllm tries to start.

Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs failed, all hitting the same race on h100-dgxc-slurm_0 + _1. The failure rate scales with sweep concurrency; it was masked previously because older H100 recipes had fewer matrix points sharing the cluster.

Switches to the canonical `flock -w 600` + `unsquashfs -l` skip-if-valid + `enroot import` pattern already used in launch_h100-cw.sh, plus the mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude

LGTM — straightforward port of the existing flock-serialize-import pattern from launch_h100-cw.sh to fix a concurrent sweep race on shared NFS.

Extended reasoning...

Overview

This PR modifies a single CI/runner shell script (runners/launch_h100-dgxc-slurm.sh) to wrap an enroot import in a flock + unsquashfs -l skip-if-valid check, with a LOCK_FILE companion to the existing SQUASH_FILE path. The downstream srun, salloc args, and node exclusion logic are unchanged.

Security risks

None. This is a CI/infrastructure runner script — not production inference code, not auth/crypto, not user-facing. The lock file path is derived from $IMAGE via the same sanitizer already used for $SQUASH_FILE, so there's no new injection surface. The flock fd 9 redirection and the 600s timeout are standard patterns.
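The sanitizer itself is not shown in this PR. As a rough illustration of how a lock path could be derived alongside the squash path, the sketch below makes several assumptions: the image reference `lmsysorg/sglang:v0.5.12-cu130` is inferred from the file name in the error log, the character mapping (`/`, `:`, `@`, `#` to `_`) is a guess, and the `.lock` suffix and variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: derive squash and lock paths from one sanitized image name.
# The image reference, character mapping, and .lock suffix are assumptions;
# the launcher's real sanitizer is not shown in this PR.
set -u

image="lmsysorg/sglang:v0.5.12-cu130"   # assumed, inferred from the error log
container_dir="/mnt/nfs/lustre/containers"

# Replace characters that are awkward in file names with underscores.
safe_name=$(printf '%s' "$image" | tr '/:@#' '____')

squash_name="$safe_name.sqsh"
squash_file="$container_dir/$squash_name"
lock_file="$squash_file.lock"           # hypothetical naming

echo "squash: $squash_file"
echo "lock:   $lock_file"
```

Deriving both paths from the same sanitized string means the lock name is constrained in the same way as the squash file name, which is the property the review relies on.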

Level of scrutiny

Low. This is a mechanical port of an established pattern: launch_h100-cw.sh already runs the identical block (verified at lines 17-27 of that file), and the PR description cites multiple prior PRs (#1462/#1477/#1498) applying the same fix to the mi3xx launchers. The motivation (PR #1509 hitting 13/30 race failures with explicit error logs) is well-documented.

Other factors

The change is fully contained to the else branch (non-multinode path) of one runner script. bash -n syntax check is included in the test plan. No bugs were found by the bug hunting system. The only minor nit would be that the lock file is never cleaned up, but that's harmless (it's a zero-byte file on shared NFS and matches the existing pattern in the canonical launcher).

functionstackx requested a review from a team May 18, 2026 19:19

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

functionstackx merged commit c08baba into main May 18, 2026

functionstackx deleted the fix-h100-dgxc-slurm-enroot-race branch May 18, 2026 19:19

github-project-automation Bot moved this to Done in InferenceMAX Board May 18, 2026

claude Bot reviewed May 18, 2026

functionstackx merged 1 commit into main from fix-h100-dgxc-slurm-enroot-race
