CI Error "OSError: [Errno 28] No space left on device" #8838

@ericspod

Describe the bug
On occasion, CI jobs fail with this error, often because pip downloads very large CUDA packages. Recent commits partly address this, but it needs to be fully fixed in conjunction with #8561.


Copilot output from one example failure:

The job is failing because the runner disk fills up while pip is trying to create an isolated build environment and pull in large CUDA-related wheels as build dependencies:

  • Failure point (from logs): during “Installing build dependencies” for pip install -e . (with BUILD_MONAI=1), pip errors with
    OSError: [Errno 28] No space left on device
  • The downloaded build deps include multiple large nvidia-*cu12 packages, which is what typically fills the GitHub-hosted runner's disk.

Fix: stop pip from pulling CUDA build deps (use CPU-only build deps) and/or avoid build isolation during the compiled install

1) Ensure the compiled install uses --no-build-isolation

In .github/workflows/pythonapp.yml, the “Run compiled” step currently does:

- name: Run compiled (${{ runner.os }})
  run: |
    python -m pip uninstall -y monai
    BUILD_MONAI=1 python -m pip install -e .  # compile the cpp extensions
  shell: bash

Change it to:

- name: Run compiled (${{ runner.os }})
  run: |
    python -m pip uninstall -y monai
    # avoid creating an isolated env that may pull huge CUDA build deps
    BUILD_MONAI=1 python -m pip install --no-build-isolation -e .
  shell: bash

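Why this helps: build isolation forces pip to resolve and install whatever is listed in pyproject.toml’s build-system requirements into a temporary env. If that includes (directly or indirectly) CUDA-enabled Torch dependencies, the runner can run out of space. Note that with --no-build-isolation pip no longer installs the build requirements itself, so they must already be present in the environment before this step runs.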

2) Pin the build dependency Torch to CPU wheels in pyproject.toml (recommended)

If MONAI’s pyproject.toml build-system requires torch (common for extension builds), ensure it resolves to CPU wheels on CI by either:

  • removing torch from build-system.requires if it’s not strictly necessary at build time, or
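  • constraining it so it doesn’t drag in CUDA packages. Note that build-system.requires only accepts requirement strings, so the choice of package index has to come from pip’s configuration rather than from pyproject.toml.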

Concretely, in pyproject.toml under:

[build-system]
requires = [ ... ]

Prefer a minimal set (example):

[build-system]
requires = ["setuptools", "wheel", "pybind11"]
build-backend = "setuptools.build_meta"

If torch must remain there for compilation logic, then the workflow should install the CPU torch first (you already do), and you should rely on --no-build-isolation (Fix #1) so the build uses that already-installed CPU torch instead of re-resolving a potentially CUDA variant.
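
One way to confirm that the environment really holds CPU-only packages before the compiled install is to look for installed nvidia-* distributions. This is a rough sketch, assuming CUDA runtime wheels keep the nvidia- name prefix seen in the logs; find_cuda_wheels and installed_names are hypothetical helper names.

```python
from importlib import metadata


def find_cuda_wheels(names):
    """Return the sorted, normalized names that carry the nvidia- prefix."""
    normalized = (name.lower().replace("_", "-") for name in names)
    return sorted(name for name in normalized if name.startswith("nvidia-"))


def installed_names():
    """Names of all distributions visible to the current interpreter."""
    names = []
    for dist in metadata.distributions():
        # Skip distributions with missing or unreadable metadata.
        name = dist.metadata.get("Name") if dist.metadata else None
        if name:
            names.append(name)
    return names


if __name__ == "__main__":
    found = find_cuda_wheels(installed_names())
    if found:
        print("CUDA wheels present: " + ", ".join(found))
    else:
        print("no nvidia-* packages installed")
```

Run after the CPU torch install and before the compiled step, a non-empty result would suggest that a CUDA variant of torch was resolved somewhere earlier in the job.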

3) Add a quick disk cleanup before the compiled install (optional, but useful)

You already clean /opt/hostedtoolcache in some jobs/steps; add similar cleanup right before compiling on Linux to maximize free space:

- if: runner.os == 'Linux'
  name: Free disk space (Linux only)
  run: |
    df -h
    sudo rm -rf /opt/hostedtoolcache/CodeQL || true
    sudo rm -rf /opt/hostedtoolcache/go || true
    sudo rm -rf /opt/hostedtoolcache/Java* || true
    sudo rm -rf /opt/hostedtoolcache/Ruby* || true
    df -h
  shell: bash
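
A pre-flight check can complement the cleanup by reporting early when too little space remains, rather than failing partway through a large download. This is a minimal sketch using only the standard library; the 10 GiB threshold is an arbitrary placeholder, not a measured requirement for this job.

```python
import shutil

GIB = 1024**3


def has_enough_space(free_bytes: int, required_gib: float) -> bool:
    """True if free_bytes covers at least required_gib gibibytes."""
    return free_bytes >= required_gib * GIB


def free_bytes(path: str = ".") -> int:
    """Free bytes on the filesystem that contains path."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    required = 10  # placeholder threshold in GiB; tune for the actual job
    free = free_bytes()
    print(f"free: {free / GIB:.1f} GiB, required: {required} GiB")
    print("ok" if has_enough_space(free, required) else "insufficient space")
```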

Minimal change that should unblock PRs

If you want the smallest, highest-signal fix: apply Fix #1 (add --no-build-isolation to the compiled install step). That directly targets the failing phase shown in the logs (“Installing build dependencies”) and typically prevents pip from downloading the large nvidia-*cu12 wheels that exhaust disk space.
