Skip to content

fix: release shells mutex before spawn_subshell to prevent deadlock#342

Closed
ibash-corpusant wants to merge 1 commit intoshell-pool:masterfrom
ibash-corpusant:fix/shells-mutex-deadlock
Closed

fix: release shells mutex before spawn_subshell to prevent deadlock#342
ibash-corpusant wants to merge 1 commit intoshell-pool:masterfrom
ibash-corpusant:fix/shells-mutex-deadlock

Conversation

@ibash-corpusant
Copy link
Copy Markdown

Problem

select_shell_desc() holds the global shells mutex while calling spawn_subshell(), which calls wait_for_startup(). If wait_for_startup blocks (e.g. due to a version mismatch between client and daemon, or a shell that fails to start), the mutex is held forever, deadlocking the entire daemon. All operations (list, attach, detach, kill) that need the shells mutex hang indefinitely.

This was observed in production when a daemon that had been running for 2+ days encountered a client/daemon binary mismatch after a package update. The wait_for_startup sentinel handshake hung, and from that point on, every shpool list, shpool attach, etc. blocked forever.

Fix

Refactor select_shell_desc() into three phases:

Phase Lock held? What it does
Phase 1 ✅ Yes Determine whether to create a new session or reuse existing
Phase 2 ❌ No Call spawn_subshell() (which may block in wait_for_startup)
Phase 3 ✅ Yes (briefly) Insert session into table, return Arc references

The critical change: spawn_subshell() is now called after the shells mutex is released. If wait_for_startup blocks, only the attach thread is affected — list, kill, detach, and other attach requests proceed normally.

Also adds a 30-second timeout to wait_for_startup() as defense-in-depth. Previously it was an infinite loop.

Regression Test

New test (deadlock.rs) that:

  1. Starts a daemon with a fake shell that never produces the startup sentinel
  2. Spawns an attach (which hangs in wait_for_startup)
  3. Asserts that shpool list still completes within 5 seconds

Before fix: DEADLOCK DETECTED — list times out after 5s
After fix: list returns immediately, test passes in 3.2s

Test Results

All existing tests pass (51 passed, 1 pre-existing prompt_prefix_fish failure unrelated to this change).

select_shell_desc() previously held the global 'shells' mutex while
calling spawn_subshell(), which calls wait_for_startup(). If
wait_for_startup blocks (e.g. due to a version mismatch between client
and daemon, or a shell that fails to start), the mutex is held forever,
deadlocking the entire daemon. All operations (list, attach, detach,
kill) that need the shells mutex will hang indefinitely.

Fix: Refactor select_shell_desc() into three phases:
1. Hold the lock to determine whether to create/reuse a session
2. Release the lock, then call spawn_subshell (may block)
3. Re-acquire the lock briefly to insert the session and return refs

Also adds a 30-second timeout to wait_for_startup() as defense-in-depth,
and a regression test that proves list does not deadlock during a slow
shell spawn.
@google-cla
Copy link
Copy Markdown

google-cla Bot commented Apr 21, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@ethanpailes
Copy link
Copy Markdown
Contributor

Please explain the extent to which you used AI to generate this PR. It will help me know how closely I have to read it while reviewing.

@ibash-corpusant
Copy link
Copy Markdown
Author

I frequently encountered error not being able to reconnect to shpool sessions on my cloudtop. I had smith grab the logs, investigate the code, and then write a test for this case. I also had smith write the fix. So everything is ai generated.

@ethanpailes
Copy link
Copy Markdown
Contributor

Thanks! It will probably take me a minute to review.

@ethanpailes
Copy link
Copy Markdown
Contributor

This patch is pretty big and from the part that I've looked at in depth (the regression test) does not really match the project style that well. I reworked the test in https://github.com/shell-pool/shpool/pull/new/shell-spawn-deadlock-fix and I'm going to see if there is a fix that does not purturb the code quite as much.

ethanpailes added a commit that referenced this pull request Apr 24, 2026
This patch fixes a deadlock that could occur if the shell blocks
forever on startup.

No actual fix yet, just a regression test.

See #342 for context.
@ethanpailes
Copy link
Copy Markdown
Contributor

I just pushed a fix. I think I also want to have wait_for_startup clean itself up after a while though so we don't leak threads.

ethanpailes added a commit that referenced this pull request Apr 24, 2026
This patch fixes a deadlock that could occur if the shell blocks
forever on startup. To fix, we only grab the global shells lock
when looking for an existing session and then grab it again when
we modify or examine the global shells table, but critically we
are not holding the lock while spawning the shell for the new session.

This prevents a shell that never finishes starting up from
locking up the whole daemon by retaining ownership of the global
shells table lock.

See #342 for context.
ethanpailes added a commit that referenced this pull request Apr 24, 2026
This patch fixes a deadlock that could occur if the shell blocks
forever on startup. To fix, we only grab the global shells lock
when looking for an existing session and then grab it again when
we modify or examine the global shells table, but critically we
are not holding the lock while spawning the shell for the new session.

This prevents a shell that never finishes starting up from
locking up the whole daemon by retaining ownership of the global
shells table lock.

See #342 for context.
@ethanpailes
Copy link
Copy Markdown
Contributor

Now that I dug into it, the application code changes look better than the test by quite a bit, though there is still a fair bit of agent-changing-things-for-no-clear-reason that I see. There is one issue I noticed, which is that the changes to prompt.rs to wait for a timeout only check the timeout every time the loop ticks, which is not quite right because read() could potentially block forever. To properly implement timeouts on a blocking fd, we need to gate the read() call behind poll() so that it always succeeds. That way the loop is usually parked on poll, which wakes at standard intervals.

I re-wrote this as #348 to remove some of the agent bloat.

ethanpailes added a commit that referenced this pull request Apr 24, 2026
This patch fixes a deadlock that could occur if the shell blocks
forever on startup. To fix, we only grab the global shells lock
when looking for an existing session and then grab it again when
we modify or examine the global shells table, but critically we
are not holding the lock while spawning the shell for the new session.

This prevents a shell that never finishes starting up from
locking up the whole daemon by retaining ownership of the global
shells table lock.

See #342 for context.
ethanpailes added a commit that referenced this pull request Apr 24, 2026
This patch fixes a deadlock that could occur if the shell blocks
forever on startup. To fix, we only grab the global shells lock
when looking for an existing session and then grab it again when
we modify or examine the global shells table, but critically we
are not holding the lock while spawning the shell for the new session.

This prevents a shell that never finishes starting up from
locking up the whole daemon by retaining ownership of the global
shells table lock.

See #342 for context.
ethanpailes added a commit that referenced this pull request Apr 24, 2026
This patch fixes a deadlock that could occur if the shell blocks
forever on startup. To fix, we only grab the global shells lock
when looking for an existing session and then grab it again when
we modify or examine the global shells table, but critically we
are not holding the lock while spawning the shell for the new session.

This prevents a shell that never finishes starting up from
locking up the whole daemon by retaining ownership of the global
shells table lock.

See #342 for context.
@ethanpailes
Copy link
Copy Markdown
Contributor

I merged the rewritten patch. Sorry, I know re-writing your patch is annoying. I think I need to update the AI policy to be more unfront that I might do that if a patch feels AI generated, I'm still adapting to this new agentic world.

I really do appreciate this change though! The reproducible test was super valuable!

@ibash-corpusant
Copy link
Copy Markdown
Author

yay thank you! and I don't mind the rewrite, just happy it's fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants