fix: stop genservers gracefully #319
Conversation
Greptile Summary: This PR adds graceful shutdown to the ethlambda node by stopping the actors and HTTP servers before the process exits.
Confidence Score: 3/5. Safe to merge for typical usage, but a race condition in the shutdown signalling path can cause an indefinite hang. One P1 finding (notify_waiters race that silently drops shutdown signals and causes indefinite hangs) and one P2 (no shutdown timeout). The P1 drives the score below the 4/5 ceiling. Key file: bin/ethlambda/src/main.rs — specifically the notify_waiters() call and the unbounded join awaits in the shutdown sequence.
| Filename | Overview |
|---|---|
| bin/ethlambda/src/main.rs | Adds graceful shutdown via Arc and actor context().stop(); the notify_waiters() call can silently drop notifications if spawned tasks haven't registered their waiters yet, causing an indefinite hang on shutdown. |
| crates/net/rpc/src/lib.rs | Adds a generic shutdown future parameter to both server functions and wires it into axum's with_graceful_shutdown(); clean and idiomatic change with no issues. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant Main
    participant BlockChain
    participant P2P
    participant ApiServer
    participant MetricsServer
    Main->>BlockChain: spawn actor
    Main->>P2P: spawn actor
    Main->>ApiServer: tokio::spawn (with shutdown_future)
    Main->>MetricsServer: tokio::spawn (with shutdown_future)
    Main->>Main: ctrl_c().await
    Main->>BlockChain: context().stop()
    Main->>P2P: context().stop()
    Main->>ApiServer: shutdown_notify.notify_waiters()
    Main->>MetricsServer: shutdown_notify.notify_waiters()
    Main->>BlockChain: join().await
    BlockChain-->>Main: done
    Main->>P2P: join().await
    P2P-->>Main: done
    Main->>ApiServer: api_handle.await
    ApiServer-->>Main: done
    Main->>MetricsServer: metrics_handle.await
    MetricsServer-->>Main: done
    Main->>Main: info!("Shutdown complete")
```
Review comment on bin/ethlambda/src/main.rs, line 238:
**`notify_waiters()` can silently drop shutdown signals**
`Notify::notify_waiters()` only wakes futures that are *currently registered* as waiters at the moment of the call — it does not store a permit for futures that haven't been polled yet. The shutdown futures inside the spawned tasks (`async move { X.notified().await }`) don't register themselves until axum polls the shutdown future for the first time, which happens asynchronously after `tokio::spawn`. If `ctrl_c` fires before the spawned tasks have had a chance to run, both notifications are silently dropped and `api_handle.await` / `metrics_handle.await` will block indefinitely — the process never exits.
The clearest fix is two separate `Arc<Notify>` handles each called with `notify_one()`, which stores a permit that survives until consumed:
```rust
let api_shutdown_notify = Arc::new(Notify::new());
let metrics_shutdown_notify = Arc::new(Notify::new());
// ... pass clones into spawned tasks ...
api_shutdown_notify.notify_one();
metrics_shutdown_notify.notify_one();
```
Alternatively, a `tokio::sync::watch` channel with a stored value avoids this race entirely for any number of listeners.
---
Review comment on bin/ethlambda/src/main.rs, lines 240-243:
**No timeout on shutdown awaits**
All four `await` calls — `blockchain_ref.join()`, `p2p_ref.join()`, `api_handle.await`, and `metrics_handle.await` — are unbounded. If any actor gets stuck in a long-running handler or the HTTP servers hold open long-lived connections, the process will hang indefinitely. Consider wrapping the shutdown sequence in `tokio::time::timeout` so the node can force-exit after a reasonable deadline (e.g., 30 s) even if something stalls.
Reviews (1): Last reviewed commit: "fix: stop genservers gracefully"
MegaRedHand left a comment:
Hey @dicethedev! Thank you for your contribution! I left you some comments.
Also, remember to sign your commits, as explained in our CONTRIBUTING.md
Force-pushed e22256b to 1c20bdf
Force-pushed 1c20bdf to 28a7f6f
```rust
// Handle multiple Ctrl+C signals:
// 1st Ctrl+C: graceful shutdown
// 2nd Ctrl+C: force exit immediately
tokio::spawn(async move {
    loop {
        tokio::signal::ctrl_c().await.ok();
        let count = signal_count_clone.fetch_add(1, Ordering::SeqCst) + 1;
        if count == 1 {
            info!("Shutdown signal received, stopping actors and servers...");
        } else {
            warn!("Force shutdown requested, exiting immediately");
            std::process::exit(1);
        }
    }
});

// Wait for first signal
loop {
    if signal_count.load(Ordering::SeqCst) > 0 {
        break;
    }
    tokio::time::sleep(tokio::time::Duration::from_millis(100)).await;
}
```
Let's remove the atomic and busy-loop:
```rust
// 1st ctrl+c: start graceful shutdown
tokio::signal::ctrl_c().await.ok();
info!("Shutdown signal received, stopping actors and servers...");
tokio::spawn(async move {
    let ctrlcs_to_forced_shutdown = 3;
    // This can be turned into a loop
    tokio::signal::ctrl_c().await.ok();
    warn!("Graceful shutdown in progress. Press ctrl+C 2 more times to force ungraceful shutdown");
    tokio::signal::ctrl_c().await.ok();
    warn!("Graceful shutdown in progress. Press ctrl+C 1 more time to force ungraceful shutdown");
    tokio::signal::ctrl_c().await.ok();
    info!("Forced ungraceful shutdown...");
    std::process::exit(1);
});
let blockchain_ref = blockchain.actor_ref().clone();
// etc.
```
MegaRedHand left a comment:
Left some more comments
```rust
    address: SocketAddr,
    store: Store,
    aggregator: AggregatorController,
    shutdown: tokio_util::sync::CancellationToken,
```
Let's import this at the top of the file:
```diff
-    shutdown: tokio_util::sync::CancellationToken,
+    shutdown: CancellationToken,
```
```diff
-pub async fn start_metrics_server(address: SocketAddr) -> Result<(), std::io::Error> {
+pub async fn start_metrics_server(
+    address: SocketAddr,
+    shutdown: tokio_util::sync::CancellationToken,
```
🗒️ Description
This PR implements graceful shutdown for the ethlambda node's actors and HTTP servers. Previously, the node would abruptly terminate on Ctrl+C without waiting for in-flight operations to complete.
Now, on shutdown signal:
- actors are stopped via their `stop()` method
- HTTP servers terminate via axum's `with_graceful_shutdown()`

These changes ensure that in-flight operations complete before the process exits. This prevents data corruption and improves reliability.
🔗 Related Issues or PRs
Closes #195