connection: add per-peer exponential reconnection backoff#951
Draft
Jolah1 wants to merge 2 commits into
Draft
Conversation
getting updates
Previously, the background reconnection task retried every persisted peer on a fixed 60s interval with no backoff, so an unreachable peer was retried indefinitely at the same cadence — log spam and wasted work. This became more visible after lightningdevkit#895 retained peers across force-closes so that channel_reestablish recovery can run. Track per-peer reconnect state in ConnectionManager: on failure, double the retry interval up to PEER_RECONNECTION_MAX_INTERVAL (30 min); on success (including user-initiated connects), clear the state so a subsequent drop retries promptly. The 60s tokio::time::interval is kept as the wakeup, gated per-peer by next_retry_at, since lightningdevkit#588's inline-sleep form does not generalize to N peers. Backoff state is in-memory and resets on restart — a fresh post-restart attempt is the correct behavior. State is also cleared when a peer is removed from the persisted store. Closes lightningdevkit#918.
|
👋 Hi! This PR is now in draft status. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #918.
Summary
Today the reconnection loop retries every disconnected peer on a fixed
PEER_RECONNECTION_INTERVAL(60s). For a peer that is persistently unreachable this means a reconnect attempt every minute indefinitely, which is wasteful and noisy. This adds per-peer exponential backoff so repeated failures are spaced out, while a successful connection (or a user-initiated connect) resets a peer back to the base interval.What this does
PeerReconnectStateinconnection.rs, tracked per peer in aMutex<HashMap<PublicKey, PeerReconnectState>>onConnectionManager.PEER_RECONNECTION_MAX_INTERVAL(30 min).is_reconnect_duerather than attempting every peer every tick.do_connect_peer), on explicitdisconnect(), and when a peer is removed after its last channel closes.reconnection also resets backoff.
Tests
reconnect_state_doubles_until_capped- verifies doubling and the cap.reconnect_state_schedules_relative_to_failure_time- verifies the next retry is scheduled relative to the failure instant.A note on "configurable"
The issue title says "configurable." This PR keeps the base/cap as
pub(crate)constants (matching the existing reconnection-interval constant) rather than exposing them onConfig. Happy to wire them intoConfigas a follow-up (or here) if reviewers prefer — wanted to keep the surface minimal first.Dependency / draft status
Marking this as draft: it builds on #895, which relocates last-channel
peer removal into the
event.rsChannelClosedhandler. Once #895 lands Iwill rebase and move the
clear_reconnect_statecall to sit alongside therelocated removal (and pick up the now-async
remove_peer.awaits from #919). Will un-draft after #895 is merged.