fix(opencode): bound SSE event backlogs and disconnect stalled consumers #31922
Open
MartinCajiao wants to merge 1 commit into
Each event stream connection buffers events for SSE delivery in an unbounded queue, and the only cleanup signal is the response fiber ending. A TCP connection stuck in CLOSE_WAIT never surfaces an abort, so a half-dead subscriber keeps its listener registered while its queue absorbs the full event firehose without limit. This is the SSE leak pattern reported in anomalyco#20695 (24.5GB RSS with zombie CLOSE_WAIT connections).

The backlog is now bounded. On overflow the queue ends instead of dropping events silently: buffered events still flush, the response completes, finalizers unsubscribe the listener, and a live client reconnects and resyncs from the event log (sync events carry a total order). The instance endpoint also filters events before buffering so other instances' events never occupy a subscriber's backlog, and the global endpoint now registers its listener eagerly, matching the instance endpoint.

The regression test holds an unconsumed connection and floods past capacity: it times out on the previous handlers and passes now. A companion test drains 12k events through the 10k backlog and stays connected, showing that the bound applies to the undrained backlog, not to throughput.
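For reviewers who want the overflow policy in miniature, here is a rough sketch of the idea, not the code in this PR. Every name in it (`BoundedBacklog`, `subscribe`, `emit`, `SyncEvent`) and the tiny capacity of 3 are assumptions made for illustration. It shows three things from the description: events are filtered before they are buffered, an overflow ends the backlog and runs the finalizer that unsubscribes the listener, and events already buffered can still be flushed.

```typescript
// Hypothetical sketch: names, shapes, and the capacity of 3 are illustrative only.
type SyncEvent = { instance: string; seq: number }
type Listener<T> = (event: T) => void

class BoundedBacklog<T> {
  private buffer: T[] = []
  private ended = false

  constructor(
    private readonly capacity: number,
    private readonly onEnd: () => void, // finalizer, e.g. unsubscribe the listener
  ) {}

  // Returns false once the backlog has ended; nothing is buffered after that.
  offer(event: T): boolean {
    if (this.ended) return false
    if (this.buffer.length >= this.capacity) {
      // Overflow: end the stream rather than drop this event silently.
      this.ended = true
      this.onEnd()
      return false
    }
    this.buffer.push(event)
    return true
  }

  // Buffered events can still be drained after the backlog has ended.
  take(max: number): T[] {
    return this.buffer.splice(0, max)
  }

  // True when the stream has ended and everything buffered was flushed.
  get done(): boolean {
    return this.ended && this.buffer.length === 0
  }
}

function subscribe(
  listeners: Set<Listener<SyncEvent>>,
  instance: string,
  capacity: number,
): BoundedBacklog<SyncEvent> {
  const backlog = new BoundedBacklog<SyncEvent>(capacity, () => listeners.delete(listener))
  function listener(event: SyncEvent): void {
    // Filter before buffering so other instances' events never occupy the backlog.
    if (event.instance === instance) backlog.offer(event)
  }
  listeners.add(listener)
  return backlog
}

function emit(listeners: Set<Listener<SyncEvent>>, event: SyncEvent): void {
  // Copy first: a listener may unsubscribe itself during delivery.
  for (const l of Array.from(listeners)) l(event)
}

// A subscriber for instance "a" that never drains, flooded past its capacity.
const listeners = new Set<Listener<SyncEvent>>()
const backlog = subscribe(listeners, "a", 3)
for (let seq = 0; seq < 5; seq++) {
  emit(listeners, { instance: "b", seq })
  emit(listeners, { instance: "a", seq })
}
const flushed = backlog.take(10)
```

In this run the fourth event for instance "a" overflows the backlog, so the listener is removed from `listeners` and the three buffered events remain available to flush. A real handler would complete the HTTP response at that point and rely on the client reconnecting.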
Contributor
Thanks for your contribution! This PR doesn't have a linked issue. All PRs must reference an existing issue. Please see CONTRIBUTING.md for details.
Contributor
Thanks for updating your PR! It now meets our contributing guidelines. 👍
Issue for this PR
Closes #22198
Type of change
What does this PR do?
Bounds the per-subscriber SSE backlog and disconnects consumers that stop draining.
Both event endpoints buffer events in an unbounded queue per connection, and the only cleanup is the response fiber ending. A connection stuck in CLOSE_WAIT never delivers that signal, so the listener stays registered and the queue absorbs every event the server emits. This is the leak measured in #20695 (24.5GB RSS with zombie connections, #20695 (comment)).
The backlog is now bounded at 10k (sizing rationale in backlog.ts). On overflow the queue ends rather than dropping events silently: buffered events still flush, the response completes, finalizers unsubscribe the listener, and a live client reconnects and resyncs from the event log. Disconnection is the safe policy because sync events carry a total order (src/sync/README.md), so a reconnect recovers cleanly, while silently dropped events would corrupt a live client's view. The instance endpoint now filters before buffering so other instances' events never occupy a subscriber's backlog, and the global endpoint registers its listener eagerly, matching the instance endpoint.
How did you verify your code works?
A new regression test holds an unconsumed connection and floods past capacity: it times out on the previous handlers (the stream never terminates) and passes with this change. A companion test drains 12k events through the 10k backlog with batch pacing and stays connected, so the bound applies to the undrained backlog, not to throughput. The three existing event endpoint tests pass unchanged.
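The difference between the two tests can be modelled with a simple depth counter. This is a hypothetical model, not the test code in this PR: the function name `staysConnected`, the batch size of 1,000, and the assumption that the consumer drains a fixed amount between batches are all illustrative. Only the 12k total and the 10k capacity come from the description above.

```typescript
// Hypothetical model: tracks only the number of undrained events.
function staysConnected(
  total: number, // events emitted overall
  batch: number, // events emitted per batch (assumed pacing)
  capacity: number, // backlog bound
  drainPerBatch: number, // events the consumer reads between batches
): boolean {
  let depth = 0
  for (let sent = 0; sent < total; sent += batch) {
    const n = Math.min(batch, total - sent)
    // Overflow of the undrained backlog ends the stream.
    if (depth + n > capacity) return false
    depth += n
    // The consumer drains between batches; depth never goes negative.
    depth = Math.max(0, depth - drainPerBatch)
  }
  return true
}

// A consumer that keeps up: 12k events pass through a 10k backlog.
const paced = staysConnected(12_000, 1_000, 10_000, 1_000)
// A consumer that never reads: the same flood overflows the backlog.
const stalled = staysConnected(12_000, 1_000, 10_000, 0)
```

Under these assumptions `paced` is `true` and `stalled` is `false`: the total volume is identical, and only the undrained depth decides whether the connection is ended.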
bun test test/server/httpapi-event.test.ts: 5 pass. tsgo --noEmit clean, oxlint 0 errors, prettier applied.
Screenshots / recordings
Not a UI change.
Checklist