fix(opencode): bound SSE event backlogs and disconnect stalled consumers #31922

Open
MartinCajiao wants to merge 1 commit into anomalyco:dev from MartinCajiao:bounded-event-queues

Conversation

@MartinCajiao MartinCajiao commented Jun 11, 2026

Issue for this PR

Closes #22198

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

Bounds the per-subscriber SSE backlog and disconnects consumers that stop draining.

Both event endpoints buffer events in an unbounded queue per connection, and the only cleanup signal is the response fiber ending. A connection stuck in CLOSE_WAIT never delivers that signal, so the listener stays registered and the queue absorbs every event the server emits. This is the leak measured in #20695 (24.5 GB RSS with zombie connections, #20695 (comment)).

The backlog is now bounded at 10k events (sizing rationale in backlog.ts). On overflow the queue ends rather than silently dropping events: buffered events still flush, the response completes, finalizers unsubscribe the listener, and a live client reconnects and resyncs from the event log. Disconnection is the safe policy because sync events carry a total order (src/sync/README.md), so a reconnect recovers cleanly, whereas silently dropped events would corrupt a live client's view. The instance endpoint now filters before buffering, so other instances' events never occupy a subscriber's backlog, and the global endpoint registers its listener eagerly, matching the instance endpoint.
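The overflow policy described above can be illustrated with a minimal sketch. This is not the code in backlog.ts: the names BoundedBacklog, offer, poll, accept and onEnd are hypothetical, and the sketch calls onEnd as soon as the backlog overflows, which simplifies the finalizer ordering described above.

```typescript
// Hypothetical sketch of "end on overflow"; not the implementation in backlog.ts.
type Poll<T> =
  | { kind: "event"; value: T } // next buffered event
  | { kind: "empty" } // nothing buffered, stream still open
  | { kind: "ended" } // overflowed and fully flushed: complete the response

class BoundedBacklog<T> {
  private readonly items: T[] = []
  private readonly capacity: number
  private readonly accept: (event: T) => boolean
  private readonly onEnd: () => void
  private ended = false

  constructor(capacity: number, accept: (event: T) => boolean, onEnd: () => void) {
    this.capacity = capacity
    this.accept = accept // filter applied before buffering
    this.onEnd = onEnd // e.g. unsubscribe the listener
  }

  // Called from the event listener. Returns false when the event was not buffered.
  offer(event: T): boolean {
    if (this.ended || !this.accept(event)) return false
    if (this.items.length >= this.capacity) {
      // Overflow: end the stream instead of silently dropping this one event.
      this.ended = true
      this.onEnd()
      return false
    }
    this.items.push(event)
    return true
  }

  // Called by the response writer. Buffered events still flush after an overflow.
  poll(): Poll<T> {
    if (this.items.length > 0) return { kind: "event", value: this.items.shift() as T }
    return this.ended ? { kind: "ended" } : { kind: "empty" }
  }
}
```

In this sketch a writer would loop on poll(), send each "event", and complete the response on "ended". Because the filter runs inside offer(), events for other instances are rejected before they can count against the capacity.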

How did you verify your code works?

A new regression test holds an unconsumed connection and floods past capacity: against the previous handlers it times out because the stream never terminates, and with this change it passes. A companion test drains 12k events through the 10k backlog with batch pacing and stays connected, so the bound applies to the undrained backlog, not to throughput. The three existing event endpoint tests pass unchanged. Results: bun test test/server/httpapi-event.test.ts reports 5 pass, tsgo --noEmit is clean, oxlint reports 0 errors, and prettier has been applied.
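The difference between the two tests can be modelled with a small counting sketch. It is not the test code in test/server/httpapi-event.test.ts: the simulate function is hypothetical, it counts events instead of opening an SSE connection, and the batch size of 1,000 is an assumption (the description above only says "batch pacing").

```typescript
// Hypothetical model of the two scenarios; no real connection or event bus involved.
type Outcome = { ended: boolean; delivered: number }

function simulate(capacity: number, total: number, batch: number, drain: boolean): Outcome {
  let buffered = 0
  let delivered = 0
  for (let sent = 0; sent < total; sent++) {
    // A full backlog ends the stream rather than dropping the event.
    if (buffered >= capacity) return { ended: true, delivered }
    buffered += 1
    // A live consumer empties the backlog after every batch; a stalled one never does.
    if (drain && (sent + 1) % batch === 0) {
      delivered += buffered
      buffered = 0
    }
  }
  // A live consumer also receives whatever is left after the last full batch.
  if (drain) delivered += buffered
  return { ended: false, delivered }
}

// Stalled consumer: 12k events against a 10k backlog, nothing is ever read.
const stalled = simulate(10_000, 12_000, 1_000, false)
// Paced consumer: the same 12k events, drained every 1k (assumed batch size).
const paced = simulate(10_000, 12_000, 1_000, true)
console.log(stalled.ended, paced.ended, paced.delivered)
```

The stalled run ends once the undrained backlog reaches capacity, while the paced run sends more events than the capacity in total and is never disconnected, because its backlog never holds more than one batch at a time.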

Screenshots / recordings

Not a UI change.

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

Each event stream connection buffers events for SSE delivery in an
unbounded queue, and the only cleanup signal is the response fiber
ending. A TCP connection stuck in CLOSE_WAIT never surfaces an abort,
so a half-dead subscriber keeps its listener registered while its
queue absorbs the full event firehose without limit. This is the SSE
leak pattern reported in anomalyco#20695 (24.5GB RSS with zombie
CLOSE_WAIT connections).

The backlog is now bounded. On overflow the queue ends instead of
dropping events silently: buffered events still flush, the response
completes, finalizers unsubscribe the listener, and a live client
reconnects and resyncs from the event log (sync events carry a total
order). The instance endpoint also filters events before buffering so
other instances' events never occupy a subscriber's backlog, and the
global endpoint now registers its listener eagerly, matching the
instance endpoint.

The regression test holds an unconsumed connection and floods past
capacity: it times out on the previous handlers and passes now. A
companion test drains 12k events through the 10k backlog and stays
connected, proving the bound applies to the undrained backlog, never
to throughput.
github-actions bot added the needs:compliance (This means the issue will auto-close after 2 hours.) and needs:issue labels Jun 11, 2026
@github-actions

Contributor

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

  1. Open an issue describing the bug/feature (if one doesn't exist)
  2. Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

github-actions bot removed the needs:issue and needs:compliance (This means the issue will auto-close after 2 hours.) labels Jun 11, 2026
@github-actions

Contributor

Thanks for updating your PR! It now meets our contributing guidelines. 👍

Development

Successfully merging this pull request may close these issues.

Memory leak: SSE connections stuck in CLOSE_WAIT cause unbounded AsyncQueue growth (~14 MB/sec)
