feat: Expose prediction SSE streams by markphelps · Pull Request #3019 · replicate/cog

markphelps · 2026-05-14T16:32:56Z

Summary

Reuse POST /predictions and PUT /predictions/{id} for Server-Sent Events by returning an SSE stream when requests send Accept: text/event-stream.
Preserve existing non-SSE behavior: sync prediction requests still return the blocking JSON response, and Prefer: respond-async without SSE still returns 202 JSON.
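
As a rough illustration of the content negotiation described above, the sketch below maps the two request headers to a response mode. It is not the PR's code: `ResponseMode`, `header_has_token`, and `select_response_mode` are hypothetical names, and the `SyncSse` variant is an assumption added for symmetry (the review thread only names `SyncJson`, `AsyncJson`, and `AsyncSse`).

```rust
/// Hypothetical response modes. `SyncSse` is assumed; the other three names
/// appear in the review thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ResponseMode {
    SyncJson,
    AsyncJson,
    SyncSse,
    AsyncSse,
}

/// True if any comma-separated entry of `header` equals `token`,
/// ignoring case and parameters such as `;q=0.9`.
fn header_has_token(header: Option<&str>, token: &str) -> bool {
    header.map_or(false, |value| {
        value.split(',').any(|part| {
            let name = part.split(';').next().unwrap_or("");
            name.trim().eq_ignore_ascii_case(token)
        })
    })
}

/// Picks a response mode from the `Accept` and `Prefer` header values.
fn select_response_mode(accept: Option<&str>, prefer: Option<&str>) -> ResponseMode {
    let wants_sse = header_has_token(accept, "text/event-stream");
    let wants_async = header_has_token(prefer, "respond-async");
    match (wants_sse, wants_async) {
        (true, true) => ResponseMode::AsyncSse,
        (true, false) => ResponseMode::SyncSse,
        (false, true) => ResponseMode::AsyncJson,
        (false, false) => ResponseMode::SyncJson,
    }
}
```

With no headers the sketch falls through to `SyncJson`, which matches the stated goal of leaving existing non-SSE requests unchanged.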

CLI behavior

This PR does not add a cog predict --stream option or otherwise change the predict CLI.
Predictions made through the Cog CLI continue to default to the existing non-streaming behavior.

ask-bonk

I'm Bonk, and I've done a quick review of your PR.

This PR adds Server-Sent Events (SSE) streaming for predictions with event replay for late subscribers. The implementation looks solid overall with good test coverage. I found one potential resource concern that should be addressed:

Issues

1. Unbounded memory growth in stream_history (Medium)

The stream_history Vec in Prediction grows unbounded during prediction execution. For long-running predictions with frequent output chunks, logs, or metrics, this could cause memory exhaustion. Consider adding a limit to the history size or making it configurable.

ask-bonk · 2026-05-14T16:35:16Z

I've completed my review of PR #3019. The PR adds SSE streaming for predictions with event replay functionality. The implementation is well-structured and includes comprehensive tests.

Summary of findings:

I identified one medium-severity issue regarding unbounded memory growth in the stream_history Vec. For long-running predictions that generate many output chunks, logs, or metrics, this could lead to memory exhaustion. I've posted a review comment suggesting either adding a configurable limit to the history size or implementing a mechanism to drop old events when a threshold is exceeded.

The rest of the implementation looks solid:

Good use of tokio::sync::broadcast for the streaming channel
Proper handling of late subscribers via replay
Correct cancellation semantics for sync vs async predictions
Comprehensive test coverage including integration tests
Clean SSE stream implementation with proper event formatting

ask-bonk · 2026-05-14T17:02:45Z

Based on my review of PR #3019, I agree with the previous reviewer's finding about unbounded memory growth in stream_history. However, since that issue was already flagged, I have no additional actionable issues to report.

The implementation is well-structured:

Clean SSE stream implementation with proper event replay for late subscribers
Correct handling of sync vs async prediction cancellation semantics
Comprehensive test coverage including integration tests
Good documentation with working examples
Proper error handling for broadcast channel lag cases

LGTM

ask-bonk · 2026-05-14T19:40:29Z

Based on my thorough review of PR #3019, I have analyzed all the changed files:

Summary of changes:
This PR adds Server-Sent Events (SSE) streaming for predictions with event replay for late subscribers. It exposes GET /predictions/{id}/stream as an SSE endpoint that streams lifecycle events (start, output, log, metric, completed) in real-time.

Previous findings:
The unbounded memory growth issue in stream_history was already flagged by the previous reviewer. This is a legitimate concern where long-running predictions with frequent output chunks could cause memory exhaustion.

My independent assessment:
I've verified the following aspects of the implementation:

Broadcast channel - Uses a 1024-slot channel with proper lag handling (logs warning and continues)
Stream cancellation semantics - Correctly distinguishes between sync (cancel on disconnect) and async (no cancel) predictions via PredictionStreamGuard::drop()
Drop order - Verified that field drop order in StreamState ensures the receiver is still alive when the guard's Drop runs
Event replay - Properly replays history for late subscribers before subscribing to live events
Keep-alive - SSE stream includes 15-second keep-alive messages
Integration test - Comprehensive test covering the full SSE flow
Example - Working example with Hugging Face transformers model

No additional actionable issues beyond what was already flagged.

LGTM

ask-bonk · 2026-05-14T20:00:18Z

After reviewing PR #3019, I've analyzed the changes to add Server-Sent Events (SSE) streaming for predictions. The implementation reuses POST /predictions and PUT /predictions/{id} with content negotiation via Accept: text/event-stream, removing the separate GET /predictions/{id}/stream route.

Key aspects reviewed:

Clean SSE stream implementation with 15s keep-alive and proper event formatting
Correct cancellation semantics distinguishing sync vs async predictions via PredictionStreamGuard
Event replay for late subscribers using stream_history Vec
Comprehensive test coverage including unit tests and integration tests
Good documentation with working examples
Preserved backward compatibility for non-SSE requests

Previously flagged issue:
The unbounded memory growth in stream_history Vec was already identified by a previous reviewer. This is a legitimate concern for long-running predictions with frequent events.

No additional actionable issues found beyond what was already flagged.

LGTM

ask-bonk · 2026-05-14T20:05:30Z

Based on my thorough review of PR #3019, I can see this PR adds Server-Sent Events (SSE) streaming for predictions by reusing POST /predictions and PUT /predictions/{id} with content negotiation via Accept: text/event-stream.

Key changes reviewed:

prediction.rs: Bounded stream history (1024 events) with proper event replay for late subscribers
service.rs: PredictionStreamGuard correctly distinguishes sync vs async prediction cancellation semantics
routes.rs: Clean SSE stream implementation with 15s keep-alive, content negotiation, removed separate /stream route
Tests: Comprehensive unit tests and integration test coverage

Previously flagged issue (addressed):
The unbounded memory growth concern in stream_history has been resolved with MAX_STREAM_HISTORY_EVENTS: usize = 1024 (line 13), and the emit_stream_event method properly removes old events when the limit is reached (lines 195-201).

Additional verification:

Terminal state guards prevent overwriting terminal states
Drop order in StreamState ensures receiver is alive when guard's Drop runs
Sync predictions cancel on disconnect; async predictions do not
Backward compatibility preserved for non-SSE requests

LGTM

ask-bonk · 2026-05-14T21:00:53Z

Based on my review of PR #3019, I can confirm that the previously flagged issue regarding unbounded memory growth in stream_history has been addressed:

Verified fixes:

MAX_STREAM_HISTORY_EVENTS: usize = 1024 is defined at line 13 in prediction.rs
The emit_stream_event method (lines 199-206) correctly removes old events when the limit is reached and tracks skipped events

Additional verification:

Bounded replay test exists at line 727 in prediction.rs
Terminal state guards prevent overwriting terminal states (lines 230-231, 249-250, 262-263)
PredictionStreamGuard::drop() correctly distinguishes sync vs async cancellation (lines 141-153 in service.rs)
Proper drop order in StreamState ensures receiver stays alive during guard drop

The implementation is well-structured with comprehensive test coverage including unit tests for lagged streams, truncated replay, and integration tests.

LGTM

michaeldwan

Solid design. Main point discussed offline: streaming needs to be opt-in at the handler level via a decorator/attribute. Five concrete changes in the comments.

michaeldwan · 2026-05-15T20:43:29Z

+
+    fn emit_stream_event(&mut self, event: PredictionStreamEvent) {
+        if self.stream_history.len() == MAX_STREAM_HISTORY_EVENTS {
+            self.stream_history.remove(0);


Vec::remove(0) is O(n) on every event after the buffer fills -- thousands of shifts per prediction for token-by-token LLM output. VecDeque::pop_front() is O(1).
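
A minimal sketch of the bounded-history idea, using only the standard library. `BoundedHistory` and its fields are hypothetical names and not the PR's types; the point is only that `VecDeque::pop_front()` evicts the oldest event without shifting the rest.

```rust
use std::collections::VecDeque;

/// Fixed-capacity replay buffer; the oldest event is evicted once full.
struct BoundedHistory<T> {
    events: VecDeque<T>,
    capacity: usize,
    /// Number of evicted events, so a late subscriber can tell the replay was truncated.
    skipped: u64,
}

impl<T> BoundedHistory<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self {
            events: VecDeque::with_capacity(capacity),
            capacity,
            skipped: 0,
        }
    }

    fn push(&mut self, event: T) {
        if self.events.len() == self.capacity {
            // O(1): only the head index moves. Vec::remove(0) would shift
            // every remaining element on each eviction.
            self.events.pop_front();
            self.skipped += 1;
        }
        self.events.push_back(event);
    }
}
```

Pushing five events into a buffer of capacity three leaves the last three in order and a `skipped` count of two.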

michaeldwan · 2026-05-15T20:43:29Z

+            prediction_id.clone(),
+            input.clone(),
+            webhook_sender,
+            response_mode != PredictionResponseMode::AsyncJson,


response_mode != PredictionResponseMode::AsyncJson is true for SyncJson too. Should be response_mode == PredictionResponseMode::AsyncSse -- the current expression is inert for sync mode but reads as a bug.
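
To make the difference between the two expressions concrete, here is a small sketch that evaluates both for each mode. The enum below lists only the three variants named in this thread; the real `PredictionResponseMode` may have more, and the function names are hypothetical.

```rust
/// Only the variants named in the review thread; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PredictionResponseMode {
    SyncJson,
    AsyncJson,
    AsyncSse,
}

/// The expression as written in the diff.
fn flag_as_written(mode: PredictionResponseMode) -> bool {
    mode != PredictionResponseMode::AsyncJson
}

/// The expression the reviewer suggests.
fn flag_as_suggested(mode: PredictionResponseMode) -> bool {
    mode == PredictionResponseMode::AsyncSse
}
```

The two agree for `AsyncJson` and `AsyncSse` and differ only for `SyncJson`, where the written form yields `true` and the suggested form yields `false`.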

michaeldwan · 2026-05-15T20:43:29Z


 impl Prediction {
    pub fn new(id: String, webhook: Option<WebhookSender>) -> Self {
+        let (stream_tx, _) = tokio::sync::broadcast::channel(1024);


Channel capacity and history cap are both 1024 but defined independently. Use a shared constant so they can't drift.

michaeldwan · 2026-05-15T20:43:29Z

+            return;
+        }
+
+        if self.service.stream_receiver_count(&self.id) == 0


The async cleanup task calls remove_prediction while the SSE client may still be draining. After removal, stream_receiver_count returns unwrap_or(0) and prediction_is_terminal returns unwrap_or(true) -- which happen to be the safe defaults. Either document that invariant in a comment, or hold an Arc to the entry in the guard so the check doesn't depend on a post-removal lookup.
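
The fallback behavior described above can be sketched with a plain map lookup. `Service`, `Entry`, and `should_cancel_on_drop` are simplified stand-ins, and the combined condition is an assumption about how the guard uses the two lookups (the diff only shows the first half of the check).

```rust
use std::collections::HashMap;

/// Simplified stand-in for a service entry.
struct Entry {
    receivers: usize,
    terminal: bool,
}

struct Service {
    predictions: HashMap<String, Entry>,
}

impl Service {
    /// A missing entry reports zero receivers.
    fn stream_receiver_count(&self, id: &str) -> usize {
        self.predictions.get(id).map(|e| e.receivers).unwrap_or(0)
    }

    /// A missing entry reports terminal, so an already-cleaned prediction
    /// is never treated as still running.
    fn prediction_is_terminal(&self, id: &str) -> bool {
        self.predictions.get(id).map(|e| e.terminal).unwrap_or(true)
    }

    /// Assumed guard condition: cancel only when nobody is listening and
    /// the prediction has not reached a terminal state.
    fn should_cancel_on_drop(&self, id: &str) -> bool {
        self.stream_receiver_count(id) == 0 && !self.prediction_is_terminal(id)
    }
}
```

Once the entry is removed, the `unwrap_or(true)` default alone makes the condition false, which is why the invariant is worth documenting next to the defaults.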

michaeldwan · 2026-05-15T20:43:29Z

+cog serve --upload-url http://unused/
+
+curl -H Accept:text/event-stream PUT /predictions/sse-stream-test '{"id":"sse-stream-test","input":{}}'
+stdout 'event: output'


Missing stdout 'event: start' -- should verify the full lifecycle.
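
For context on what the `stdout 'event: ...'` assertions match, the sketch below formats events in the standard Server-Sent Events wire format. `format_sse_event` is a hypothetical helper and the `{}` payloads are placeholders, not the payloads this PR emits.

```rust
/// Formats one Server-Sent Event: an `event:` field, one `data:` field per
/// line of `data`, then a blank line that terminates the event.
fn format_sse_event(event: &str, data: &str) -> String {
    let mut out = format!("event: {event}\n");
    for line in data.split('\n') {
        out.push_str("data: ");
        out.push_str(line);
        out.push('\n');
    }
    out.push('\n');
    out
}
```

A full-lifecycle test would then look for `event: start`, `event: output`, and `event: completed` in the concatenated stream, in that order.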

ask-bonk

I'm Bonk, and I've done a quick review of your PR.

This PR adds Server-Sent Events (SSE) streaming for predictions with event replay for late subscribers. The implementation is well-structured overall.

Summary

I've verified the issues flagged by previous reviewers:

Vec::remove(0) O(n) issue - Fixed: Now uses VecDeque::pop_front() which is O(1)
Unbounded memory growth - Fixed: STREAM_EVENT_BUFFER_CAPACITY (1024) limits both broadcast channel and history
Integration test coverage - The test at integration-tests/tests/sse_streaming_output.txtar includes stdout 'event: start' as suggested

The implementation correctly:

Distinguishes sync vs async cancellation semantics via PredictionStreamGuard
Handles late subscribers with bounded replay (1024 events)
Includes 15s keep-alive for SSE streams
Has comprehensive test coverage including lagged stream and truncated replay scenarios

LGTM - all previously flagged issues have been addressed.

ask-bonk · 2026-05-18T16:07:16Z

Posted review for PR #3019. All previously flagged issues have been addressed:

O(n) removal fixed - Uses VecDeque::pop_front() instead of Vec::remove(0)
Memory bounded - STREAM_EVENT_BUFFER_CAPACITY (1024) limits both channel and history
Integration test complete - Includes stdout 'event: start' verification
Constant shared - Channel capacity and history cap use the same constant
Race handled - Guard correctly handles missing entries during cleanup

ask-bonk · 2026-05-18T19:32:50Z

Based on my review of PR #3019, I've verified that all previously flagged issues have been addressed:

Verified fixes:

O(n) removal - Uses VecDeque::pop_front() instead of Vec::remove(0) (prediction.rs:201)
Memory bounded - STREAM_EVENT_BUFFER_CAPACITY (1024) limits both channel and history (prediction.rs:13,155,200)
Shared constant - Channel capacity and history cap use the same constant
Race handled - Guard correctly handles missing entries during cleanup with documented safe defaults (service.rs:147-154)
Integration test - Includes stdout 'event: start' verification (sse_streaming_output.txtar:8)

Implementation highlights:

Clean SSE stream implementation with content negotiation via Accept: text/event-stream
Proper cancellation semantics distinguishing sync vs async predictions via PredictionStreamGuard
Event replay for late subscribers with bounded history (1024 events)
15s keep-alive for SSE streams
Comprehensive test coverage including lagged stream and truncated replay scenarios

LGTM

ask-bonk · 2026-05-18T20:06:32Z

LGTM

ask-bonk · 2026-05-18T20:21:30Z

Based on my thorough review of PR #3019, I've verified that all previously flagged issues have been addressed:

Verified fixes:

O(n) removal - Uses VecDeque::pop_front() (prediction.rs:201) instead of Vec::remove(0)
Memory bounded - STREAM_EVENT_BUFFER_CAPACITY (1024) limits both channel and history (prediction.rs:13,155,200)
Shared constant - Channel capacity and history cap use the same constant
Race handled - Guard correctly handles missing entries during cleanup with documented safe defaults (service.rs:147-154)
Integration test - Includes stdout 'event: start' verification (sse_streaming_output.txtar:8)

Implementation highlights:

Clean SSE stream implementation with content negotiation via Accept: text/event-stream
Proper cancellation semantics distinguishing sync vs async predictions via PredictionStreamGuard
Event replay for late subscribers with bounded history (1024 events)
15s keep-alive for SSE streams
Comprehensive test coverage including lagged stream and truncated replay scenarios

LGTM

markphelps · 2026-05-18T20:26:00Z

@michaeldwan decorator added in 8c9c982

michaeldwan

lgtm

michaeldwan

Summary

Solid design. The broadcast channel + bounded replay architecture is sound, the opt-in gating via @cog.streaming is the right call, and the integration between Go parser → OpenAPI extension → coglet runtime is clean. The replay/live transition has no race (mutex held atomically), the bounded history buffer has no off-by-one, and completed events are emitted on all terminal paths.

Two blockers, several should-fixes. Details below and inline.

Blockers

examples/streaming-text/predict.py doesn't use @cog.streaming -- the README tells users to curl with Accept: text/event-stream, which will 406. The integration test sse_streaming_output.txtar correctly uses @streaming, but the shipped example doesn't. (See inline comment.)
No test for SSE events during failed or cancelled predictions. set_failed() and set_canceled() both emit Completed events, but no test verifies the SSE stream delivers event: completed with "status":"failed" / "status":"canceled". These are critical user-facing paths.

Should-fix

decoratorIsCogStreaming hard-codes "cog.streaming" instead of resolving through ImportContext. import cog as c → @c.streaming won't be detected. The rest of the parser handles aliases. (See inline comment.)
@cog.streaming() with parens silently degrades. Parser rejects call form, but Python decorator works either way. User gets a working model where SSE returns 406 with no hint about the parens.
No limit on SSE subscriber count per prediction. subscribe_prediction_stream() creates a new broadcast receiver with no cap. Repeated SSE connections to the same prediction ID amplify memory pressure.
Orphaned pending_cancellations leak memory. Cancel messages arriving after a prediction completes get stored in the HashSet and never cleaned up. (See inline comment.)
Double-clone on every stream event. emit_stream_event clones serde_json::Value for history storage. Arc<PredictionStreamEvent> would eliminate deep clones -- history and broadcast share the same allocation. Also fixes the O(n) deep-clone in subscribe_stream_replay() under the mutex. (See inline comment.)
PredictionStreamGuard::Drop calls tokio::spawn via cancel(). If dropped outside a tokio runtime context, tokio::spawn panics. Use Handle::try_current().
No test for concurrent SSE subscribers. The guard checks stream_receiver_count() == 0 before cancelling, but no test verifies dropping one of two subscribers doesn't cancel.
Training endpoints silently ignore Accept: text/event-stream. Returning 406 or documenting would be more honest than silent fallback to JSON.

Nits

RegisterPredictionMessage 4-element tuple → named struct.
streaming listed under # Metrics in __all__ -- it's a decorator.
Module-level F TypeVar → _F to signal internal.
replay.into() creates unnecessary VecDeque from Vec.
id.to_string() allocated twice in subscribe_prediction_stream.
Missing require.NotNil guard before type assertions in streaming OpenAPI tests.
Broadcast channel capacity and history buffer both use the same 1024 constant by coincidence -- give them separate named constants.

Verified correct

Replay + live transition (no race -- mutex held atomically during subscribe + snapshot)
Bounded history buffer (no off-by-one)
completed event emitted on all terminal paths
Terminal state guards prevent double-completion
findTargetFunction returning decorated_definition handled correctly via UnwrapFunction
cog predict CLI works fine with streaming models (uses sync JSON path)
docs/python.md and docs/llms.txt are accurate and in sync

michaeldwan · 2026-05-19T23:32:11Z

+        ),
+    ) -> Iterator[str]:
+        messages = [{"role": "user", "content": prompt}]
+        text = self.tokenizer.apply_chat_template(


Blocker: This method is missing @cog.streaming. The README (line 28-31) tells users to curl with Accept: text/event-stream, which will return 406 since the model doesn't opt in.

The integration test sse_streaming_output.txtar correctly uses @streaming, but this shipped example doesn't.

    from cog import BasePredictor, Input, streaming

    class Predictor(BasePredictor):
        # ...

        @streaming
        def predict(self, ...) -> Iterator[str]:

michaeldwan · 2026-05-19T23:32:11Z

+            self.stream_history.pop_front();
+            self.stream_history_skipped += 1;
+        }
+        self.stream_history.push_back(event.clone());


Should-fix: This clones the event (containing serde_json::Value) for history, then moves the original into broadcast::send. For high-throughput models yielding many chunks, this deep-clones arbitrarily large JSON on every output.

Consider Arc<PredictionStreamEvent> for the broadcast channel type -- history and broadcast share the same allocation, and subscribe_stream_replay() becomes 1024 atomic increments instead of 1024 deep JSON clones under the prediction mutex.

    stream_tx: broadcast::Sender<Arc<PredictionStreamEvent>>,
    stream_history: VecDeque<Arc<PredictionStreamEvent>>,

    fn emit_stream_event(&mut self, event: PredictionStreamEvent) {
        // ...
        let event = Arc::new(event);
        self.stream_history.push_back(Arc::clone(&event));
        let _ = self.stream_tx.send(event);
    }
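
The sharing effect of the `Arc` suggestion can be checked with the standard library alone. In this sketch `StreamEvent` and `History` are hypothetical stand-ins, and a `String` payload takes the place of `serde_json::Value`.

```rust
use std::collections::VecDeque;
use std::sync::Arc;

/// Stand-in event type; the payload would be JSON in the real code.
#[derive(Debug)]
struct StreamEvent {
    payload: String,
}

struct History {
    events: VecDeque<Arc<StreamEvent>>,
}

impl History {
    /// Stores one handle in history and returns another for the live channel.
    /// Both point at the same allocation.
    fn record(&mut self, event: StreamEvent) -> Arc<StreamEvent> {
        let shared = Arc::new(event);
        self.events.push_back(Arc::clone(&shared));
        shared
    }

    /// Replay snapshot: clones pointers (reference-count increments),
    /// not payloads.
    fn replay(&self) -> Vec<Arc<StreamEvent>> {
        self.events.iter().cloned().collect()
    }
}
```

`Arc::ptr_eq` on the live handle and the replayed handle returns `true`, confirming that no payload copy was made on either path.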

michaeldwan · 2026-05-19T23:32:11Z

                    None => {
-                        tracing::debug!(%prediction_id, "Cancel requested for unknown prediction (may have already completed)");
+                        tracing::debug!(%prediction_id, "Cancel requested for unknown prediction; storing pending cancellation");
+                        pending_cancellations.insert(prediction_id);


Should-fix: If the cancel arrives after the prediction has already completed and been removed from predictions, the ID is stored here and never consumed. In a long-running server with many cancelled predictions, this is an unbounded leak.

Consider adding a size cap (e.g., 1000 entries) or a TTL, and log a warning when it's exceeded.
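
One way the size cap could look, as a sketch only: a set for membership plus a queue for first-in, first-out eviction. `PendingCancellations` is a hypothetical type, eviction of the oldest entry is a design assumption (the comment does not say which entry to drop), and logging the warning is left to the caller.

```rust
use std::collections::{HashSet, VecDeque};

/// Capped set of prediction IDs whose cancel arrived before registration.
struct PendingCancellations {
    ids: HashSet<String>,
    /// Insertion order, used to evict the oldest entry first.
    order: VecDeque<String>,
    capacity: usize,
}

impl PendingCancellations {
    fn new(capacity: usize) -> Self {
        Self {
            ids: HashSet::new(),
            order: VecDeque::new(),
            capacity,
        }
    }

    /// Records a pending cancel. Returns the evicted ID when the cap is
    /// exceeded, so the caller can log a warning.
    fn insert(&mut self, id: String) -> Option<String> {
        if !self.ids.insert(id.clone()) {
            return None; // already pending
        }
        self.order.push_back(id);
        if self.order.len() > self.capacity {
            let evicted = self.order.pop_front()?;
            self.ids.remove(&evicted);
            return Some(evicted);
        }
        None
    }

    /// Consumes a pending cancel when the prediction registers.
    fn take(&mut self, id: &str) -> bool {
        if self.ids.remove(id) {
            self.order.retain(|queued| queued.as_str() != id);
            true
        } else {
            false
        }
    }
}
```

A TTL, the other option mentioned above, would store an `Instant` alongside each ID and prune expired entries on insert instead of evicting by count.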

michaeldwan · 2026-05-19T23:32:11Z

+func decoratorIsCogStreaming(node *sitter.Node, source []byte, imports *schema.ImportContext) bool {
+	for _, child := range NamedChildren(node) {
+		switch child.Type() {
+		case "attribute":


Should-fix: Hard-coded string match. import cog as c then @c.streaming won't be detected -- the content will be "c.streaming", not "cog.streaming".

The rest of the parser resolves aliases through ImportContext (e.g., IsBaseModel, IsOpaque). This should do the same:

    case "attribute":
    	text := Content(child, source)
    	parts := strings.SplitN(text, ".", 2)
    	if len(parts) != 2 || parts[1] != "streaming" {
    		return false
    	}
    	entry, ok := imports.Names.Get(parts[0])
    	return ok && entry.Module == "cog" && entry.Original == "cog"

michaeldwan · 2026-05-19T23:32:11Z

+			entry, ok := imports.Names.Get("streaming")
+			return ok && entry.Module == "cog" && entry.Original == "streaming"
+		case "call":
+			return false


Should-fix: This rejects @cog.streaming() (call form), but the Python decorator works fine with either @streaming or @streaming(). A user who writes @cog.streaming() gets a model that builds, runs, and yields output -- but SSE returns 406 with no hint about the parentheses.

Either support the call form here (check if the callee is cog.streaming or imported streaming), or make the Python decorator raise a clear error when called with parens. The current behavior is a silent gotcha.

michaeldwan · 2026-05-19T23:32:11Z

+        self: &Arc<Self>,
+        id: &str,
+    ) -> Option<PredictionStreamSubscription> {
+        let entry = self.predictions.get(id)?;


Should-fix: No cap on subscriber count. The idempotent PUT endpoint allows repeated SSE connections to the same prediction ID, each creating a new broadcast receiver. An attacker opening many connections forces the sender to retain events for slow consumers.

Consider:

    let pred = entry.prediction.lock().ok()?;
    if pred.stream_receiver_count() >= MAX_STREAM_SUBSCRIBERS {
        return None;
    }
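
A standalone sketch of a subscriber cap, assuming the count is tracked separately from the channel. `SubscriberLimit` and `Subscription` are hypothetical; in the PR the count would more likely come from the broadcast sender itself, as the snippet above suggests.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Handle held by one subscriber; releases its slot when dropped.
struct Subscription {
    active: Arc<AtomicUsize>,
}

impl Drop for Subscription {
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::AcqRel);
    }
}

struct SubscriberLimit {
    active: Arc<AtomicUsize>,
    max: usize,
}

impl SubscriberLimit {
    /// Returns `None` once `max` subscriptions are active.
    fn try_subscribe(&self) -> Option<Subscription> {
        // fetch_update retries the compare-and-swap, so concurrent callers
        // cannot push the count past `max`.
        self.active
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < self.max {
                    Some(n + 1)
                } else {
                    None
                }
            })
            .ok()?;
        Some(Subscription {
            active: Arc::clone(&self.active),
        })
    }
}
```

Tying the decrement to `Drop` means a disconnected client frees its slot without any explicit cleanup call.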

michaeldwan · 2026-05-19T23:32:11Z

+        // Prediction cleanup may remove the service entry before the SSE response
+        // finishes draining. Missing entries deliberately report zero receivers and
+        // terminal state so this guard cannot cancel an already-cleaned prediction.
+        if self.service.stream_receiver_count(&self.id) == 0


Should-fix: cancel() internally calls tokio::spawn. If this guard is dropped outside a tokio runtime context (panic unwinding, runtime shutdown), tokio::spawn panics.

Safer:

    if let Ok(handle) = tokio::runtime::Handle::try_current() {
        let service = Arc::clone(&self.service);
        let id = self.id.clone();
        handle.spawn(async move {
            service.cancel(&id);
        });
    }

Or refactor cancel() itself to use Handle::try_current() before spawning.

michaeldwan · 2026-05-19T23:32:11Z

    "CancelationException",
    # Metrics
    "current_scope",
+    "streaming",


Nit: streaming is a decorator, not a metric. Move it to its own # Decorators section or under # Core classes.

michaeldwan · 2026-05-19T23:32:11Z

    URLPath,
 )

+F = TypeVar("F", bound=Callable[..., object])


Nit: This is importable as from cog import F. Prefix with underscore (_F) to signal it's internal.

michaeldwan · 2026-05-19T23:32:11Z

 }

+type RegisterPredictionMessage = (
+    SlotId,


Nit: A 4-element tuple type alias is opaque -- the ack channel's purpose is invisible at usage sites. A named struct would be clearer:

    struct RegisterPredictionMessage {
        slot_id: SlotId,
        prediction: Arc<StdMutex<Prediction>>,
        idle_sender: oneshot::Sender<SlotIdleToken>,
        registered_ack: oneshot::Sender<()>,
    }

ask-bonk Bot reviewed May 14, 2026

Comment thread crates/coglet/src/prediction.rs Outdated

Base automatically changed from cog-streaming-1 to main May 14, 2026 19:32

markphelps added 2 commits May 14, 2026 15:36

feat: expose prediction SSE streams

2b0b483

docs: add streaming text example

c61e323

markphelps force-pushed the cog-streaming-2-sse branch from cfa865e to c61e323 Compare May 14, 2026 19:38

feat: stream predictions via accept header

bf064d2

fix: bound prediction stream replay history

308ecff

fix: harden SSE prediction streaming

8acf5f9

markphelps marked this pull request as ready for review May 14, 2026 21:37

markphelps requested a review from a team as a code owner May 14, 2026 21:37

markphelps requested a review from michaeldwan May 14, 2026 21:37

markphelps changed the title from "Expose prediction SSE streams" to "feat: Expose prediction SSE streams" May 14, 2026

michaeldwan requested changes May 15, 2026

michaeldwan reviewed May 15, 2026

fix: address SSE review feedback

ff3140e

ask-bonk Bot reviewed May 18, 2026

Merge branch 'main' into cog-streaming-2-sse

562917d

feat: make prediction streaming opt-in

8c9c982

fix: match iterator CLI output in SSE opt-in test

38665ac

markphelps requested a review from michaeldwan May 18, 2026 20:26

michaeldwan approved these changes May 19, 2026

michaeldwan self-requested a review May 19, 2026 20:21

michaeldwan requested changes May 19, 2026
