gcs: keep bridge alive across live-migration transport swap #2771

Open
rawahars wants to merge 1 commit into microsoft:main from rawahars:lm_gcs_bridge

Conversation

@rawahars

Contributor

Add SetMigrating / ResumeOnConn on the bridge (plumbed through GuestConnection and Guest) so callers can park the recv/send loops during a UVM migration blackout and swap in the new hvsock without dropping in-flight RPCs. CreateConnection gains a coldStart bool so the migration destination skips the fresh-boot handshake.

Drive-bys: shim Stop honours caller ctx, Capabilities is nil-safe, ErrGuestConnectionUnavailable is exported, add session-id/action log fields.


Signed-off-by: Harsh Rawat <harshrawat@microsoft.com>
@rawahars rawahars requested a review from a team as a code owner June 11, 2026 19:56
Comment thread internal/gcs/bridge.go
// SetMigrating toggles tolerance of transport-level failures around a
// live-migration blackout. Explicit [bridge.Close] and the RPC timeout
// kill still tear the bridge down.
func (brdg *bridge) SetMigrating(migrating bool) {

Contributor

This seems fine, but I don't get why it's necessary. Why would we want transport-level tolerance only when migrating? I get that this is a local loopback connection, so in practice it likely never disconnects, but doesn't it seem reasonable to implement the bridge so that it auto-pauses on disconnect and continues on reconnect? No policy needed?

Contributor Author

We don't want to do that under normal circumstances. Our shim depends on the invariant that a bridge collapse is a fatal error: all the Waits are released, and the workflow then goes into teardown mode.

Only during migration do we suppress that behavior, so that in case of a restore on rollback, we can resume over a fresh socket connection.

2 participants