
[Draft] container-reboot-v2: shim-driven transparent reboot prototype#2710

Draft
pbozzay wants to merge 12 commits into microsoft:main from pbozzay:user/pbozza/hcsshim_reboot_v2

Conversation


@pbozzay pbozzay commented Apr 27, 2026

Status: Draft for design discussion. Not yet ready for review.

What this is

Prototype hcsshim-side changes for transparent reboot of Windows process-isolated (Argon) containers — paired with HCS-side changes on `user/pbozza/container_reboot_v2` in `microsoft/os.2020`.

When a user runs `shutdown /r /t 0` inside a Windows Server container, the goal is for docker / containerd to see the container as continuously running while HCS internally tears down and recreates the silo. The COW overlay and HNS endpoint persist; the container's PID 1 changes (it's a fresh kernel, fresh init).

Design discussion

A higher-level architecture document is in progress (not in this PR). TL;DR: the prototype is the shim-driven half of a hybrid approach in which:

  • HCS emits a `Reboot` notification via the existing `SystemExited` callback (existing `Feature_HcsSiloReboot` plumbing, finished in the matching OS PR)
  • hcsshim (this PR) catches the notification, recreates the compute system on the same container ID via a cached create document, spawns a fresh init, swaps the task state in place, suppresses `/tasks/exit`
  • containerd / docker see no exit event; `docker ps` reports the container continuously up

What's in this PR

12 commits, ~500 LOC net. Highlights by area:

Core plumbing — extending the existing notification path with payload data so the SystemExitStatus JSON survives to the shim:

  • `internal/hcs/callback.go`: notification payload now carries `(err, data)` instead of just `err`
  • `internal/hcs/system.go`: parse SystemExitStatus, cache `ExitType` on `*hcs.System`, expose via `ExitType()` method on `cow.Container`
  • `internal/hcs/exitstatus.go`: parser + tests
  • `internal/cow/cow.go` + LCOW/job container implementations: extend `cow.Container` interface
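The payload retyping at the heart of this plumbing can be sketched as follows. Field names follow the PR description; the exact shape in `internal/hcs/callback.go` may differ, so treat this as illustrative:

```go
package main

import "fmt"

// notificationPayload mirrors the retyped channel element described above:
// the Win32 callback's notification data travels alongside the error
// instead of being silently discarded. Field names are from the commit
// messages; everything else here is a sketch.
type notificationPayload struct {
	err  error
	data string // SystemExitStatus JSON; "" for non-Exited notifications
}

func main() {
	// Previously this channel was `chan error`; retyping it lets the
	// SystemExitStatus JSON survive to the shim-side reader.
	ch := make(chan notificationPayload, 1)

	ch <- notificationPayload{err: nil, data: `{"Status":0,"ExitType":"Reboot"}`}

	p := <-ch
	fmt.Printf("err=%v data=%s\n", p.err, p.data)
}
```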

Stage 4 transparent restart in `cmd/containerd-shim-runhcs-v1/`:

  • Cache the `hcsDocument` on `*hcs.System` at create time so the recreate can reissue identically
  • `hcsTask::waitInitExit` detects `ExitType=Reboot`, calls `doHandleReboot`
  • `doHandleReboot` closes the old System, creates a new one on the same ID, spawns the original init via `cmd.Cmd`, swaps state in place, respawns waiters for the next reboot cycle
  • Suppresses `/tasks/exit` and `ht.close` on the success path

Dev-guard scaffolding at `internal/devguard/` — registry-key gating so the new behaviors can be opt-in. Five guards:

  • `ForceStopForRestart`, `ExposeRebootNotification`, `PassExitStatusJson` — gate the matching HCS-side behaviors
  • `SkipInternalRebootStart` — tells HCS not to attempt an internal restart, leaving the slot free for the shim
  • `EnableShimRebootHandler` — gates the shim-side `doHandleReboot` path
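The gating pattern itself is simple. Since the real backend (`HKLM\Software\Microsoft\HCS\Dev\Reboot` DWORDs, per the commit message below) is Windows-only, this sketch swaps in an in-memory store to show the two properties that matter: every check is a fresh read, and any lookup failure defaults to disabled. The `guardStore` interface and `memStore` type are illustration only, not hcsshim code:

```go
package main

import "fmt"

// Guard names from the PR; the values behind them are DWORDs in the real
// registry-backed implementation.
const (
	EnableShimRebootHandler = "EnableShimRebootHandler"
	SkipInternalRebootStart = "SkipInternalRebootStart"
)

// guardStore abstracts the registry read so the pattern is portable.
type guardStore interface {
	// read returns the value and whether the lookup succeeded.
	read(name string) (uint32, bool)
}

type memStore map[string]uint32

func (m memStore) read(name string) (uint32, bool) {
	v, ok := m[name]
	return v, ok
}

// IsEnabled treats a missing value — like a missing key, wrong type, or
// access-denied error in the registry-backed version — as "disabled".
// No caching: flips take effect on the next check.
func IsEnabled(s guardStore, name string) bool {
	v, ok := s.read(name)
	return ok && v != 0
}

func main() {
	s := memStore{EnableShimRebootHandler: 1, SkipInternalRebootStart: 0}
	fmt.Println(IsEnabled(s, EnableShimRebootHandler)) // true
	fmt.Println(IsEnabled(s, SkipInternalRebootStart)) // false (zero value)
	fmt.Println(IsEnabled(s, "NoSuchGuard"))           // false (missing)
}
```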

Known limitations

  • Stdio doesn't survive the reboot today. New init runs with nil stdio (the "headless" fallback). `docker logs` only shows pre-reboot output. Interactive `docker run -it` sessions drop when the silo dies. Workaround: `docker exec -it` to reconnect, which gets fresh pipes per-invocation.
  • `docker inspect` reports the original PID because containerd caches it from the `TaskCreate` event and we don't have a mechanism to update it. Cosmetic; doesn't affect functionality.
  • WindowsContainerOrchestrator only for now. LCOW / HyperV-isolated / job containers aren't covered.
  • Dev-guard sprawl: 5 guards is too many for ship; should consolidate to one (`Feature_HcsSiloReboot`).

Validated end-to-end

Works on the test VM with the matching HCS-side changes:

  • `docker run -d ... cmd /c "ping -n 999999 127.0.0.1"` as init
  • `docker exec` to write a marker file
  • `docker exec -d ... shutdown /r /t 0` to trigger reboot
  • After ~25s: `docker ps` still shows Up
  • `docker exec` shows the marker file persisted (overlay survived)
  • `docker exec ... tasklist` shows different PIDs (kernel really rebooted)
  • `docker stop` and `docker rm` work cleanly

Not for review yet

Pushing now to make the changes visible alongside the design conversation. Cleanup, dev-guard consolidation, test coverage, and (most importantly) HCS-side stdio preservation would all happen before this is review-ready.

Paul Bozzay added 12 commits April 20, 2026 00:57
Task 1.9 of the container-reboot-v2 plan. Adds internal/devguard package
that reads HKLM\Software\Microsoft\HCS\Dev\Reboot\<Name> DWORDs at runtime,
mirroring the HcsDev::Reboot::* accessors on the HCS C++ side. Five named
guard constants exported (ForceStopForRestart, ExposeRebootNotification,
PassExitStatusJson, SkipInternalRebootStart, EnableShimRebootHandler).

IsEnabled() opens the registry key, reads the DWORD, closes. No caching;
every call is a fresh read so reg flips take effect on the next event.
Missing key, missing value, wrong type, or access-denied all return false.

Three TDD unit tests cover missing key, zero value, and non-zero value.
…ders)

Task 1.10 of the container-reboot-v2 plan. Adds OpenCensus span attributes
along the reboot observation path:

- internal/hcs/system.go::waitBackground — reboot.exit_type (string, empty)
  and reboot.notification_data_bytes (int64, 0). Populated by Stage 2 once
  notificationWatcher parses SystemExitStatus JSON.

- cmd/containerd-shim-runhcs-v1/exec_hcs.go::waitForContainerExit —
  reboot.pending (bool, false). Flipped by Stage 4 when the shim observes
  a Reboot exit_type and sets hcsExec.rebootPending instead of killing init.

- cmd/containerd-shim-runhcs-v1/task_hcs.go::waitInitExit — reboot.pending
  (bool, false). Flipped by Stage 4 when dispatching to handleReboot.

Placeholder values only; this stage introduces no behavior change and
keeps the baseline trace signature consistent with future-populated runs.
Task 2.4 of the container-reboot-v2 plan. Prior to this change the HCS
notification channel was typed chan error — the Win32 callback's
notificationData pointer was silently discarded. Callers observing
hcsNotificationSystemExited could therefore never see the
SystemExitStatus JSON, so ExitType=Reboot was invisible on the shim side.

- Introduce notificationPayload{err,data} struct and retype the channel.
- In notificationWatcher, materialize notificationData (null-terminated
  UTF-16) into payload.data via a new utf16PtrToString helper. Nil pointer
  yields '' data — the common case for non-Exited notifications.
- waithelper.go readers consume payload.err; payload.data is ignored
  here (consumed by System.waitBackground in Task 2.5).

Two TDD unit tests in callback_test.go cover the happy path (JSON
payload round-trips intact) and the nil-data case (benign).
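The materialization step above can be sketched like this. The real `utf16PtrToString` helper walks a raw `*uint16` from the Win32 callback; this version takes a slice so it runs anywhere, but the termination logic is the same: stop at the first zero code unit, and treat a nil or empty buffer as the empty string (the common non-Exited case):

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16ToString decodes a null-terminated UTF-16 buffer such as the
// notificationData the HCS callback hands the shim. Sketch only: the
// production helper operates on a raw pointer, not a slice.
func utf16ToString(buf []uint16) string {
	for i, u := range buf {
		if u == 0 {
			buf = buf[:i] // drop the terminator and anything past it
			break
		}
	}
	return string(utf16.Decode(buf))
}

func main() {
	raw := utf16.Encode([]rune(`{"ExitType":"Reboot"}`))
	raw = append(raw, 0, 0xFFFF) // null terminator plus junk beyond it
	fmt.Println(utf16ToString(raw))
	fmt.Printf("nil decodes to %q\n", utf16ToString(nil))
}
```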
Task 2.5 of the container-reboot-v2 plan.

- Add internal/hcs/exitstatus.go with systemExitStatus struct mirroring
  the HCS schema (Status, ExitType) and parseExitType helper. Unmarshal
  errors propagate; empty/missing payload returns ('', nil) so callers
  don't see spurious errors on non-exited notifications.

- Add exitType + exitTypeMu fields on *System plus an ExitType() getter
  (RLocked). Empty string before exit; 'Reboot' et al once populated.

- Wire into System.waitBackground: peek the SystemExitStatus payload
  ourselves before the existing err-only flow so we capture payload.data
  (the JSON). The peek replaces waitForNotification for this one
  notification type because waitForNotification's select is err-only —
  we'd lose the payload otherwise. System.waitBackground is the sole
  reader of this channel for the compute system's lifetime so the split
  is safe; other waiters go through waitForNotification on other
  notification types. Fallback path preserved for the 'callback context
  gone' edge case.

- Replace the Stage 1 placeholder span attrs (reboot.exit_type='',
  reboot.notification_data_bytes=0) with real values from the parsed
  payload.

Tests: 5 new parseExitType cases covering Reboot, GracefulExit, empty,
malformed JSON (returns err), and missing ExitType field (benign '').
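The parser contract described in this commit — empty payload is benign, malformed JSON is an error, a missing `ExitType` field yields "" — can be sketched in a few lines. Struct fields mirror the schema names given above (`Status`, `ExitType`); the exact JSON casing is an assumption:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// systemExitStatus mirrors the HCS schema fields named in the commit
// message. Illustrative; the real struct lives in internal/hcs/exitstatus.go.
type systemExitStatus struct {
	Status   int32  `json:"Status"`
	ExitType string `json:"ExitType"`
}

// parseExitType: empty payload returns ("", nil) so callers see no
// spurious errors on non-exited notifications; unmarshal errors propagate.
func parseExitType(data string) (string, error) {
	if data == "" {
		return "", nil
	}
	var s systemExitStatus
	if err := json.Unmarshal([]byte(data), &s); err != nil {
		return "", err
	}
	return s.ExitType, nil
}

func main() {
	for _, d := range []string{
		`{"Status":0,"ExitType":"Reboot"}`, // the interesting case
		`{"Status":0}`,                     // missing field: benign ""
		"",                                 // non-exited notification
		"not json",                         // malformed: error
	} {
		et, err := parseExitType(d)
		fmt.Printf("payload=%q exitType=%q err=%v\n", d, et, err)
	}
}
```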
Task 2.6 of the container-reboot-v2 plan. Extends the cow.Container
interface with ExitType() string so callers can observe the parsed
SystemExitStatus.ExitType carried up by *hcs.System.

*hcs.System already implements it (Task 2.5). Stub two other
cow.Container implementers to return '':

- *gcs.Container: talks to the LCOW guest directly, never sees an HCS
  SystemExitStatus. container-reboot-v2 is Argon-only so the LCOW path
  is out of scope; empty string is the correct 'unknown/fallback' answer.
- *jobcontainers.JobContainer: doesn't wrap an HCS compute system at all.

Callers treat empty string as 'unknown, use previous exit-handling
logic', so these stubs preserve existing behavior on non-Argon paths.
Task 2.7 of the container-reboot-v2 plan. When hcsExec.waitForContainerExit
observes the compute-system exit, surface the parsed ExitType via a logrus
Info entry — no behavior change, just a stable observability checkpoint.

Logs any non-empty ExitType, not just Reboot, so the shim trace reports
GracefulExit / UnexpectedExit the same way. Stage 4's handleReboot is
where the Reboot branch finally diverges from teardown; this log stays
useful in production as a compact 'what did HCS tell us' record.
…hook

Add a Reboot-observation point in hcsTask::waitInitExit, gated by
EnableShimRebootHandler. When a silo exits with ExitType=Reboot, emit
a stable Info log and set reboot.pending=true on the waitInitExit span.
No behavior change — teardown still runs — this is the reliable hook
Sub-step B will extend with actual handleReboot logic.

Why here vs hcsExec::waitForContainerExit:
  waitForContainerExit has a select between the container's WaitChannel
  (silo termination) and the init exec's processDone (init process exit).
  For an Argon reboot both fire near-simultaneously and in the Stage 3
  validation runs processDone won the race — meaning the existing Stage
  2 log in exec_hcs.go NEVER fired despite the reboot signal being
  present. waitInitExit runs unconditionally after init.Wait() returns,
  so it's a single, deterministic intercept.

Timing subtlety (debugged in-session):
  cow.Container.ExitType() is only defined AFTER WaitChannel() closes
  (cow.go:101). init.Wait() returns when the init PROCESS exits, but
  *hcs.System.waitBackground (which parses SystemExitStatus JSON into
  ExitType) runs on the system-level exit notification — a separate
  HCS callback. First run returned "" 100% of the time because the
  ExitType read happened ~22ms before waitBackground finished. Fix:
  block on ht.c.WaitChannel() (with 5s timeout warning) before reading
  ExitType.

Verified 2026-04-23 18:33 on reboot-v3:
  Span hcsTask::waitInitExit ... reboot.pending=true
  level=info msg="reboot-v2 Stage 4: would handle reboot here
    (no action; falling through to teardown)" reboot.exit_type=Reboot
…ate is possible

Two-part change, observation-only (no actual restart semantic yet):

B1 - internal/hcs/system.go: cache the hcsDocument on *System at creation
time, expose via System.CreateDocument() as a json.RawMessage. Previously
the document was assembled in hcsoci/create.go, marshaled, and discarded;
now it's retained on the System for later reissue by Sub-step B3's
handleReboot.

B2 - cmd/containerd-shim-runhcs-v1/task_hcs.go: in waitInitExit's Reboot
branch, BEFORE ht.close() (so the WCIFS overlay + HNS endpoint are still
live), run a probeSameIDRecreate that:
  1. Closes the old *hcs.System handle
  2. Calls hcs.CreateComputeSystem with the stashed doc on the same container ID
  3. Calls Start on the new system
  4. Logs each outcome, then Terminate+Wait+Close to clean up so the
     existing teardown path sees an empty slot

The point of the probe is to answer the Sub-step B design question: does
HCS reject same-ID recreate? Can the new silo pick up the old overlay and
HNS endpoint automatically?

Verified 2026-04-23 20:00 on reboot-v3 with all 5 guards on:
  reboot-v2 B2: closing old system handle before recreate probe (doc_bytes=700)
  reboot-v2 B2: CreateComputeSystem SUCCEEDED on same ID; calling Start
  reboot-v2 B2: Start SUCCEEDED — full create+start cycle works on same ID

Both assumptions from the research doc confirmed: (1) HCS accepts the
recreate with zero friction, (2) the overlay layer + HNS endpoint
registered for the container ID are reused by the new silo without
re-running hcsoci.CreateContainer. Sub-step B3 can now wire this into
ht.c / ht.init for an actual transparent restart.
… recreated silo

Extends probeSameIDRecreate: after hcs.CreateComputeSystem + newSys.Start
succeed on the reboot-recreated silo, also spawn a benign init process
via cmd.Cmd (mirroring the hcsExec.startInternal path). Waits for the
probe process to exit, logs the PID and exit code.

Uses a benign spec (cmd /c hostname) instead of ht.taskSpec.Process
because the real task spec on the current test-bed runs `shutdown /r`
and would cascade into an infinite reboot chain if re-executed on the
new silo. B3a is mechanics-only; B3b will use the unmodified spec once
the state-machine swap eliminates the cascade risk.

Verified 2026-04-23 21:16 on reboot-v3:
  reboot-v2 B2: closing old system handle before recreate probe (doc_bytes=700)
  reboot-v2 B2: CreateComputeSystem SUCCEEDED on same ID; calling Start
  reboot-v2 B2: Start SUCCEEDED — full create+start cycle works on same ID
  reboot-v2 B3a: probe init-process spawned probe.pid=2024
  reboot-v2 B3a: probe init-process exited — full recreate+spawn cycle verified
    probe.exit_code=0

The full HCS-API mechanics for transparent restart are now proven:
Close old handle -> CreateComputeSystem (same ID) -> System.Start ->
cmd.Start (init process). Each step logged with unambiguous success
markers. Sub-step B3b is the remaining piece: wire the new System and
new init exec into ht.c and ht.init, suppress ht.close(), so containerd
sees no /tasks/exit event. That's a shim-state-machine change, not an
HCS-API question.
First working transparent restart. On Reboot detection in waitInitExit,
the shim now:
  1. Closes the old *hcs.System handle
  2. Calls hcs.CreateComputeSystem with the cached document on the same ID
  3. Starts the new System
  4. Spawns the original init process (ht.taskSpec.Process) via cmd.Cmd
  5. Swaps ht.c = newSys
  6. Resets hcsExec state in-place under sl lock: c, p, pid, state=Running,
     exitStatus=255, exitedAt=zero, fresh processDone/exited channels +
     fresh sync.Once values
  7. Respawns waitForExit to track the new init process
  8. Returns from waitInitExit WITHOUT calling ht.close(ctx) — no TaskExit
     event published, task logically still Running

Verified 2026-04-23 21:39 on reboot-v3:
  reboot-v2 Stage 4: reboot observed; attempting transparent restart (B3b)
  reboot-v2 B3b: closing old system handle (doc_bytes=700)
  reboot-v2 B3b: new System created on same ID
  reboot-v2 B3b: new System started
  reboot-v2 B3b: new init process spawned new.pid=1848
  reboot-v2 B3b: task state swapped; container logically still Running
  reboot-v2 B3b: transparent restart completed; suppressing teardown

Docker reported the container as "Up About a minute" for the full window
between reboot-handled and our manual cleanup — FIRST TIME the transparent
restart is user-visible end-to-end.

KNOWN LIMITATIONS (Stage 5 cleanup):

* Stdio pipes: oldExec.io's upstream pipes were closed by the original
  init-exit path before our doHandleReboot ran. The new cmd.Cmd tries to
  reuse those closed pipes — immediately gets "file has already been
  closed" on stdout relay. The new init process is effectively blind.
  Fix: reopen the upstream IO pipes via NewUpstreamIO before spawning
  the new init.

* No reboot loop: if the new silo reboots again, we fall through to
  normal exit because waitInitExit already returned. Fix: respawn
  waitInitExit (or restructure as a for-loop) after handleReboot.

* Docker exec / docker rm deadlock: after the first restart, docker
  commands against the container hang. Root cause likely in the closed-
  stdio state or in our respawned waitForExit hitting an invalid IO.
  Needs debug + fix before this is shippable.

* PID visibility: containerd caches the original init PID from the
  TaskCreate event. docker inspect still reports the old PID even after
  successful restart. Cosmetic for now; a /tasks/start republish (or a
  new /tasks/reboot event type) would address it.

probeSameIDRecreate is retained as-is for reference / fallback during
iteration — will be removed once Sub-step C (loop + stdio fix) lands.
…back

Two follow-up fixes on top of B3b's transparent-restart prototype:

1. Reboot loop. After a successful handleReboot, respawn waitInitExit as
   a goroutine so a subsequent in-container reboot is also handled. Each
   cycle spawns a fresh waiter for the next one. The chain terminates
   naturally when the task ends (non-Reboot exit) or when an external
   docker stop/rm drives teardown.

2. Fresh stdio with headless fallback. Original plan: reopen the
   containerd-owned pipes with NewUpstreamIO against the original paths.
   Observed behavior: containerd tears down its server-side pipes when
   the shim's client disconnects during the original init-exit path, so
   NewUpstreamIO fails with "system cannot find the file specified"
   every time. Pragmatic fix: fall back to nil stdio and let the new
   init run headless. The process still runs, docker still sees the
   container as Up, follow-up ops (exec/stop/rm) no longer deadlock.
   Full stdio reattach needs a containerd-side change (or a shim pipe-
   republish protocol) and is out of scope for the prototype.
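The reboot loop of fix 1 reduces to a simple shape: each wait cycle either handles a Reboot exit and goes around again, or falls through to normal teardown on any other exit type. This sketch models the exit notifications as a channel of exit-type strings; the function and channel are illustrative, not the actual `waitInitExit` signature:

```go
package main

import "fmt"

// waitInitExitLoop handles consecutive Reboot exits until a non-Reboot
// exit (or a failed restart) terminates the chain — the same termination
// conditions described above.
func waitInitExitLoop(exits <-chan string, handleReboot func() error) string {
	for exitType := range exits {
		if exitType != "Reboot" {
			return exitType // normal teardown path
		}
		if err := handleReboot(); err != nil {
			return "RebootFailed" // restart failed: fall through to teardown
		}
		// Success: loop again, waiting on the new init's exit.
	}
	return "ChannelClosed"
}

func main() {
	exits := make(chan string, 3)
	exits <- "Reboot"
	exits <- "Reboot"
	exits <- "GracefulExit"

	cycles := 0
	final := waitInitExitLoop(exits, func() error { cycles++; return nil })
	fmt.Printf("reboot cycles=%d final=%s\n", cycles, final)
}
```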

Also stash req.Stdin/Stdout/Stderr/Terminal on hcsTask at newHcsTask
time so doHandleReboot can re-attempt NewUpstreamIO with the original
paths even though it's expected to fail today.

Verified 2026-04-23 22:18 on reboot-v3, one container lifecycle:
  t=0: docker run servercore cmd /c "start /b shutdown /r & ping -t"
  t=~33s:  reboot cycle 1 — new.pid=5368
  t=~63s:  reboot cycle 2 — new.pid=1180
  t=~93s:  reboot cycle 3 — new.pid=6636
  docker ps: "Up About a minute" throughout
  docker stop: Exited (1067)
  docker rm: clean removal

Remaining B3 gaps are non-deadlocking and mostly cosmetic:
  * Stdio not visible after restart (fundamental — needs containerd change)
  * docker inspect reports the original PID (cached in containerd's task
    state; would need /tasks/start republish or a new /tasks/reboot topic)
  * On shim shutdown the last headless silo may linger briefly (cleanup
    timing; doesn't affect user-facing behavior)
Now that the full transparent-reboot flow (detection -> create same-ID ->
spawn init -> state swap -> reboot loop) is working end-to-end, clean
up the Stage 4 iteration scaffolding:

- Remove probeSameIDRecreate function entirely. It was retained as a
  reference/fallback during iteration but is superseded by doHandleReboot
  and has no callers.

- Collapse "reboot-v2 B3b:" / "reboot-v2 B3c:" log prefixes to just
  "reboot-v2:". The sub-step labels were useful for differentiating probe
  runs during iteration but add noise now that there's a single reboot
  code path.

- Update the doHandleReboot docstring to reflect the final flow (all 9
  steps including fresh stdio + reboot loop) and its actual known gaps
  (stdio reattach, PID cache), removing the "B3c will do this later"
  TODO-style notes that no longer apply.

- Update the caller-site comment in waitInitExit to document that the
  reboot loop is the explicit reason we return without ht.close() — the
  respawned waitInitExit handles any subsequent reboot.

No behavior change. Verified green build (go build -ldflags "-s -w").
Next: redeploy + re-run the reboot cycle test to confirm nothing
regressed, then snapshot.

-143/+44 LOC net.