[Draft] container-reboot-v2: shim-driven transparent reboot prototype #2710
Draft
pbozzay wants to merge 12 commits into microsoft:main from
Conversation
added 12 commits on April 20, 2026 at 00:57
Task 1.9 of the container-reboot-v2 plan. Adds an internal/devguard package that reads HKLM\Software\Microsoft\HCS\Dev\Reboot\<Name> DWORDs at runtime, mirroring the HcsDev::Reboot::* accessors on the HCS C++ side. Five named guard constants are exported: ForceStopForRestart, ExposeRebootNotification, PassExitStatusJson, SkipInternalRebootStart, EnableShimRebootHandler.

IsEnabled() opens the registry key, reads the DWORD, and closes the key. No caching: every call is a fresh read, so registry flips take effect on the next event. A missing key, missing value, wrong value type, or access denied all return false. Three TDD unit tests cover the missing-key, zero-value, and non-zero-value cases.
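For concreteness, a minimal sketch of what this guard reader could look like, assuming golang.org/x/sys/windows/registry. The key path and guard names come from the commit; everything else (helper shape, package layout) is illustrative:

```go
package devguard

import "golang.org/x/sys/windows/registry"

// Guard names mirror the HcsDev::Reboot::* accessors on the HCS side.
const (
	ForceStopForRestart      = "ForceStopForRestart"
	ExposeRebootNotification = "ExposeRebootNotification"
	PassExitStatusJson       = "PassExitStatusJson"
	SkipInternalRebootStart  = "SkipInternalRebootStart"
	EnableShimRebootHandler  = "EnableShimRebootHandler"
)

const keyPath = `Software\Microsoft\HCS\Dev\Reboot`

// IsEnabled performs a fresh registry read on every call, so flipping a
// DWORD takes effect on the next event. Missing key, missing value,
// wrong type, and access denied all report the guard as disabled.
func IsEnabled(name string) bool {
	k, err := registry.OpenKey(registry.LOCAL_MACHINE, keyPath, registry.QUERY_VALUE)
	if err != nil {
		return false
	}
	defer k.Close()
	v, _, err := k.GetIntegerValue(name)
	return err == nil && v != 0
}
```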
…ders)

Task 1.10 of the container-reboot-v2 plan. Adds OpenCensus span attributes along the reboot observation path:

- internal/hcs/system.go::waitBackground — reboot.exit_type (string, empty) and reboot.notification_data_bytes (int64, 0). Populated by Stage 2 once notificationWatcher parses the SystemExitStatus JSON.
- cmd/containerd-shim-runhcs-v1/exec_hcs.go::waitForContainerExit — reboot.pending (bool, false). Flipped by Stage 4 when the shim observes a Reboot exit_type and sets hcsExec.rebootPending instead of killing init.
- cmd/containerd-shim-runhcs-v1/task_hcs.go::waitInitExit — reboot.pending (bool, false). Flipped by Stage 4 when dispatching to handleReboot.

Placeholder values only; this stage introduces no behavior change and keeps the baseline trace signature consistent with later runs where the values are populated.
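A minimal sketch of the placeholder attributes, assuming go.opencensus.io/trace (the tracing library hcsshim's spans use); the helper and package names are illustrative:

```go
package shim

import "go.opencensus.io/trace"

// addRebootPlaceholders records the Stage 1 baseline values; Stage 2
// and Stage 4 later overwrite them with parsed data.
func addRebootPlaceholders(span *trace.Span) {
	span.AddAttributes(
		trace.StringAttribute("reboot.exit_type", ""),
		trace.Int64Attribute("reboot.notification_data_bytes", 0),
		trace.BoolAttribute("reboot.pending", false),
	)
}
```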
Task 2.4 of the container-reboot-v2 plan. Prior to this change the HCS
notification channel was typed chan error — the Win32 callback's
notificationData pointer was silently discarded. Callers observing
hcsNotificationSystemExited could therefore never see the
SystemExitStatus JSON, so ExitType=Reboot was invisible on the shim side.
- Introduce notificationPayload{err,data} struct and retype the channel.
- In notificationWatcher, materialize notificationData (null-terminated
UTF-16) into payload.data via a new utf16PtrToString helper. Nil pointer
yields '' data — the common case for non-Exited notifications.
- waithelper.go readers consume payload.err; payload.data is ignored
here (consumed by System.waitBackground in Task 2.5).
Two TDD unit tests in callback_test.go cover the happy path (JSON
payload round-trips intact) and the nil-data case (benign).
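A sketch of the retype, assuming golang.org/x/sys/windows (whose UTF16PtrToString does the null-terminated copy); names follow the commit but details are illustrative:

```go
package hcs

import "golang.org/x/sys/windows"

// notificationPayload carries both the callback error and the
// materialized notificationData string through the channel (previously
// typed chan error, which dropped the data pointer).
type notificationPayload struct {
	err  error
	data string
}

// utf16PtrToString materializes the Win32 callback's null-terminated
// UTF-16 notificationData; a nil pointer yields "", the common case for
// non-Exited notifications.
func utf16PtrToString(p *uint16) string {
	if p == nil {
		return ""
	}
	return windows.UTF16PtrToString(p)
}
```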
Task 2.5 of the container-reboot-v2 plan.
- Add internal/hcs/exitstatus.go with systemExitStatus struct mirroring
the HCS schema (Status, ExitType) and parseExitType helper. Unmarshal
errors propagate; empty/missing payload returns ('', nil) so callers
don't see spurious errors on non-exited notifications.
- Add exitType + exitTypeMu fields on *System plus an ExitType() getter
(RLocked). Empty string before exit; 'Reboot' et al once populated.
- Wire into System.waitBackground: peek the SystemExitStatus payload
ourselves before the existing err-only flow so we capture payload.data
(the JSON). The peek replaces waitForNotification for this one
notification type because waitForNotification's select is err-only —
we'd lose the payload otherwise. System.waitBackground is the sole
reader of this channel for the compute system's lifetime so the split
is safe; other waiters go through waitForNotification on other
notification types. Fallback path preserved for the 'callback context
gone' edge case.
- Replace the Stage 1 placeholder span attrs (reboot.exit_type='',
reboot.notification_data_bytes=0) with real values from the parsed
payload.
Tests: 5 new parseExitType cases covering Reboot, GracefulExit, empty,
malformed JSON (returns err), and missing ExitType field (benign '').
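A sketch of the parser described above; the field tags mirror the stated HCS schema, and the Status type is an assumption:

```go
package hcs

import "encoding/json"

// systemExitStatus mirrors the HCS SystemExitStatus schema fields used
// here; Status's exact type is an assumption.
type systemExitStatus struct {
	Status   int32  `json:"Status,omitempty"`
	ExitType string `json:"ExitType,omitempty"`
}

// parseExitType returns ("", nil) for an empty or missing payload so
// callers see no spurious errors on non-exited notifications; JSON
// unmarshal errors propagate.
func parseExitType(payload string) (string, error) {
	if payload == "" {
		return "", nil
	}
	var s systemExitStatus
	if err := json.Unmarshal([]byte(payload), &s); err != nil {
		return "", err
	}
	return s.ExitType, nil
}
```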
Task 2.6 of the container-reboot-v2 plan. Extends the cow.Container interface with ExitType() string so callers can observe the parsed SystemExitStatus.ExitType carried up by *hcs.System, which already implements it (Task 2.5). Stub the two other cow.Container implementers to return '':

- *gcs.Container: talks to the LCOW guest directly and never sees an HCS SystemExitStatus. container-reboot-v2 is Argon-only, so the LCOW path is out of scope; empty string is the correct 'unknown/fallback' answer.
- *jobcontainers.JobContainer: doesn't wrap an HCS compute system at all.

Callers treat empty string as 'unknown, use the previous exit-handling logic', so these stubs preserve existing behavior on non-Argon paths.
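The interface change itself is small; sketched here with the existing method set elided:

```go
// Within package cow (existing methods elided for brevity):
type Container interface {
	// ...existing cow.Container methods...

	// ExitType returns the parsed SystemExitStatus.ExitType ("Reboot",
	// "GracefulExit", ...) or "" when unknown; LCOW and job containers
	// always report "".
	ExitType() string
}
```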
Task 2.7 of the container-reboot-v2 plan. When hcsExec.waitForContainerExit observes the compute-system exit, surface the parsed ExitType via a logrus Info entry — no behavior change, just a stable observability checkpoint. Logs any non-empty ExitType, not just Reboot, so the shim trace reports GracefulExit / UnexpectedExit the same way. Stage 4's handleReboot is where the Reboot branch finally diverges from teardown; this log stays useful in production as a compact 'what did HCS tell us' record.
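A minimal sketch of the checkpoint, assuming sirupsen/logrus (hcsshim's logger); the helper and field names are illustrative:

```go
package shim

import "github.com/sirupsen/logrus"

// logObservedExitType is the stable "what did HCS tell us" record:
// any non-empty ExitType is logged, not just Reboot.
func logObservedExitType(containerID, exitType string) {
	if exitType == "" {
		return // nothing parsed; non-Argon or exit not yet observed
	}
	logrus.WithFields(logrus.Fields{
		"cid":      containerID,
		"exitType": exitType,
	}).Info("compute system exited with explicit ExitType")
}
```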
…hook
Add a Reboot-observation point in hcsTask::waitInitExit, gated by
EnableShimRebootHandler. When a silo exits with ExitType=Reboot, emit
a stable Info log and set reboot.pending=true on the waitInitExit span.
No behavior change — teardown still runs — this is the reliable hook
Sub-step B will extend with actual handleReboot logic.
Why here vs hcsExec::waitForContainerExit:
waitForContainerExit has a select between the container's WaitChannel
(silo termination) and the init exec's processDone (init process exit).
For an Argon reboot both fire near-simultaneously and in the Stage 3
validation runs processDone won the race — meaning the existing Stage
2 log in exec_hcs.go NEVER fired despite the reboot signal being
present. waitInitExit runs unconditionally after init.Wait() returns,
so it's a single, deterministic intercept.
Timing subtlety (debugged in-session):
cow.Container.ExitType() is only defined AFTER WaitChannel() closes
(cow.go:101). init.Wait() returns when the init PROCESS exits, but
*hcs.System.waitBackground (which parses SystemExitStatus JSON into
ExitType) runs on the system-level exit notification — a separate
HCS callback. First run returned "" 100% of the time because the
ExitType read happened ~22ms before waitBackground finished. Fix:
block on ht.c.WaitChannel() (with 5s timeout warning) before reading
ExitType.
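A sketch of the fix, using cow.Container's WaitChannel() as described above; the timeout handling and helper name are illustrative:

```go
package shim

import (
	"time"

	"github.com/Microsoft/hcsshim/internal/cow"
	"github.com/sirupsen/logrus"
)

// exitTypeAfterSystemExit blocks until the system-level exit
// notification has been processed (WaitChannel closes) before reading
// ExitType, closing the ~22ms race described above. The 5s timeout
// only warns; an empty ExitType then falls through to normal teardown.
func exitTypeAfterSystemExit(c cow.Container) string {
	select {
	case <-c.WaitChannel():
		// waitBackground has parsed SystemExitStatus by now.
	case <-time.After(5 * time.Second):
		logrus.Warn("timed out waiting for compute-system exit notification; ExitType may be empty")
	}
	return c.ExitType()
}
```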
Verified 2026-04-23 18:33 on reboot-v3:
Span hcsTask::waitInitExit ... reboot.pending=true
level=info msg="reboot-v2 Stage 4: would handle reboot here
(no action; falling through to teardown)" reboot.exit_type=Reboot
…ate is possible
Two-part change, observation-only (no actual restart semantic yet):
B1 - internal/hcs/system.go: cache the hcsDocument on *System at creation
time, expose via System.CreateDocument() as a json.RawMessage. Previously
the document was assembled in hcsoci/create.go, marshaled, and discarded;
now it's retained on the System for later reissue by Sub-step B3's
handleReboot.
B2 - cmd/containerd-shim-runhcs-v1/task_hcs.go: in waitInitExit's Reboot
branch, BEFORE ht.close() (so the WCIFS overlay + HNS endpoint are still
live), run a probeSameIDRecreate that:
1. Closes the old *hcs.System handle
2. Calls hcs.CreateComputeSystem with the stashed doc on the same container ID
3. Calls Start on the new system
4. Logs each outcome, then Terminate+Wait+Close to clean up so the
existing teardown path sees an empty slot
The point of the probe is to answer the Sub-step B design questions: does
HCS reject a same-ID recreate, and can the new silo pick up the old
overlay and HNS endpoint automatically?
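A condensed sketch of the probe, assuming the internal/hcs wrappers (CreateComputeSystem, Start, Terminate, Wait, Close); error handling and log text are trimmed relative to the real probe:

```go
package shim

import (
	"context"
	"encoding/json"

	"github.com/Microsoft/hcsshim/internal/hcs"
	"github.com/sirupsen/logrus"
)

// probeSameIDRecreate attempts a full create+start cycle on the
// just-exited container ID, then tears the probe system down so the
// existing teardown path sees an empty slot.
func probeSameIDRecreate(ctx context.Context, id string, oldSys *hcs.System, doc json.RawMessage) {
	oldSys.Close() // 1. release the old handle so HCS frees the ID

	newSys, err := hcs.CreateComputeSystem(ctx, id, doc) // 2. reissue cached doc
	if err != nil {
		logrus.WithError(err).Error("same-ID recreate rejected")
		return
	}
	if err := newSys.Start(ctx); err != nil { // 3. start the recreated silo
		logrus.WithError(err).Error("Start failed on recreated system")
	}

	// 4. clean up: Terminate + Wait + Close
	_ = newSys.Terminate(ctx)
	_ = newSys.Wait()
	newSys.Close()
}
```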
Verified 2026-04-23 20:00 on reboot-v3 with all 5 guards on:
reboot-v2 B2: closing old system handle before recreate probe (doc_bytes=700)
reboot-v2 B2: CreateComputeSystem SUCCEEDED on same ID; calling Start
reboot-v2 B2: Start SUCCEEDED — full create+start cycle works on same ID
Both assumptions from the research doc confirmed: (1) HCS accepts the
recreate with zero friction, (2) the overlay layer + HNS endpoint
registered for the container ID are reused by the new silo without
re-running hcsoci.CreateContainer. Sub-step B3 can now wire this into
ht.c / ht.init for an actual transparent restart.
… recreated silo
Extends probeSameIDRecreate: after hcs.CreateComputeSystem + newSys.Start
succeed on the reboot-recreated silo, also spawn a benign init process
via cmd.Cmd (mirroring the hcsExec.startInternal path). Waits for the
probe process to exit, logs the PID and exit code.
Uses a benign spec (cmd /c hostname) instead of ht.taskSpec.Process
because the real task spec on the current test-bed runs `shutdown /r`
and would cascade into an infinite reboot chain if re-executed on the
new silo. B3a is mechanics-only; B3b will use the unmodified spec once
the state-machine swap eliminates the cascade risk.
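A sketch of the benign spawn, assuming internal/cmd's Command/Start/Wait helpers; exact field shapes are from memory and illustrative:

```go
package shim

import (
	"github.com/Microsoft/hcsshim/internal/cmd"
	"github.com/Microsoft/hcsshim/internal/cow"
	"github.com/sirupsen/logrus"
)

// spawnProbeInit runs the benign `cmd /c hostname` init in the
// recreated silo and waits for it, mirroring the hcsExec.startInternal
// mechanics without the real (reboot-triggering) task spec.
func spawnProbeInit(host cow.ProcessHost) error {
	probe := cmd.Command(host, "cmd", "/c", "hostname")
	if err := probe.Start(); err != nil {
		return err
	}
	logrus.WithField("probe.pid", probe.Process.Pid()).Info("probe init-process spawned")
	return probe.Wait() // nonzero exit surfaces as an error
}
```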
Verified 2026-04-23 21:16 on reboot-v3:
reboot-v2 B2: closing old system handle before recreate probe (doc_bytes=700)
reboot-v2 B2: CreateComputeSystem SUCCEEDED on same ID; calling Start
reboot-v2 B2: Start SUCCEEDED — full create+start cycle works on same ID
reboot-v2 B3a: probe init-process spawned probe.pid=2024
reboot-v2 B3a: probe init-process exited — full recreate+spawn cycle verified
probe.exit_code=0
The full HCS-API mechanics for transparent restart are now proven:
Close old handle -> CreateComputeSystem (same ID) -> System.Start ->
cmd.Start (init process). Each step logged with unambiguous success
markers. Sub-step B3b is the remaining piece: wire the new System and
new init exec into ht.c and ht.init, suppress ht.close(), so containerd
sees no /tasks/exit event. That's a shim-state-machine change, not an
HCS-API question.
First working transparent restart. On Reboot detection in waitInitExit,
the shim now:
1. Closes the old *hcs.System handle
2. Calls hcs.CreateComputeSystem with the cached document on the same ID
3. Starts the new System
4. Spawns the original init process (ht.taskSpec.Process) via cmd.Cmd
5. Swaps ht.c = newSys
6. Resets hcsExec state in-place under sl lock: c, p, pid, state=Running,
exitStatus=255, exitedAt=zero, fresh processDone/exited channels +
fresh sync.Once values
7. Respawns waitForExit to track the new init process
8. Returns from waitInitExit WITHOUT calling ht.close(ctx) — no TaskExit
event published, task logically still Running
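A self-contained sketch of step 6, the in-place state reset; field names follow the commit message, types are illustrative:

```go
package shim

import (
	"sync"
	"time"

	"github.com/Microsoft/hcsshim/internal/cow"
)

// Minimal stand-in for the shim's hcsExec; field names follow the
// commit message, types are illustrative.
type hcsExec struct {
	sl              sync.Mutex
	c               cow.Container
	p               cow.Process
	pid             int
	state           string
	exitStatus      uint32
	exitedAt        time.Time
	processDone     chan struct{}
	exited          chan struct{}
	processDoneOnce sync.Once
	exitedOnce      sync.Once
}

// resetForReboot rebinds the exec to the new system and init process
// and rewinds its lifecycle state in place, under the state lock, so
// containerd never observes an exit.
func (he *hcsExec) resetForReboot(newSys cow.Container, newInit cow.Process) {
	he.sl.Lock()
	defer he.sl.Unlock()
	he.c = newSys
	he.p = newInit
	he.pid = newInit.Pid()
	he.state = "running"
	he.exitStatus = 255       // sentinel until the new init exits
	he.exitedAt = time.Time{} // zero value: not exited
	// Fresh channels and sync.Once values so the wait/exit machinery
	// can fire again for the new init process.
	he.processDone = make(chan struct{})
	he.exited = make(chan struct{})
	he.processDoneOnce = sync.Once{}
	he.exitedOnce = sync.Once{}
}
```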
Verified 2026-04-23 21:39 on reboot-v3:
reboot-v2 Stage 4: reboot observed; attempting transparent restart (B3b)
reboot-v2 B3b: closing old system handle (doc_bytes=700)
reboot-v2 B3b: new System created on same ID
reboot-v2 B3b: new System started
reboot-v2 B3b: new init process spawned new.pid=1848
reboot-v2 B3b: task state swapped; container logically still Running
reboot-v2 B3b: transparent restart completed; suppressing teardown
Docker reported the container as "Up About a minute" for the full window
between the handled reboot and our manual cleanup — the first time the
transparent restart has been user-visible end-to-end.
KNOWN LIMITATIONS (Stage 5 cleanup):
* Stdio pipes: oldExec.io's upstream pipes were closed by the original
init-exit path before our doHandleReboot ran. The new cmd.Cmd tries to
reuse those closed pipes — immediately gets "file has already been
closed" on stdout relay. The new init process is effectively blind.
Fix: reopen the upstream IO pipes via NewUpstreamIO before spawning
the new init.
* No reboot loop: if the new silo reboots again, we fall through to
normal exit because waitInitExit already returned. Fix: respawn
waitInitExit (or restructure as a for-loop) after handleReboot.
* Docker exec / docker rm deadlock: after the first restart, docker
commands against the container hang. Root cause likely in the closed-
stdio state or in our respawned waitForExit hitting an invalid IO.
Needs debug + fix before this is shippable.
* PID visibility: containerd caches the original init PID from the
TaskCreate event. docker inspect still reports the old PID even after
successful restart. Cosmetic for now; a /tasks/start republish (or a
new /tasks/reboot event type) would address it.
probeSameIDRecreate is retained as-is for reference / fallback during
iteration — will be removed once Sub-step C (loop + stdio fix) lands.
…back
Two follow-up fixes on top of B3b's transparent-restart prototype:
1. Reboot loop. After a successful handleReboot, respawn waitInitExit as
a goroutine so a subsequent in-container reboot is also handled. Each
cycle spawns a fresh waiter for the next one. The chain terminates
naturally when the task ends (non-Reboot exit) or when an external
docker stop/rm drives teardown.
2. Fresh stdio with headless fallback. Original plan: reopen the
containerd-owned pipes with NewUpstreamIO against the original paths.
Observed behavior: containerd tears down its server-side pipes when
the shim's client disconnects during the original init-exit path, so
NewUpstreamIO fails with "system cannot find the file specified"
every time. Pragmatic fix: fall back to nil stdio and let the new
init run headless. The process still runs, docker still sees the
container as Up, follow-up ops (exec/stop/rm) no longer deadlock.
Full stdio reattach needs a containerd-side change (or a shim pipe-
republish protocol) and is out of scope for the prototype.
Also stash req.Stdin/Stdout/Stderr/Terminal on hcsTask at newHcsTask
time so doHandleReboot can re-attempt NewUpstreamIO with the original
paths even though it's expected to fail today.
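The respawn pattern from fix 1, sketched abstractly; waitInitExit and handleReboot stand in for the shim's methods:

```go
package shim

// waitInitExit sketches the reboot loop: after a successful
// handleReboot, arm a fresh waiter for the next cycle. The chain ends
// naturally on any non-Reboot exit, a failed restart, or an external
// docker stop/rm driving teardown.
func waitInitExit(exitType func() string, handleReboot func() bool) {
	if exitType() == "Reboot" && handleReboot() {
		// Each cycle spawns a fresh waiter for the next reboot.
		go waitInitExit(exitType, handleReboot)
		return // suppress teardown; task logically still Running
	}
	// Non-Reboot exit (or failed restart): normal teardown runs here.
}
```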
Verified 2026-04-23 22:18 on reboot-v3, one container lifecycle:
t=0: docker run servercore cmd /c "start /b shutdown /r & ping -t"
t=~33s: reboot cycle 1 — new.pid=5368
t=~63s: reboot cycle 2 — new.pid=1180
t=~93s: reboot cycle 3 — new.pid=6636
docker ps: "Up About a minute" throughout
docker stop: Exited (1067)
docker rm: clean removal
Remaining B3 gaps are non-deadlocking and mostly cosmetic:
* Stdio not visible after restart (fundamental — needs containerd change)
* docker inspect reports the original PID (cached in containerd's task
state; would need /tasks/start republish or a new /tasks/reboot topic)
* On shim shutdown the last headless silo may linger briefly (cleanup
timing; doesn't affect user-facing behavior)
Now that the full transparent-reboot flow (detection -> create same-ID -> spawn init -> state swap -> reboot loop) is working end-to-end, clean up the Stage 4 iteration scaffolding:

- Remove the probeSameIDRecreate function entirely. It was retained as a reference/fallback during iteration but is superseded by doHandleReboot and has no callers.
- Collapse the "reboot-v2 B3b:" / "reboot-v2 B3c:" log prefixes to just "reboot-v2:". The sub-step labels were useful for differentiating probe runs during iteration but add noise now that there is a single reboot code path.
- Update the doHandleReboot docstring to reflect the final flow (all 9 steps, including fresh stdio and the reboot loop) and its actual known gaps (stdio reattach, PID cache), removing the "B3c will do this later" TODO-style notes that no longer apply.
- Update the caller-site comment in waitInitExit to document that the reboot loop is the explicit reason we return without ht.close() — the respawned waitInitExit handles any subsequent reboot.

No behavior change. Verified green build (go build -ldflags "-s -w"). Next: redeploy and re-run the reboot-cycle test to confirm nothing regressed, then snapshot. -143/+44 LOC net.
Status: Draft for design discussion. Not yet ready for review.
What this is
Prototype hcsshim-side changes for transparent reboot of Windows process-isolated (Argon) containers — paired with HCS-side changes on `user/pbozza/container_reboot_v2` in `microsoft/os.2020`.
When a user runs `shutdown /r /t 0` inside a Windows Server container, the goal is for docker / containerd to see the container as continuously running while HCS internally tears down and recreates the silo. The COW overlay and HNS endpoint persist; the container's PID 1 changes (it's a fresh kernel, fresh init).
Design discussion
A higher-level architecture document is in progress (not in this PR). TL;DR: the prototype is the shim-driven half of a hybrid approach where:
What's in this PR
11 commits, ~500 LOC net. Highlights by area:
Core plumbing — extending the existing notification path with payload data so the SystemExitStatus JSON survives to the shim:
Stage 4 transparent restart in `cmd/containerd-shim-runhcs-v1/`:
Dev-guard scaffolding at `internal/devguard/` — registry-key gating so the new behaviors can be opt-in. Five guards: ForceStopForRestart, ExposeRebootNotification, PassExitStatusJson, SkipInternalRebootStart, EnableShimRebootHandler.
Known limitations
Validated end-to-end
Works on the test VM with the matching HCS-side changes:
Not for review yet
Pushing now to make the changes visible alongside the design conversation. Cleanup, dev-guard consolidation, test coverage, and (most importantly) HCS-side stdio preservation would all happen before this is review-ready.