
[Draft] container-reboot-v2: shim-driven transparent reboot prototype#2710

Draft
pbozzay wants to merge 12 commits into microsoft:main from pbozzay:user/pbozza/hcsshim_reboot_v2

Conversation


@pbozzay pbozzay commented Apr 27, 2026

Status: Draft for design discussion. Not yet ready for review.

What this is

Prototype hcsshim-side changes for transparent reboot of Windows process-isolated (Argon) containers — paired with HCS-side changes on `user/pbozza/container_reboot_v2` in `microsoft/os.2020`.

When a user runs `shutdown /r /t 0` inside a Windows Server container, the goal is for docker / containerd to see the container as continuously running while HCS internally tears down and recreates the silo. The COW overlay and HNS endpoint persist; the container's PID 1 changes (it's a fresh kernel, fresh init).

Design discussion

A higher-level architecture document is in progress (not in this PR). TL;DR: the prototype is the shim-driven half of a hybrid approach in which:

  • HCS emits a `Reboot` notification via the existing `SystemExited` callback (existing `Feature_HcsSiloReboot` plumbing, finished in the matching OS PR)
  • hcsshim (this PR) catches the notification, recreates the compute system on the same container ID via a cached create document, spawns a fresh init, swaps the task state in place, suppresses `/tasks/exit`
  • containerd / docker see no exit event; `docker ps` reports the container continuously up

What's in this PR

12 commits, ~500 LOC net. Highlights by area:

Core plumbing — extending the existing notification path with payload data so the SystemExitStatus JSON survives to the shim:

  • `internal/hcs/callback.go`: notification payload now carries `(err, data)` instead of just `err`
  • `internal/hcs/system.go`: parse SystemExitStatus, cache `ExitType` on `*hcs.System`, expose via `ExitType()` method on `cow.Container`
  • `internal/hcs/exitstatus.go`: parser + tests
  • `internal/cow/cow.go` + LCOW/job container implementations: extend `cow.Container` interface
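The payload retyping at the heart of this plumbing can be sketched as follows. Field names follow the PR description; the exact shape in `internal/hcs/callback.go` may differ, so treat this as illustrative:

```go
package main

import "fmt"

// notificationPayload mirrors the retyped channel element described above:
// the Win32 callback's notification data travels alongside the error
// instead of being silently discarded. Field names are from the commit
// messages; everything else here is a sketch.
type notificationPayload struct {
	err  error
	data string // SystemExitStatus JSON; "" for non-Exited notifications
}

func main() {
	// Previously this channel was `chan error`; retyping it lets the
	// SystemExitStatus JSON survive to the shim-side reader.
	ch := make(chan notificationPayload, 1)

	ch <- notificationPayload{err: nil, data: `{"Status":0,"ExitType":"Reboot"}`}

	p := <-ch
	fmt.Printf("err=%v data=%s\n", p.err, p.data)
}
```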

Stage 4 transparent restart in `cmd/containerd-shim-runhcs-v1/`:

  • Cache the `hcsDocument` on `*hcs.System` at create time so the recreate can reissue identically
  • `hcsTask::waitInitExit` detects `ExitType=Reboot`, calls `doHandleReboot`
  • `doHandleReboot` closes the old System, creates a new one on the same ID, spawns the original init via `cmd.Cmd`, swaps state in place, respawns waiters for the next reboot cycle
  • Suppresses `/tasks/exit` and `ht.close` on the success path

Dev-guard scaffolding at `internal/devguard/` — registry-key gating so the new behaviors can be opt-in. Five guards:

  • `ForceStopForRestart`, `ExposeRebootNotification`, `PassExitStatusJson` — gate the matching HCS-side behaviors
  • `SkipInternalRebootStart` — tells HCS not to attempt an internal restart, leaving the slot free for the shim
  • `EnableShimRebootHandler` — gates the shim-side `doHandleReboot` path
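The gating pattern itself is simple. Since the real backend (`HKLM\Software\Microsoft\HCS\Dev\Reboot` DWORDs, per the commit message below) is Windows-only, this sketch swaps in an in-memory store to show the two properties that matter: every check is a fresh read, and any lookup failure defaults to disabled. The `guardStore` interface and `memStore` type are illustration only, not hcsshim code:

```go
package main

import "fmt"

// Guard names from the PR; the values behind them are DWORDs in the real
// registry-backed implementation.
const (
	EnableShimRebootHandler = "EnableShimRebootHandler"
	SkipInternalRebootStart = "SkipInternalRebootStart"
)

// guardStore abstracts the registry read so the pattern is portable.
type guardStore interface {
	// read returns the value and whether the lookup succeeded.
	read(name string) (uint32, bool)
}

type memStore map[string]uint32

func (m memStore) read(name string) (uint32, bool) {
	v, ok := m[name]
	return v, ok
}

// IsEnabled treats a missing value — like a missing key, wrong type, or
// access-denied error in the registry-backed version — as "disabled".
// No caching: flips take effect on the next check.
func IsEnabled(s guardStore, name string) bool {
	v, ok := s.read(name)
	return ok && v != 0
}

func main() {
	s := memStore{EnableShimRebootHandler: 1, SkipInternalRebootStart: 0}
	fmt.Println(IsEnabled(s, EnableShimRebootHandler)) // true
	fmt.Println(IsEnabled(s, SkipInternalRebootStart)) // false (zero value)
	fmt.Println(IsEnabled(s, "NoSuchGuard"))           // false (missing)
}
```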

Known limitations

  • Stdio doesn't survive the reboot today. New init runs with nil stdio (the "headless" fallback). `docker logs` only shows pre-reboot output. Interactive `docker run -it` sessions drop when the silo dies. Workaround: `docker exec -it` to reconnect, which gets fresh pipes per-invocation.
  • `docker inspect` reports the original PID because containerd caches it from the `TaskCreate` event and we don't have a mechanism to update it. Cosmetic; doesn't affect functionality.
  • WindowsContainerOrchestrator only for now. LCOW / HyperV-isolated / job containers aren't covered.
  • Dev-guard sprawl: 5 guards is too many for ship; should consolidate to one (`Feature_HcsSiloReboot`).

Validated end-to-end

Works on the test VM with the matching HCS-side changes:

  • `docker run -d ... cmd /c "ping -n 999999 127.0.0.1"` as init
  • `docker exec` to write a marker file
  • `docker exec -d ... shutdown /r /t 0` to trigger reboot
  • After ~25s: `docker ps` still shows Up
  • `docker exec` shows the marker file persisted (overlay survived)
  • `docker exec ... tasklist` shows different PIDs (kernel really rebooted)
  • `docker stop` and `docker rm` work cleanly

Not for review yet

Pushing now to make the changes visible alongside the design conversation. Cleanup, dev-guard consolidation, test coverage, and (most importantly) HCS-side stdio preservation would all happen before this is review-ready.

Paul Bozzay added 12 commits April 20, 2026 00:57
Task 1.9 of the container-reboot-v2 plan. Adds internal/devguard package
that reads HKLM\Software\Microsoft\HCS\Dev\Reboot\<Name> DWORDs at runtime,
mirroring the HcsDev::Reboot::* accessors on the HCS C++ side. Five named
guard constants exported (ForceStopForRestart, ExposeRebootNotification,
PassExitStatusJson, SkipInternalRebootStart, EnableShimRebootHandler).

IsEnabled() opens the registry key, reads the DWORD, closes. No caching;
every call is a fresh read so reg flips take effect on the next event.
Missing key, missing value, wrong type, or access-denied all return false.

Three TDD unit tests cover missing key, zero value, and non-zero value.
…ders)

Task 1.10 of the container-reboot-v2 plan. Adds OpenCensus span attributes
along the reboot observation path:

- internal/hcs/system.go::waitBackground — reboot.exit_type (string, empty)
  and reboot.notification_data_bytes (int64, 0). Populated by Stage 2 once
  notificationWatcher parses SystemExitStatus JSON.

- cmd/containerd-shim-runhcs-v1/exec_hcs.go::waitForContainerExit —
  reboot.pending (bool, false). Flipped by Stage 4 when the shim observes
  a Reboot exit_type and sets hcsExec.rebootPending instead of killing init.

- cmd/containerd-shim-runhcs-v1/task_hcs.go::waitInitExit — reboot.pending
  (bool, false). Flipped by Stage 4 when dispatching to handleReboot.

Placeholder values only; this stage introduces no behavior change and
keeps the baseline trace signature consistent with future-populated runs.
Task 2.4 of the container-reboot-v2 plan. Prior to this change the HCS
notification channel was typed chan error — the Win32 callback's
notificationData pointer was silently discarded. Callers observing
hcsNotificationSystemExited could therefore never see the
SystemExitStatus JSON, so ExitType=Reboot was invisible on the shim side.

- Introduce notificationPayload{err,data} struct and retype the channel.
- In notificationWatcher, materialize notificationData (null-terminated
  UTF-16) into payload.data via a new utf16PtrToString helper. Nil pointer
  yields '' data — the common case for non-Exited notifications.
- waithelper.go readers consume payload.err; payload.data is ignored
  here (consumed by System.waitBackground in Task 2.5).

Two TDD unit tests in callback_test.go cover the happy path (JSON
payload round-trips intact) and the nil-data case (benign).
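The materialization step above can be sketched like this. The real `utf16PtrToString` helper walks a raw `*uint16` from the Win32 callback; this version takes a slice so it runs anywhere, but the termination logic is the same: stop at the first zero code unit, and treat a nil or empty buffer as the empty string (the common non-Exited case):

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16ToString decodes a null-terminated UTF-16 buffer such as the
// notificationData the HCS callback hands the shim. Sketch only: the
// production helper operates on a raw pointer, not a slice.
func utf16ToString(buf []uint16) string {
	for i, u := range buf {
		if u == 0 {
			buf = buf[:i] // drop the terminator and anything past it
			break
		}
	}
	return string(utf16.Decode(buf))
}

func main() {
	raw := utf16.Encode([]rune(`{"ExitType":"Reboot"}`))
	raw = append(raw, 0, 0xFFFF) // null terminator plus junk beyond it
	fmt.Println(utf16ToString(raw))
	fmt.Printf("nil decodes to %q\n", utf16ToString(nil))
}
```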
Task 2.5 of the container-reboot-v2 plan.

- Add internal/hcs/exitstatus.go with systemExitStatus struct mirroring
  the HCS schema (Status, ExitType) and parseExitType helper. Unmarshal
  errors propagate; empty/missing payload returns ('', nil) so callers
  don't see spurious errors on non-exited notifications.

- Add exitType + exitTypeMu fields on *System plus an ExitType() getter
  (RLocked). Empty string before exit; 'Reboot' et al once populated.

- Wire into System.waitBackground: peek the SystemExitStatus payload
  ourselves before the existing err-only flow so we capture payload.data
  (the JSON). The peek replaces waitForNotification for this one
  notification type because waitForNotification's select is err-only —
  we'd lose the payload otherwise. System.waitBackground is the sole
  reader of this channel for the compute system's lifetime so the split
  is safe; other waiters go through waitForNotification on other
  notification types. Fallback path preserved for the 'callback context
  gone' edge case.

- Replace the Stage 1 placeholder span attrs (reboot.exit_type='',
  reboot.notification_data_bytes=0) with real values from the parsed
  payload.

Tests: 5 new parseExitType cases covering Reboot, GracefulExit, empty,
malformed JSON (returns err), and missing ExitType field (benign '').
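The parser contract described in this commit — empty payload is benign, malformed JSON is an error, a missing `ExitType` field yields "" — can be sketched in a few lines. Struct fields mirror the schema names given above (`Status`, `ExitType`); the exact JSON casing is an assumption:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// systemExitStatus mirrors the HCS schema fields named in the commit
// message. Illustrative; the real struct lives in internal/hcs/exitstatus.go.
type systemExitStatus struct {
	Status   int32  `json:"Status"`
	ExitType string `json:"ExitType"`
}

// parseExitType: empty payload returns ("", nil) so callers see no
// spurious errors on non-exited notifications; unmarshal errors propagate.
func parseExitType(data string) (string, error) {
	if data == "" {
		return "", nil
	}
	var s systemExitStatus
	if err := json.Unmarshal([]byte(data), &s); err != nil {
		return "", err
	}
	return s.ExitType, nil
}

func main() {
	for _, d := range []string{
		`{"Status":0,"ExitType":"Reboot"}`, // the interesting case
		`{"Status":0}`,                     // missing field: benign ""
		"",                                 // non-exited notification
		"not json",                         // malformed: error
	} {
		et, err := parseExitType(d)
		fmt.Printf("payload=%q exitType=%q err=%v\n", d, et, err)
	}
}
```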
Task 2.6 of the container-reboot-v2 plan. Extends the cow.Container
interface with ExitType() string so callers can observe the parsed
SystemExitStatus.ExitType carried up by *hcs.System.

*hcs.System already implements it (Task 2.5). Stub two other
cow.Container implementers to return '':

- *gcs.Container: talks to the LCOW guest directly, never sees an HCS
  SystemExitStatus. container-reboot-v2 is Argon-only so the LCOW path
  is out of scope; empty string is the correct 'unknown/fallback' answer.
- *jobcontainers.JobContainer: doesn't wrap an HCS compute system at all.

Callers treat empty string as 'unknown, use previous exit-handling
logic', so these stubs preserve existing behavior on non-Argon paths.
Task 2.7 of the container-reboot-v2 plan. When hcsExec.waitForContainerExit
observes the compute-system exit, surface the parsed ExitType via a logrus
Info entry — no behavior change, just a stable observability checkpoint.

Logs any non-empty ExitType, not just Reboot, so the shim trace reports
GracefulExit / UnexpectedExit the same way. Stage 4's handleReboot is
where the Reboot branch finally diverges from teardown; this log stays
useful in production as a compact 'what did HCS tell us' record.
…hook

Add a Reboot-observation point in hcsTask::waitInitExit, gated by
EnableShimRebootHandler. When a silo exits with ExitType=Reboot, emit
a stable Info log and set reboot.pending=true on the waitInitExit span.
No behavior change — teardown still runs — this is the reliable hook
Sub-step B will extend with actual handleReboot logic.

Why here vs hcsExec::waitForContainerExit:
  waitForContainerExit has a select between the container's WaitChannel
  (silo termination) and the init exec's processDone (init process exit).
  For an Argon reboot both fire near-simultaneously and in the Stage 3
  validation runs processDone won the race — meaning the existing Stage
  2 log in exec_hcs.go NEVER fired despite the reboot signal being
  present. waitInitExit runs unconditionally after init.Wait() returns,
  so it's a single, deterministic intercept.

Timing subtlety (debugged in-session):
  cow.Container.ExitType() is only defined AFTER WaitChannel() closes
  (cow.go:101). init.Wait() returns when the init PROCESS exits, but
  *hcs.System.waitBackground (which parses SystemExitStatus JSON into
  ExitType) runs on the system-level exit notification — a separate
  HCS callback. First run returned "" 100% of the time because the
  ExitType read happened ~22ms before waitBackground finished. Fix:
  block on ht.c.WaitChannel() (with 5s timeout warning) before reading
  ExitType.

Verified 2026-04-23 18:33 on reboot-v3:
  Span hcsTask::waitInitExit ... reboot.pending=true
  level=info msg="reboot-v2 Stage 4: would handle reboot here
    (no action; falling through to teardown)" reboot.exit_type=Reboot
…ate is possible

Two-part change, observation-only (no actual restart semantic yet):

B1 - internal/hcs/system.go: cache the hcsDocument on *System at creation
time, expose via System.CreateDocument() as a json.RawMessage. Previously
the document was assembled in hcsoci/create.go, marshaled, and discarded;
now it's retained on the System for later reissue by Sub-step B3's
handleReboot.

B2 - cmd/containerd-shim-runhcs-v1/task_hcs.go: in waitInitExit's Reboot
branch, BEFORE ht.close() (so the WCIFS overlay + HNS endpoint are still
live), run a probeSameIDRecreate that:
  1. Closes the old *hcs.System handle
  2. Calls hcs.CreateComputeSystem with the stashed doc on the same container ID
  3. Calls Start on the new system
  4. Logs each outcome, then Terminate+Wait+Close to clean up so the
     existing teardown path sees an empty slot

The point of the probe is to answer the Sub-step B design question: does
HCS reject same-ID recreate? Can the new silo pick up the old overlay and
HNS endpoint automatically?

Verified 2026-04-23 20:00 on reboot-v3 with all 5 guards on:
  reboot-v2 B2: closing old system handle before recreate probe (doc_bytes=700)
  reboot-v2 B2: CreateComputeSystem SUCCEEDED on same ID; calling Start
  reboot-v2 B2: Start SUCCEEDED — full create+start cycle works on same ID

Both assumptions from the research doc confirmed: (1) HCS accepts the
recreate with zero friction, (2) the overlay layer + HNS endpoint
registered for the container ID are reused by the new silo without
re-running hcsoci.CreateContainer. Sub-step B3 can now wire this into
ht.c / ht.init for an actual transparent restart.
… recreated silo

Extends probeSameIDRecreate: after hcs.CreateComputeSystem + newSys.Start
succeed on the reboot-recreated silo, also spawn a benign init process
via cmd.Cmd (mirroring the hcsExec.startInternal path). Waits for the
probe process to exit, logs the PID and exit code.

Uses a benign spec (cmd /c hostname) instead of ht.taskSpec.Process
because the real task spec on the current test-bed runs `shutdown /r`
and would cascade into an infinite reboot chain if re-executed on the
new silo. B3a is mechanics-only; B3b will use the unmodified spec once
the state-machine swap eliminates the cascade risk.

Verified 2026-04-23 21:16 on reboot-v3:
  reboot-v2 B2: closing old system handle before recreate probe (doc_bytes=700)
  reboot-v2 B2: CreateComputeSystem SUCCEEDED on same ID; calling Start
  reboot-v2 B2: Start SUCCEEDED — full create+start cycle works on same ID
  reboot-v2 B3a: probe init-process spawned probe.pid=2024
  reboot-v2 B3a: probe init-process exited — full recreate+spawn cycle verified
    probe.exit_code=0

The full HCS-API mechanics for transparent restart are now proven:
Close old handle -> CreateComputeSystem (same ID) -> System.Start ->
cmd.Start (init process). Each step logged with unambiguous success
markers. Sub-step B3b is the remaining piece: wire the new System and
new init exec into ht.c and ht.init, suppress ht.close(), so containerd
sees no /tasks/exit event. That's a shim-state-machine change, not an
HCS-API question.
First working transparent restart. On Reboot detection in waitInitExit,
the shim now:
  1. Closes the old *hcs.System handle
  2. Calls hcs.CreateComputeSystem with the cached document on the same ID
  3. Starts the new System
  4. Spawns the original init process (ht.taskSpec.Process) via cmd.Cmd
  5. Swaps ht.c = newSys
  6. Resets hcsExec state in-place under sl lock: c, p, pid, state=Running,
     exitStatus=255, exitedAt=zero, fresh processDone/exited channels +
     fresh sync.Once values
  7. Respawns waitForExit to track the new init process
  8. Returns from waitInitExit WITHOUT calling ht.close(ctx) — no TaskExit
     event published, task logically still Running

Verified 2026-04-23 21:39 on reboot-v3:
  reboot-v2 Stage 4: reboot observed; attempting transparent restart (B3b)
  reboot-v2 B3b: closing old system handle (doc_bytes=700)
  reboot-v2 B3b: new System created on same ID
  reboot-v2 B3b: new System started
  reboot-v2 B3b: new init process spawned new.pid=1848
  reboot-v2 B3b: task state swapped; container logically still Running
  reboot-v2 B3b: transparent restart completed; suppressing teardown

Docker reported the container as "Up About a minute" for the full window
between reboot-handled and our manual cleanup — FIRST TIME the transparent
restart is user-visible end-to-end.

KNOWN LIMITATIONS (Stage 5 cleanup):

* Stdio pipes: oldExec.io's upstream pipes were closed by the original
  init-exit path before our doHandleReboot ran. The new cmd.Cmd tries to
  reuse those closed pipes — immediately gets "file has already been
  closed" on stdout relay. The new init process is effectively blind.
  Fix: reopen the upstream IO pipes via NewUpstreamIO before spawning
  the new init.

* No reboot loop: if the new silo reboots again, we fall through to
  normal exit because waitInitExit already returned. Fix: respawn
  waitInitExit (or restructure as a for-loop) after handleReboot.

* Docker exec / docker rm deadlock: after the first restart, docker
  commands against the container hang. Root cause likely in the closed-
  stdio state or in our respawned waitForExit hitting an invalid IO.
  Needs debug + fix before this is shippable.

* PID visibility: containerd caches the original init PID from the
  TaskCreate event. docker inspect still reports the old PID even after
  successful restart. Cosmetic for now; a /tasks/start republish (or a
  new /tasks/reboot event type) would address it.

probeSameIDRecreate is retained as-is for reference / fallback during
iteration — will be removed once Sub-step C (loop + stdio fix) lands.
…back

Two follow-up fixes on top of B3b's transparent-restart prototype:

1. Reboot loop. After a successful handleReboot, respawn waitInitExit as
   a goroutine so a subsequent in-container reboot is also handled. Each
   cycle spawns a fresh waiter for the next one. The chain terminates
   naturally when the task ends (non-Reboot exit) or when an external
   docker stop/rm drives teardown.

2. Fresh stdio with headless fallback. Original plan: reopen the
   containerd-owned pipes with NewUpstreamIO against the original paths.
   Observed behavior: containerd tears down its server-side pipes when
   the shim's client disconnects during the original init-exit path, so
   NewUpstreamIO fails with "system cannot find the file specified"
   every time. Pragmatic fix: fall back to nil stdio and let the new
   init run headless. The process still runs, docker still sees the
   container as Up, follow-up ops (exec/stop/rm) no longer deadlock.
   Full stdio reattach needs a containerd-side change (or a shim pipe-
   republish protocol) and is out of scope for the prototype.
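The reboot loop of fix 1 reduces to a simple shape: each wait cycle either handles a Reboot exit and goes around again, or falls through to normal teardown on any other exit type. This sketch models the exit notifications as a channel of exit-type strings; the function and channel are illustrative, not the actual `waitInitExit` signature:

```go
package main

import "fmt"

// waitInitExitLoop handles consecutive Reboot exits until a non-Reboot
// exit (or a failed restart) terminates the chain — the same termination
// conditions described above.
func waitInitExitLoop(exits <-chan string, handleReboot func() error) string {
	for exitType := range exits {
		if exitType != "Reboot" {
			return exitType // normal teardown path
		}
		if err := handleReboot(); err != nil {
			return "RebootFailed" // restart failed: fall through to teardown
		}
		// Success: loop again, waiting on the new init's exit.
	}
	return "ChannelClosed"
}

func main() {
	exits := make(chan string, 3)
	exits <- "Reboot"
	exits <- "Reboot"
	exits <- "GracefulExit"

	cycles := 0
	final := waitInitExitLoop(exits, func() error { cycles++; return nil })
	fmt.Printf("reboot cycles=%d final=%s\n", cycles, final)
}
```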

Also stash req.Stdin/Stdout/Stderr/Terminal on hcsTask at newHcsTask
time so doHandleReboot can re-attempt NewUpstreamIO with the original
paths even though it's expected to fail today.

Verified 2026-04-23 22:18 on reboot-v3, one container lifecycle:
  t=0: docker run servercore cmd /c "start /b shutdown /r & ping -t"
  t=~33s:  reboot cycle 1 — new.pid=5368
  t=~63s:  reboot cycle 2 — new.pid=1180
  t=~93s:  reboot cycle 3 — new.pid=6636
  docker ps: "Up About a minute" throughout
  docker stop: Exited (1067)
  docker rm: clean removal

Remaining B3 gaps are non-deadlocking and mostly cosmetic:
  * Stdio not visible after restart (fundamental — needs containerd change)
  * docker inspect reports the original PID (cached in containerd's task
    state; would need /tasks/start republish or a new /tasks/reboot topic)
  * On shim shutdown the last headless silo may linger briefly (cleanup
    timing; doesn't affect user-facing behavior)
Now that the full transparent-reboot flow (detection -> create same-ID ->
spawn init -> state swap -> reboot loop) is working end-to-end, clean
up the Stage 4 iteration scaffolding:

- Remove probeSameIDRecreate function entirely. It was retained as a
  reference/fallback during iteration but is superseded by doHandleReboot
  and has no callers.

- Collapse "reboot-v2 B3b:" / "reboot-v2 B3c:" log prefixes to just
  "reboot-v2:". The sub-step labels were useful for differentiating probe
  runs during iteration but add noise now that there's a single reboot
  code path.

- Update the doHandleReboot docstring to reflect the final flow (all 9
  steps including fresh stdio + reboot loop) and its actual known gaps
  (stdio reattach, PID cache), removing the "B3c will do this later"
  TODO-style notes that no longer apply.

- Update the caller-site comment in waitInitExit to document that the
  reboot loop is the explicit reason we return without ht.close() — the
  respawned waitInitExit handles any subsequent reboot.

No behavior change. Verified green build (go build -ldflags "-s -w").
Next: redeploy + re-run the reboot cycle test to confirm nothing
regressed, then snapshot.

-143/+44 LOC net.