fix(lambda): add post-deregister busy check to prevent terminating active runners by JVenberg · Pull Request #5086 · github-aws-runners/terraform-aws-github-runner

JVenberg · 2026-03-30T15:50:12Z

Summary

The scale-down lambda can terminate an EC2 instance while it's actively running a job. This happens because a job can be assigned to a runner between checking its busy state and calling TerminateInstances.

We hit this in production: a Helm deploy was killed mid-execution when the scale-down lambda terminated the instance 13 seconds after the job started. CloudTrail confirmed the TerminateInstances call came from the scale-down lambda at the exact moment the runner received a shutdown signal.

The race condition

Current flow in removeRunner:

1. Check GitHub API: "Is this runner busy?" -> false
2. A job gets assigned to the runner here
3. Deregister the runner from GitHub
4. Terminate the EC2 instance
5. The in-flight job is killed
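
The interleaving above can be reproduced with a small in-memory model. This is only a sketch: `FakeRunner`, `tryAssignJob`, and `removeRunnerOld` are hypothetical names invented for illustration and are not code from the repository. The `between` hook stands in for whatever happens on GitHub's side during the gap between the busy check and deregistration.

```typescript
// Hypothetical in-memory model of one runner (not the real lambda code).
interface FakeRunner {
  registered: boolean; // still listed in GitHub, so it can be handed jobs
  busy: boolean;       // currently executing a job
  terminated: boolean; // EC2 instance has been terminated
}

// Assumption for this model: jobs only go to registered, idle runners.
function tryAssignJob(r: FakeRunner): boolean {
  if (!r.registered || r.busy) return false;
  r.busy = true;
  return true;
}

// Old flow: check busy, deregister, terminate.
// `between` runs in the window between the busy check and deregistration.
function removeRunnerOld(r: FakeRunner, between: () => void): void {
  if (r.busy) return;   // step 1: busy check says "idle"
  between();            // step 2: the race window
  r.registered = false; // step 3: deregister
  r.terminated = true;  // step 4: terminate, regardless of what happened in between
}

const runner: FakeRunner = { registered: true, busy: false, terminated: false };
removeRunnerOld(runner, () => { tryAssignJob(runner); });

// The runner picked up a job in the window and was terminated anyway.
const jobKilled = runner.busy && runner.terminated;
console.log(`job killed mid-execution: ${jobKilled}`); // prints "job killed mid-execution: true"
```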

The fix

Add a post-deregistration busy re-check:

1. Check busy (fast-path to skip obviously busy runners)
2. Deregister from GitHub (prevents new job assignment server-side)
3. Re-check busy (now stable, since no new jobs can be assigned after deregistration)

If the re-check finds the runner busy, we skip termination and let the instance be cleaned up as an orphan once the job finishes.
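
A minimal sketch of the reordered flow, under stated assumptions: `GitHubRunnerApi`, `Ec2Api`, `Outcome`, and `removeRunnerSketch` are hypothetical names and do not reflect the actual signatures in the scale-down lambda. A `null` result from `getBusy` stands in for a 404 from GitHub, which the test plan treats as "runner fully removed, safe to terminate".

```typescript
// Hypothetical interfaces for illustration; the real lambda talks to GitHub
// and EC2 through its own clients.
interface GitHubRunnerApi {
  // Resolves to the busy flag, or null if GitHub no longer knows the runner (404).
  getBusy(runnerId: number): Promise<boolean | null>;
  deregister(runnerId: number): Promise<void>;
}

interface Ec2Api {
  terminate(instanceId: string): Promise<void>;
}

type Outcome = "skipped-busy" | "left-for-orphan-cleanup" | "terminated";

async function removeRunnerSketch(
  gh: GitHubRunnerApi,
  ec2: Ec2Api,
  runnerId: number,
  instanceId: string,
): Promise<Outcome> {
  // 1. Fast path: do not touch runners that are obviously busy.
  if ((await gh.getBusy(runnerId)) === true) return "skipped-busy";

  // 2. Deregister first, so no new job can be assigned from here on.
  await gh.deregister(runnerId);

  // 3. Re-check: the answer can no longer flip from idle to busy.
  const busyAfter = await gh.getBusy(runnerId);
  if (busyAfter === true) {
    // A job slipped in before deregistration; let it finish and leave the
    // instance for orphan cleanup.
    return "left-for-orphan-cleanup";
  }

  // Idle, or already gone from GitHub (null): safe to terminate.
  await ec2.terminate(instanceId);
  return "terminated";
}
```

The point of the ordering is that the termination decision is made on a busy value read after deregistration, so it cannot go stale the way the pre-deregistration check can.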

Why this is safe

Deregistering a runner does not affect in-flight jobs. The runner worker uses job-scoped OAuth credentials from the job message, not the runner registration:

JobRunner.cs lines 80-95: the worker creates its own VssConnection using systemConnection credentials from the job message
The worker never checks runner registration status during execution
Deregistration only affects the listener (no new job pickup), not the worker (current job)

Test plan

All 130 existing tests pass
New test: runner that becomes busy between deregister and re-check is NOT terminated
New test: runner that returns 404 on post-deregister busy check IS terminated (runner fully removed from GitHub)
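
The two new cases can be read as a small decision table over the result of the post-deregister check. The sketch below is not the PR's test code: `RecheckResult` and `shouldTerminateAfterDeregister` are hypothetical names, and the actual tests exercise the lambda through mocked clients rather than a pure function.

```typescript
// Hypothetical shape for the post-deregister check result.
type RecheckResult = { status: 200; busy: boolean } | { status: 404 };

// Terminate only if the runner is gone from GitHub (404) or confirmed idle.
function shouldTerminateAfterDeregister(r: RecheckResult): boolean {
  if (r.status === 404) return true;
  return !r.busy;
}

const cases: { name: string; recheck: RecheckResult; expectTerminate: boolean }[] = [
  { name: "became busy between deregister and re-check", recheck: { status: 200, busy: true }, expectTerminate: false },
  { name: "404 on post-deregister check", recheck: { status: 404 }, expectTerminate: true },
  { name: "still idle after deregister", recheck: { status: 200, busy: false }, expectTerminate: true },
];

// Collect the names of any cases whose decision differs from the expectation.
const failures = cases
  .filter((c) => shouldTerminateAfterDeregister(c.recheck) !== c.expectTerminate)
  .map((c) => c.name);

console.log(failures.length === 0 ? "all cases pass" : `failed: ${failures.join(", ")}`);
```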

…tive runners

The scale-down lambda had a TOCTOU race condition where a job could be assigned to a runner between checking its busy state and terminating the EC2 instance. This caused in-flight jobs to be killed mid-execution.

The fix adds a post-deregistration busy re-check:

1. Check busy (fast-path to skip busy runners)
2. Deregister from GitHub (prevents new job assignment)
3. Re-check busy (now stable since no new jobs can be assigned)

If the runner became busy between step 1 and 2, the in-flight job completes using its job-scoped OAuth token and the instance is left for orphan cleanup.

Fixes github-aws-runners#5085

github-actions · 2026-06-29T06:13:55Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed if no further activity occurs. Thank you for your contributions.

npwolf · 2026-07-02T18:36:48Z

We independently hit the underlying race condition in production today and posted repro details on the linked issue (#5085 comment). This fix looks correct to us — the reorder (deregister → recheck busy) closes the TOCTOU window since in-flight jobs use job-scoped credentials that don't depend on registration state.

This has been open since March without review — @edersonbrilhante @Brend-Smits @npalm would one of you be able to take a look? Happy to help test if useful.

edersonbrilhante

@npwolf Can you fix the conflicts?

JVenberg requested a review from a team as a code owner March 30, 2026 15:50

github-actions Bot added the Stale label Jun 29, 2026

github-actions Bot removed the Stale label Jul 3, 2026

edersonbrilhante requested changes Jul 3, 2026

JVenberg wants to merge 1 commit into github-aws-runners:main from JVenberg:fix/scale-down-busy-check-race-condition

3 participants
