
[test] Stabilize flaky e2e tests by replacing hard sleeps with polling loops#561

Merged
moshevayner merged 5 commits into main from fix-flaky-tests on Apr 21, 2026

Conversation

@moshevayner
Member

What

  • Fix get-nb-id.sh to read the NodeBalancer IP directly from status.loadBalancer.ingress[0].ip instead of parsing it from the service hostname DNS label, which was brittle and caused widespread test failures.

  • Replace hard sleep + one-shot checks with polling loops (for i in {1..N}; do ... sleep 10; done) across 15+ tests. Hard sleeps either fail when the system is slow or waste time when it's fast.

  • Fix the old hostname-based NB ID lookup in
    lb-with-proxyprotocol-default-annotation (same root cause as get-nb-id.sh).

  • Fix a `name: c` typo in lb-created-with-reserved-ip-multiple-change-ip.

  • Remove sleep 180 in lb-with-node-addition before the node provisioning poll; extend the loop from 20→30 iterations to keep the same ~10m budget without a blind wait.
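As a sketch, the polling pattern the PR applies looks roughly like this. The `poll` helper and the flag-file demo are illustrative stand-ins, not the exact test code (the real tests inline the loop and shell out to kubectl and the Linode CLI):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative stand-in for the pattern: retry a check up to N times with
# a fixed gap, instead of one hard sleep followed by a one-shot check.
# (Helper name is hypothetical; the real tests inline this loop.)
poll() {
  local attempts=$1 interval=$2; shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
  done
  echo "condition not met after $((attempts * interval))s" >&2
  return 1
}

# The get-nb-id.sh fix reads the IP directly from the service status,
# along these lines (illustrative):
#   kubectl get svc "$SVC" -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Demo condition: a file appearing stands in for a cluster resource
# becoming ready.
flag=$(mktemp -u)
( sleep 2; touch "$flag" ) &
poll 10 1 test -f "$flag" && echo "ready"
```

The loop exits as soon as the condition holds, so a fast run pays almost no waiting cost, while a slow run still gets the full attempts-times-interval budget.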

Why

These tests were flaky due to timing, not real CCM bugs. The polling pattern is already established in the codebase; this PR applies it consistently.

Signed-off-by: Moshe Vayner moshe@vayner.me

General:

  • Have you removed all sensitive information, including but not limited to access keys and passwords?
  • Have you checked to ensure there aren't other open or closed Pull Requests for the same bug/feature/question?

Pull Request Guidelines:

  1. Does your submission pass tests?
  2. Have you added tests?
  3. Are you addressing a single feature in this PR?
  4. Are your commits atomic, addressing one change per commit?
  5. Are you following the conventions of the language?
  6. Have you saved your large formatting changes for a different PR, so we can focus on your work?
  7. Have you explained your rationale for why this feature is needed?
  8. Have you linked your PR to an open issue?

…lling loops

Signed-off-by: Moshe Vayner <moshe@vayner.me>
@github-actions github-actions bot added the testing label (for updates to the testing suite in the changelog) on Apr 17, 2026
@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.62%. Comparing base (a8ed016) to head (e991f37).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #561   +/-   ##
=======================================
  Coverage   73.62%   73.62%           
=======================================
  Files          19       19           
  Lines        2927     2927           
=======================================
  Hits         2155     2155           
  Misses        523      523           
  Partials      249      249           

☔ View full report in Codecov by Sentry.

Signed-off-by: Moshe Vayner <moshe@vayner.me>
…hainsaw test

Signed-off-by: Moshe Vayner <moshe@vayner.me>
echo "Nodebalancer ID: $nbid"

- for i in {1..20}; do
+ for i in {1..30}; do
Contributor


We should keep the large sleep since we are using CAPL to create a new node, and that can take ~2 mins to come up ready. Otherwise, using polling throughout the test cases is great!

Member Author


So, I basically kept it at the same length by increasing the counter to 30 from 20. Adding 10 more iterations of 20 seconds each is 200 seconds. I just made it a bit more flexible so that if the CAPL node comes up in less than 180 seconds, we save the time and move on to the next step as soon as everything's ready.
The math I did was that instead of waiting 180 seconds for CAPI, and then looping for 400 seconds max (total of 580 seconds), we just start the loop right away, but with a higher max of 600 seconds total, which even gives us an extra 20 seconds if we need them.
Does that make sense? If you don't think that's the right approach, I can definitely revert this change.
It's just that this test was sometimes failing so I figured I'd give it a shot with being a bit more flexible with the loop.
WDYT?

Contributor


Oh we definitely still wanna keep the polling. Just start it later.

Reason being that hammering the API when we know the state won't change might not be what we wanna do? We know that a node can take a minimum of 2 mins to come up, so why not sleep for that time and then start polling to reduce API calls?

We are also running other tests in parallel. Potentially running multiple GHA runs in parallel. I think it's better to reduce API calls whenever we can. Reduces our chances of hitting rate limits, timeouts, etc.
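The suggested sleep-then-poll shape could be sketched like this. The `wait_for_ready` helper, the timings, and the flag-file check are all illustrative; the real test shells out to kubectl, with roughly a 180s floor and 20s polls:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sleep-then-poll: wait out the known minimum provisioning time first,
# then start polling. This avoids API calls during the window where the
# state cannot have changed yet, while keeping the loop's fast exit once
# polling starts.
wait_for_ready() {
  local floor=$1 interval=$2 max_polls=$3; shift 3
  sleep "$floor"                 # known minimum; no point polling sooner
  for ((i = 1; i <= max_polls; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
  done
  echo "never became ready" >&2
  return 1
}

# Demo with short timings so the sketch runs quickly; the real test
# would use something like: wait_for_ready 180 20 20 <kubectl check>
flag=$(mktemp -u)
( sleep 1; touch "$flag" ) &
wait_for_ready 1 1 5 test -f "$flag" && echo "node ready"
```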

Member Author


Yeah, that makes total sense! I'll revert this one.
Thanks for catching that!

We need this to allow the new node to come up

Signed-off-by: Moshe Vayner <moshe@vayner.me>
`curl -w "%{http_code}"` already outputs "000" when a connection fails, then exits non-zero. Without error suppression, `set -euo pipefail` exits the script immediately. Add `|| true` to suppress the exit code while preserving curl's own `"000"` output, allowing the retry loop to continue.
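A minimal sketch of that fix; the endpoint, timeout, and retry bounds are illustrative, not the actual test values:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Under `set -e`, a failed curl would abort the script before the retry
# loop could react. `|| true` swallows curl's non-zero exit while keeping
# its "000" status output, so the loop keeps retrying.
# 192.0.2.1 (TEST-NET-1) is an illustrative unreachable endpoint.
for i in {1..3}; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 1 "http://192.0.2.1/" || true)
  if [ "$code" = "200" ]; then
    echo "endpoint is up"
    break
  fi
  echo "attempt $i: got HTTP $code, retrying"
  sleep 1
done
```

Note that `|| true` is applied inside the command substitution, so only curl's exit status is masked; the loop still decides what to do based on the captured status code.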

Signed-off-by: Moshe Vayner <moshe@vayner.me>
@moshevayner moshevayner merged commit f277ff7 into main Apr 21, 2026
12 checks passed
@moshevayner moshevayner deleted the fix-flaky-tests branch April 21, 2026 17:56

Labels

testing for updates to the testing suite in the changelog.


2 participants