
[test] Stabilize flaky e2e tests by replacing hard sleeps with polling loops#561

Merged
moshevayner merged 5 commits into main from fix-flaky-tests on Apr 21, 2026

Conversation

@moshevayner
Member

What

  • Fix get-nb-id.sh to read the NodeBalancer IP directly from status.loadBalancer.ingress[0].ip instead of parsing it from the service hostname DNS label, which was brittle and caused widespread test failures.

  • Replace hard sleep + one-shot checks with polling loops (for i in {1..N}; do ... sleep 10; done) across 15+ tests. Hard sleeps either fail when the system is slow or waste time when it's fast.

  • Fix the old hostname-based NB ID lookup in
    lb-with-proxyprotocol-default-annotation (same root cause as get-nb-id.sh).

  • Fix a `name: c` typo in lb-created-with-reserved-ip-multiple-change-ip.

  • Remove sleep 180 in lb-with-node-addition before the node provisioning poll; extend the loop from 20→30 iterations to keep the same ~10m budget without a blind wait.
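As a sketch, the polling pattern the PR applies looks roughly like this. The `poll` helper and the flag-file demo are illustrative stand-ins, not the exact test code (the real tests inline the loop and shell out to kubectl and the Linode CLI):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative stand-in for the pattern: retry a check up to N times with
# a fixed gap, instead of one hard sleep followed by a one-shot check.
# (Helper name is hypothetical; the real tests inline this loop.)
poll() {
  local attempts=$1 interval=$2; shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
  done
  echo "condition not met after $((attempts * interval))s" >&2
  return 1
}

# The get-nb-id.sh fix reads the IP directly from the service status,
# along these lines (illustrative):
#   kubectl get svc "$SVC" -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Demo condition: a file appearing stands in for a cluster resource
# becoming ready.
flag=$(mktemp -u)
( sleep 2; touch "$flag" ) &
poll 10 1 test -f "$flag" && echo "ready"
```

The loop exits as soon as the condition holds, so a fast run pays almost no waiting cost, while a slow run still gets the full attempts-times-interval budget.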

Why

These tests were flaky due to timing, not real CCM bugs. The polling pattern is already established in the codebase; this PR applies it consistently.

Signed-off-by: Moshe Vayner moshe@vayner.me

General:

  • Have you removed all sensitive information, including but not limited to access keys and passwords?
  • Have you checked to ensure there aren't other open or closed Pull Requests for the same bug/feature/question?

Pull Request Guidelines:

  1. Does your submission pass tests?
  2. Have you added tests?
  3. Are you addressing a single feature in this PR?
  4. Are your commits atomic, addressing one change per commit?
  5. Are you following the conventions of the language?
  6. Have you saved your large formatting changes for a different PR, so we can focus on your work?
  7. Have you explained your rationale for why this feature is needed?
  8. Have you linked your PR to an open issue?

…lling loops

Signed-off-by: Moshe Vayner <moshe@vayner.me>
@github-actions github-actions bot added the testing label (for updates to the testing suite in the changelog) on Apr 17, 2026
@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.62%. Comparing base (a8ed016) to head (e991f37).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #561   +/-   ##
=======================================
  Coverage   73.62%   73.62%           
=======================================
  Files          19       19           
  Lines        2927     2927           
=======================================
  Hits         2155     2155           
  Misses        523      523           
  Partials      249      249           

☔ View full report in Codecov by Sentry.

Signed-off-by: Moshe Vayner <moshe@vayner.me>
…hainsaw test

Signed-off-by: Moshe Vayner <moshe@vayner.me>
echo "Nodebalancer ID: $nbid"

- for i in {1..20}; do
+ for i in {1..30}; do
Contributor


We should keep the large sleep since we are using CAPL to create a new node, and that can take ~2 mins to come up ready. Otherwise, using polling throughout the test cases is great!

Member Author


So, I basically kept it at the same length by increasing the counter to 30 from 20. Adding 10 more iterations of 20 seconds each is 200 seconds. I just made it a bit more flexible so that if the CAPL node comes up in less than 180 seconds, we save the time and move on to the next step as soon as everything's ready.
The math I did was that instead of waiting 180 seconds for CAPI, and then looping for 400 seconds max (total of 580 seconds), we just start the loop right away, but with a higher max of 600 seconds total, which even gives us an extra 20 seconds if we need them.
Does that make sense? If you don't think that's the right approach, I can definitely revert this change.
It's just that this test was sometimes failing so I figured I'd give it a shot with being a bit more flexible with the loop.
WDYT?

Contributor


Oh we definitely still wanna keep the polling. Just start it later.

Reason being that hammering the API when we know the state won't change might not be what we wanna do? We know that a node can take a minimum of 2 mins to come up, so why not sleep for that time and then start polling to reduce API calls?

We are also running other tests in parallel. Potentially running multiple GHA runs in parallel. I think it's better to reduce API calls whenever we can. Reduces our chances of hitting rate limits, timeouts, etc.
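The suggested sleep-then-poll shape could be sketched like this. The `wait_for_ready` helper, the timings, and the flag-file check are all illustrative; the real test shells out to kubectl, with roughly a 180s floor and 20s polls:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sleep-then-poll: wait out the known minimum provisioning time first,
# then start polling. This avoids API calls during the window where the
# state cannot have changed yet, while keeping the loop's fast exit once
# polling starts.
wait_for_ready() {
  local floor=$1 interval=$2 max_polls=$3; shift 3
  sleep "$floor"                 # known minimum; no point polling sooner
  for ((i = 1; i <= max_polls; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$interval"
  done
  echo "never became ready" >&2
  return 1
}

# Demo with short timings so the sketch runs quickly; the real test
# would use something like: wait_for_ready 180 20 20 <kubectl check>
flag=$(mktemp -u)
( sleep 1; touch "$flag" ) &
wait_for_ready 1 1 5 test -f "$flag" && echo "node ready"
```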

Member Author


Yeah, that makes total sense! I'll revert this one.
Thanks for catching that!

We need this to allow the new node to come up

Signed-off-by: Moshe Vayner <moshe@vayner.me>
`curl -w "%{http_code}"` already outputs "000" when a connection fails, then exits non-zero. Without error suppression, `set -euo pipefail` exits the script immediately. Add `|| true` to suppress the exit code while preserving curl's own `"000"` output, allowing the retry loop to continue.
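A minimal sketch of that fix; the endpoint, timeout, and retry bounds are illustrative, not the actual test values:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Under `set -e`, a failed curl would abort the script before the retry
# loop could react. `|| true` swallows curl's non-zero exit while keeping
# its "000" status output, so the loop keeps retrying.
# 192.0.2.1 (TEST-NET-1) is an illustrative unreachable endpoint.
for i in {1..3}; do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 1 "http://192.0.2.1/" || true)
  if [ "$code" = "200" ]; then
    echo "endpoint is up"
    break
  fi
  echo "attempt $i: got HTTP $code, retrying"
  sleep 1
done
```

Note that `|| true` is applied inside the command substitution, so only curl's exit status is masked; the loop still decides what to do based on the captured status code.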

Signed-off-by: Moshe Vayner <moshe@vayner.me>
@moshevayner moshevayner merged commit f277ff7 into main Apr 21, 2026
12 checks passed
@moshevayner moshevayner deleted the fix-flaky-tests branch April 21, 2026 17:56

Labels

testing for updates to the testing suite in the changelog.


2 participants