[test] Stabilize flaky e2e tests by replacing hard sleeps with polling loops #561
moshevayner merged 5 commits into main
Conversation
…lling loops Signed-off-by: Moshe Vayner <moshe@vayner.me>
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main     #561   +/-   ##
=======================================
  Coverage   73.62%   73.62%
=======================================
  Files          19       19
  Lines        2927     2927
=======================================
  Hits         2155     2155
  Misses        523      523
  Partials      249      249

View full report in Codecov by Sentry.
Signed-off-by: Moshe Vayner <moshe@vayner.me>
…hainsaw test Signed-off-by: Moshe Vayner <moshe@vayner.me>
    echo "Nodebalancer ID: $nbid"

-    for i in {1..20}; do
+    for i in {1..30}; do
We should keep the large sleep, since we are using CAPL to create a new node, and that can take ~2 min to come up ready. Otherwise, using polling throughout the test cases is great!
So, I basically kept it at the same length by increasing the counter to 30 from 20. Adding 10 more iterations of 20 seconds each is 200 seconds. I just made it a bit more flexible so that if the CAPL node comes up in less than 180 seconds, we save the time and move on to the next step as soon as everything's ready.
The math I did was that instead of waiting 180 seconds for CAPI, and then looping for 400 seconds max (total of 580 seconds), we just start the loop right away, but with a higher max of 600 seconds total, which even gives us an extra 20 seconds if we need them.
Does that make sense? If you don't think that's the right approach, I can definitely revert this change.
It's just that this test was sometimes failing, so I figured I'd try being a bit more flexible with the loop.
WDYT?
Oh we definitely still wanna keep the polling. Just start it later.
Reason being that hammering the API when we know the state won't change might not be what we wanna do? We know that a node can take a minimum of 2 mins to come up, so why not sleep for that time and then start polling to reduce API calls?
We are also running other tests in parallel. Potentially running multiple GHA runs in parallel. I think it's better to reduce API calls whenever we can. Reduces our chances of hitting rate limits, timeouts, etc.
Yeah, that makes total sense! I'll revert this one.
Thanks for catching that!
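The pattern agreed on above — a blind sleep for the known minimum provisioning time, then a bounded polling loop — can be sketched as follows. This is a scaled-down illustration, not the actual test: `check_ready` is a hypothetical stand-in for the real cluster readiness check, and the one-second sleeps stand in for the real ~120s initial wait and 20s poll interval.

```shell
#!/usr/bin/env bash
# Sketch (assumed names/timings): sleep through the minimum boot time,
# then poll with a bounded retry count instead of one long hard sleep.
set -euo pipefail

attempt=0
check_ready() {
  # Stand-in for the real readiness check; reports ready on the third poll.
  attempt=$((attempt + 1))
  [ "$attempt" -ge 3 ]
}

sleep 1               # real test: ~120s, the minimum time a new node needs
ready=false
for i in {1..5}; do   # real test: {1..20}-style loop with a 20s sleep
  if check_ready; then
    ready=true
    break
  fi
  sleep 1
done
if [ "$ready" != true ]; then
  echo "node never became ready" >&2
  exit 1
fi
echo "ready after $i polls"
```

This keeps the API quiet during the window where the state cannot have changed, while still finishing early once the node is actually up.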
We need this to allow the new node to come up. Signed-off-by: Moshe Vayner <moshe@vayner.me>
`curl -w "%{http_code}"` already outputs "000" when a connection fails, then exits non-zero. Without error suppression, `set -euo pipefail` exits the script immediately.
Add `|| true` to suppress the exit code while preserving curl's own `"000"` output, allowing the retry loop to continue.
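A minimal sketch of why the `|| true` is needed: under `set -euo pipefail`, a refused connection would otherwise abort the script on the first attempt, even though curl has already printed "000". The URL below is an assumption for demonstration only — a local port with no listener, so the connection fails fast.

```shell
#!/usr/bin/env bash
set -euo pipefail

url="http://127.0.0.1:9"   # assumption: nothing listens here, connect fails fast
code=""
for i in {1..3}; do
  # `|| true` suppresses curl's non-zero exit so the retry loop continues;
  # curl's own "000" output from -w is still captured in $code.
  code=$(curl -s -o /dev/null -w "%{http_code}" \
              --connect-timeout 1 --max-time 2 "$url" || true)
  if [ "$code" = "200" ]; then
    break
  fi
  sleep 1
done
echo "last HTTP code: $code"
```

Without the `|| true`, the command substitution's non-zero status would trip `set -e` and the loop would never get a second attempt.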
Signed-off-by: Moshe Vayner <moshe@vayner.me>
5728cdb to e991f37
What

- Fix `get-nb-id.sh` to read the NodeBalancer IP directly from `status.loadBalancer.ingress[0].ip` instead of parsing it from the service hostname DNS label, which was brittle and caused widespread test failures.
- Replace hard `sleep` + one-shot checks with polling loops (`for i in {1..N}; do ... sleep 10; done`) across 15+ tests. Hard sleeps either fail when the system is slow or waste time when it's fast.
- Fix the old hostname-based NB ID lookup in `lb-with-proxyprotocol-default-annotation` (same root cause as `get-nb-id.sh`).
- Fix `name: c` typo in `lb-created-with-reserved-ip-multiple-change-ip`.
- Remove `sleep 180` in `lb-with-node-addition` before the node provisioning poll; extend the loop from 20 to 30 iterations to keep the same ~10m budget without a blind wait.

Why

These tests were flaky on timing, not because of real CCM bugs. The polling pattern is already established in the codebase. This PR applies it consistently.
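The status-based lookup can be sketched as below. This is not the real `get-nb-id.sh`: `get_svc_json` is a hypothetical stand-in that returns canned Service JSON so the extraction is demonstrable offline. Against a live cluster, the same field is typically read with `kubectl get svc <name> -o jsonpath='{.status.loadBalancer.ingress[0].ip}'`.

```shell
#!/usr/bin/env bash
# Sketch: read the NodeBalancer IP from the Service's load-balancer status
# rather than reverse-engineering it from the hostname DNS label.
set -euo pipefail

get_svc_json() {
  # Stand-in for `kubectl get svc ... -o json` with canned example data.
  cat <<'EOF'
{"status":{"loadBalancer":{"ingress":[{"ip":"192.0.2.10","hostname":"192-0-2-10.ip.linodeusercontent.com"}]}}}
EOF
}

# Minimal extraction without jq: pull the value of the "ip" field.
nb_ip=$(get_svc_json | sed -n 's/.*"ip":"\([^"]*\)".*/\1/p')
echo "NodeBalancer IP: $nb_ip"
```

Reading `status.loadBalancer.ingress[0].ip` avoids the brittleness of parsing the IP back out of a DNS label, which breaks whenever the hostname format changes.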
Signed-off-by: Moshe Vayner moshe@vayner.me