fix: Manage OIDC admin password secret via cluster_resources #919

Open

dervoeti wants to merge 3 commits into main from fix/test-wait-for-restarter-rollout

fix: Manage OIDC admin password secret via cluster_resources#919
dervoeti wants to merge 3 commits intomainfrom
fix/test-wait-for-restarter-rollout

Conversation

@dervoeti
Member

Description

The oidc-opa kuttl test fails consistently for all NiFi 2.x variants because the NiFi pod gets restarted shortly after becoming ready.

The OIDC admin password secret was created directly via client.create(). This caused a problem:
the commons-op restarter mutating webhook could not see the secret when it first computed annotations for the StatefulSet, producing incomplete restarter annotations. The restart controller then detected the missing annotation and patched the StatefulSet, triggering an unnecessary pod restart. The test proceeded because the replica was briefly ready, but the pod was then restarted by the restart controller. In slow CI environments (AKS), the restarted pod took over 5 minutes to come back, exceeding the test's 300s timeout.

We now build the OIDC admin password secret with proper labels and owner references, and apply it through cluster_resources.add() like other managed resources, which solves the problem by preventing the unnecessary restart.
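As a rough illustration of the pattern described above (a sketch only: the names `build_recommended_labels`, `ApplyOidcAdminPasswordSecretSnafu`, and the exact builder signatures are assumptions based on common stackable-operator conventions, not copied from the actual diff):

```rust
// Sketch only: identifiers are illustrative, not the actual PR diff.
//
// Previously, the Secret bypassed resource tracking entirely:
//
//     client.create(&oidc_admin_password_secret).await?;
//
// The fix builds it with recommended labels and an owner reference, then
// hands it to ClusterResources, which applies it via server-side apply
// like every other managed resource:

let secret = Secret {
    metadata: ObjectMetaBuilder::new()
        .name_and_namespace(nifi)
        .ownerreference_from_resource(nifi, None, Some(true))
        .context(ObjectMissingMetadataForOwnerRefSnafu)?
        .with_recommended_labels(labels) // labels built elsewhere
        .build(),
    // secret data (the generated admin password) omitted here
    ..Secret::default()
};

cluster_resources
    .add(client, secret)
    .await
    .context(ApplyOidcAdminPasswordSecretSnafu)?;
```

Because the secret now carries the recommended labels and owner reference before the StatefulSet is reconciled, the restarter webhook can resolve it when computing annotations, and the restart controller has nothing to patch afterwards.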

Definition of Done Checklist

  • Not all of these items are applicable to all PRs; the author should update this template to leave in only the boxes that are relevant
  • Please make sure all these things are done and tick the boxes

Author

  • Changes are OpenShift compatible
  • CRD changes approved
  • CRD documentation for all fields, following the style guide.
  • Helm chart can be installed and the deployed operator works
  • Integration tests passed (for non-trivial changes)
  • Changes need to be "offline" compatible
  • Links to generated (nightly) docs added
  • Release note snippet added

Reviewer

  • Code contains useful comments
  • Code contains useful logging statements
  • (Integration-)Test cases added
  • Documentation added or updated. Follows the style guide.
  • Changelog updated
  • Cargo.toml only contains references to git tags (not specific commits or branches)

Acceptance

  • Feature Tracker has been updated
  • Proper release label has been added
  • Links to generated (nightly) docs added
  • Release note snippet added
  • Add type/deprecation label & add to the deprecation schedule
  • Add type/experimental label & add to the experimental features tracker

@dervoeti dervoeti self-assigned this Apr 10, 2026
@dervoeti dervoeti force-pushed the fix/test-wait-for-restarter-rollout branch from ff14bbd to 8a68412 on April 10, 2026 at 17:48
@dervoeti dervoeti moved this to Development: Waiting for Review in Stackable Engineering Apr 10, 2026
@razvan razvan self-requested a review April 17, 2026 12:32
@razvan razvan moved this from Development: Waiting for Review to Development: In Review in Stackable Engineering Apr 17, 2026
@sbernauer sbernauer self-requested a review April 17, 2026 13:49
@sbernauer
Member

Thanks for the detailed report!
But I'm a bit confused, as the client.create() is before the cluster_resources.apply(), isn't it?
So aren't we now creating it even later with this PR?

The new code always reads in the secret to copy it and to write it out again, which looks a bit silly and actually causes many, many more Secret generations.
So actually doesn't the new code produce more generations in comparison to an ideal code, which only creates the Secret once (with generation 0)?

BTW, we added a shared function to op-rs in stackabletech/operator-rs#1187, which we could use for all these sorts of use-cases. Either that is broken (then we should fix it) or it is already there.
@razvan I'm glad we added exactly that :)

I'm only on a train right now, but I would be interested in understanding more deeply what exactly the problem is, as I fail to see how

  1. the Secret is created after the StatefulSet
  2. the Secret ends up with a generation > 1

@razvan
Member

razvan commented Apr 17, 2026

Without having seen the comment from @sbernauer, I allowed myself a little refactoring: 66a8d96

@dervoeti
Member Author

> Thanks for the detailed report! But I'm a bit confused, as the client.create() is before the cluster_resources.apply(), isn't it? So aren't we now creating it even later with this PR?
>
> The new code always reads in the secret to copy it and to write it out again, which looks a bit silly and actually causes many, many more Secret generations. So actually doesn't the new code produce more generations in comparison to an ideal code, which only creates the Secret once (with generation 0)?
>
> BTW, we added a shared function to op-rs in stackabletech/operator-rs#1187, which we could use for all these sorts of use-cases. Either that is broken (then we should fix it) or it is already there. @razvan I'm glad we added exactly that :)
>
> I'm only on a train right now, but I would be interested in understanding more deeply what exactly the problem is, as I fail to see how
>
>   1. the Secret is created after the StatefulSet
>   2. the Secret ends up with a generation > 1

I'm leaving for vacation soon so didn't have time to dig into this deeper, just a few notes:

  • Server-side apply on secrets should be idempotent (no new generation if nothing changed), but not ideal, yes
  • I'm not 100% sure about the details of the race condition tbh. I had the suspicion that commons-op was missing the secret when it was created via client.create(), and that adding it to cluster_resources would be the proper way. I tested it a couple of times and at least in my tests it fixed the problem. This might need further debugging.
  • I only skimmed operator-rs#1187 (feat: Add helper function to create random Secrets), but it sounds very useful for this case; in general feel free to refactor / take over this PR
  • I believe the main cause of the test failure (the restart controller kicking in unnecessarily, and the pod restart taking long in CI) is real, but there might be a better way to fix it properly

