Skip to content

[fix][client] Prevent duplicate ServiceUrlProvider initialization#25899

Merged
lhotari merged 2 commits into
apache:masterfrom
oneby-wang:service-url-provider-init-idempotency
Jun 3, 2026
Merged

[fix][client] Prevent duplicate ServiceUrlProvider initialization#25899
lhotari merged 2 commits into
apache:masterfrom
oneby-wang:service-url-provider-init-idempotency

Conversation

@oneby-wang
Copy link
Copy Markdown
Contributor

Motivation

#25892 fixed a flaky SameAuthParamsLookupAutoClusterFailoverTest by removing an extra manual failover.initialize(client) call from the test.

The root cause of that flakiness was duplicate initialization. PulsarClientBuilder.build() already initializes the configured ServiceUrlProvider through PulsarClientImpl, so calling initialize(client) again starts duplicate background checks for the same provider instance.

This is especially problematic for SameAuthParamsLookupAutoClusterFailover, because each initialize call creates a new broker-service-url-check EventLoopGroup. Multiple checker threads can then mutate the same failover state and produce subtle race conditions that are difficult to diagnose. AutoClusterFailover and ControlledClusterFailover have the same lifecycle risk: duplicate initialization can register duplicate scheduled tasks, and ControlledClusterFailover can also recreate its HTTP client without closing the previous one.

Modifications

  • Make AutoClusterFailover, ControlledClusterFailover, and SameAuthParamsLookupAutoClusterFailover fail fast when initialize(PulsarClient) is called more than once.
  • Remove the duplicate manual initialize call from SameAuthParamsLookupAutoClusterFailoverTest.testAutoClusterFailover because the provider is already initialized by PulsarClientBuilder.build().
  • Add dedicated tests that build a PulsarClient with each provider and then verify a second initialize(client) call throws IllegalStateException.

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

The threading model is affected only by preventing duplicate background failover check tasks from being registered for the same ServiceUrlProvider instance.

Copy link
Copy Markdown
Contributor

@void-ptr974 void-ptr974 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a lifecycle fix. If the same ServiceUrlProvider instance is reused to build a second client, the second build now fails with IllegalStateException, but the constructor failure path calls shutdown(), which unconditionally closes conf.getServiceUrlProvider(). That can close the provider still used by the first live client.

Example:

ServiceUrlProvider provider = AutoClusterFailover.builder()
        .primary(primary)
        .secondary(List.of(secondary))
        .failoverDelay(1, TimeUnit.SECONDS)
        .switchBackDelay(1, TimeUnit.SECONDS)
        .build();

PulsarClient client1 = PulsarClient.builder()
        .serviceUrlProvider(provider)
        .build();

PulsarClient.builder()
        .serviceUrlProvider(provider)
        .build(); // fails, then closes provider used by client1

Could we only close the provider on constructor failure if this PulsarClientImpl successfully initialized it?

@oneby-wang
Copy link
Copy Markdown
Contributor Author

Could we only close the provider on constructor failure if this PulsarClientImpl successfully initialized it?

@void-ptr974 Nice catch. I agree with this approach. @lhotari WDYT?

Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lhotari
Copy link
Copy Markdown
Member

lhotari commented Jun 2, 2026

Could we only close the provider on constructor failure if this PulsarClientImpl successfully initialized it?

@void-ptr974 Nice catch. I agree with this approach. @lhotari WDYT?

@oneby-wang It would be an improvement to handle this case. In addition, the javadoc of org.apache.pulsar.client.api.ServiceUrlProvider should be improved. There should be explanation that each instance lifecycle is tied to one PulsarClient instance. Similar javadoc should be added to the implementation where this is now enforced.

@oneby-wang
Copy link
Copy Markdown
Contributor Author

/pulsarbot rerun-failure-checks

@oneby-wang
Copy link
Copy Markdown
Contributor Author

oneby-wang commented Jun 3, 2026

It would be an improvement to handle this case.

@lhotari I see, let's get this merged first. I'll create another PR to address the improvement.

@lhotari
Copy link
Copy Markdown
Member

lhotari commented Jun 3, 2026

It would be an improvement to handle this case.

@lhotari I see, let's get this merged first. I'll create another PR to address the improvement.

@oneby-wang I think that the javadoc improvement belongs to this PR. This PR makes the implicit contract of the ServiceUrlProvider interface explicit and it's useful to document that in javadoc.

@oneby-wang
Copy link
Copy Markdown
Contributor Author

This PR makes the implicit contract of the ServiceUrlProvider interface explicit and it's useful to document that in javadoc.

@lhotari Addressed.

Copy link
Copy Markdown
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, good work @oneby-wang

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants