feat(spanner): add shared endpoint cooldowns for location-aware rerouting #12845

Open
rahul2393 wants to merge 12 commits into googleapis:main from rahul2393:endpoint-cooldown-re
@rahul2393 rahul2393 commented Apr 17, 2026

Summary

This PR improves Java Spanner's location-aware bypass routing when routed replicas are overloaded or unavailable, and extends score-based replica selection to strong preferLeader=true reads and queries.

The client now:

  • avoids recently overloaded routed endpoints using shared cooldowns
  • records RESOURCE_EXHAUSTED / UNAVAILABLE as EWMA error penalties
  • uses EWMA-based selection for both preferLeader=false and strong preferLeader=true read/query routing when operation_uid is available

It also keeps the location-aware read path lock-free via immutable group snapshots.

What changed

  • Added shared channel-level cooldown tracking for routed endpoints that return RESOURCE_EXHAUSTED / UNAVAILABLE, while still keeping request-scoped exclusions for same-logical-request retries.
  • Updated bypass retry behavior so eligible reads/queries can reroute to another replica instead of immediately returning to the same failed endpoint.
  • Recorded RESOURCE_EXHAUSTED / UNAVAILABLE as EWMA error penalties for routed replicas, so unhealthy endpoints are deprioritized even after the immediate retry/cooldown window.
  • Extended score-based routing to strong preferLeader=true read/query traffic when operation_uid is present, using leader preference as a bias instead of a hard override.
  • Kept preferLeader=true behavior unchanged for paths without operation_uid such as mutation/commit routing.
  • Refactored KeyRangeCache group state to immutable snapshots and removed per-group synchronization from the routing hot path.
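
For readers unfamiliar with the pattern, the immutable-snapshot refactor in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class names are hypothetical, tablets are reduced to plain strings, and the generation is modeled as a `long` rather than the `ByteString` the PR uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the PR's group state; all names are illustrative.
final class GroupSnapshotSketch {
  final long generation;      // assumption: the PR uses a ByteString generation
  final int leaderIndex;
  final List<String> tablets; // never mutated after construction

  GroupSnapshotSketch(long generation, int leaderIndex, List<String> tablets) {
    this.generation = generation;
    this.leaderIndex = leaderIndex;
    // Copy so later changes to the caller's list cannot leak into the snapshot.
    this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
  }
}

final class CachedGroupSketch {
  private final AtomicReference<GroupSnapshotSketch> current =
      new AtomicReference<>(new GroupSnapshotSketch(0L, -1, Collections.emptyList()));

  // Routing hot path: a single volatile read, no lock.
  GroupSnapshotSketch snapshot() {
    return current.get();
  }

  // Writers build a complete new snapshot and publish it only if it is newer.
  boolean update(long generation, int leaderIndex, List<String> tablets) {
    GroupSnapshotSketch next = new GroupSnapshotSketch(generation, leaderIndex, tablets);
    while (true) {
      GroupSnapshotSketch prev = current.get();
      if (generation <= prev.generation) {
        return false; // stale update, keep the existing snapshot
      }
      if (current.compareAndSet(prev, next)) {
        return true;
      }
      // Another writer won the race; re-read and re-check the generation.
    }
  }
}
```

A reader that obtained a snapshot before an update keeps a consistent view of generation, leader, and tablets, which is what allows the per-group lock to be dropped from the read path.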

@rahul2393 rahul2393 requested review from a team as code owners April 17, 2026 22:34
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an endpoint cooldown mechanism to handle RESOURCE_EXHAUSTED errors and refactors the KeyRangeCache to use immutable snapshots, replacing per-group locking to improve read performance. The new EndpointOverloadCooldownTracker manages short-lived cooldowns with exponential backoff and jitter, while KeyAwareChannel is updated to exclude endpoints on both RESOURCE_EXHAUSTED and UNAVAILABLE status codes. Feedback is provided to optimize the GroupSnapshot constructor by removing a redundant list copy.
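
As a rough illustration of a cooldown with exponential backoff and jitter, here is a minimal sketch. It is not the PR's EndpointOverloadCooldownTracker: the class name, the 1 s base, the 30 s cap, and the jitter range are all assumptions chosen for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only; names and constants are assumptions, not taken from the PR.
final class CooldownSketch {
  private static final long BASE_NANOS = 1_000_000_000L;  // assumed 1 s initial cooldown
  private static final long MAX_NANOS = 30_000_000_000L;  // assumed 30 s cap

  private static final class Entry {
    final int failures;
    final long untilNanos;

    Entry(int failures, long untilNanos) {
      this.failures = failures;
      this.untilNanos = untilNanos;
    }
  }

  private final ConcurrentHashMap<String, Entry> entries = new ConcurrentHashMap<>();

  // The cooldown ceiling doubles with each consecutive failure, up to the cap.
  static long ceilingNanos(int failures) {
    int shift = Math.min(Math.max(failures - 1, 0), 20);
    return Math.min(MAX_NANOS, BASE_NANOS << shift);
  }

  // Called when an endpoint returns RESOURCE_EXHAUSTED or UNAVAILABLE.
  void recordFailure(String endpoint, long nowNanos) {
    entries.compute(endpoint, (key, prev) -> {
      int failures = prev == null ? 1 : prev.failures + 1;
      long ceiling = ceilingNanos(failures);
      // Jitter: draw uniformly from [ceiling/2, ceiling] so callers do not retry in lockstep.
      long cooldown = ThreadLocalRandom.current().nextLong(ceiling / 2, ceiling + 1);
      return new Entry(failures, nowNanos + cooldown);
    });
  }

  // A success clears the endpoint's failure history.
  void recordSuccess(String endpoint) {
    entries.remove(endpoint);
  }

  boolean isCoolingDown(String endpoint, long nowNanos) {
    Entry e = entries.get(endpoint);
    // Subtraction keeps the comparison correct if System.nanoTime() values wrap.
    return e != null && nowNanos - e.untilNanos < 0;
  }
}
```

Callers would pass `System.nanoTime()` as `nowNanos`; taking the clock as a parameter keeps the sketch deterministic to test.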

Comment on lines +573 to +577
    private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) {
      this.generation = generation;
      this.leaderIndex = leaderIndex;
      this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
    }
medium

The GroupSnapshot constructor performs a redundant copy of the tablets list. Since the only caller (CachedGroup.update) already creates a new ArrayList, we can wrap it directly in an unmodifiable list to avoid unnecessary allocations.

Suggested change
-   private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) {
-     this.generation = generation;
-     this.leaderIndex = leaderIndex;
-     this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets));
-   }
+   private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) {
+     this.generation = generation;
+     this.leaderIndex = leaderIndex;
+     this.tablets = Collections.unmodifiableList(tablets);
+   }
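
One caveat worth keeping in mind with this suggestion: `Collections.unmodifiableList` returns a read-only view, not a copy, so dropping the inner `new ArrayList<>` is safe only while every caller hands over a list it never mutates or retains. The difference can be seen in a small standalone sketch (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; not part of the PR.
final class UnmodifiableViewDemo {
  // Read-only view backed by the caller's list: later changes to src show through.
  static List<String> wrapOnly(List<String> src) {
    return Collections.unmodifiableList(src);
  }

  // Defensive copy first: the result is isolated from later changes to src.
  static List<String> copyThenWrap(List<String> src) {
    return Collections.unmodifiableList(new ArrayList<>(src));
  }
}
```

If `src` gains an element after both methods have been called, the list from `wrapOnly` grows with it while the one from `copyThenWrap` does not. The suggested change therefore relies on CachedGroup.update remaining the only caller and never touching the list after construction.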

@rahul2393

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request implements latency-aware routing for Spanner endpoints by introducing a score-based replica selection mechanism using time-decayed EWMA. Key additions include registries for tracking endpoint latency and inflight requests, a cooldown tracker for overloaded endpoints, and updates to the KeyRangeCache to support score-aware selection. Feedback identifies several high-priority issues in the new static registries, including a memory leak in the latency tracker map due to accumulating operation identifiers, potential key collisions between different client instances sharing a JVM, and a race condition when updating inflight request counts. There is also a recommendation to reduce the maximum size of the request ID cache to prevent excessive memory consumption.
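
To make "time-decayed EWMA" concrete, here is a minimal sketch in which the weight of a new sample grows with the time elapsed since the previous one, and an error is recorded as a large synthetic latency sample. The class name, the 10 s decay constant, the minimum weight, and the penalty value are all assumptions for illustration; the PR's actual constants and formula may differ.

```java
// Illustrative only; names and constants are assumptions, not taken from the PR.
final class DecayedEwmaSketch {
  private static final double TAU_NANOS = 10e9;                    // assumed 10 s decay constant
  private static final double MIN_ALPHA = 0.05;                    // assumed floor on sample weight
  private static final double ERROR_PENALTY_MICROS = 1_000_000.0;  // assumed penalty sample

  private double value;
  private long lastNanos;
  private boolean initialized;

  // Weight of a new sample: near 1 after a long gap, floored for back-to-back samples.
  static double alpha(long elapsedNanos) {
    double a = 1.0 - Math.exp(-Math.max(elapsedNanos, 0L) / TAU_NANOS);
    return Math.max(a, MIN_ALPHA);
  }

  synchronized void record(double sampleMicros, long nowNanos) {
    if (!initialized) {
      value = sampleMicros; // the first sample seeds the average
      initialized = true;
    } else {
      double a = alpha(nowNanos - lastNanos);
      value += a * (sampleMicros - value);
    }
    lastNanos = nowNanos;
  }

  // RESOURCE_EXHAUSTED / UNAVAILABLE are folded in as a large latency sample.
  synchronized void recordError(long nowNanos) {
    record(ERROR_PENALTY_MICROS, nowNanos);
  }

  synchronized double get() {
    return value;
  }
}
```

The sketch uses `synchronized` for simplicity; read-modify-write updates to shared counters or averages need some form of atomicity, which is the kind of race the review flags for the inflight counts.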

@rahul2393

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request implements a score-aware load balancing and rerouting system for Spanner routed endpoints, introducing components like EndpointLatencyRegistry for tracking latencies and inflight requests, and EndpointOverloadCooldownTracker for managing overloaded replicas. The KeyRangeCache is updated to utilize a "Power of Two" selection strategy based on these metrics. Feedback identifies several improvement opportunities: refining cost calculations to use inflight counts even when latency data is absent, ensuring consistency by adding RESOURCE_EXHAUSTED to retryable codes for streaming SQL requests, and adjusting the EWMA decay logic to prevent score resets during near-simultaneous updates.
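
For context, "power of two choices" selection samples two distinct candidates at random and routes to the cheaper one. The sketch below also reflects the first feedback point by letting the inflight count influence the cost even when no latency sample exists yet. The class, the cost formula, and the default latency prior are assumptions for illustration, not the PR's implementation.

```java
import java.util.List;
import java.util.Random;

// Illustrative only; names, cost formula, and constants are assumptions.
final class PowerOfTwoSketch {
  static final double DEFAULT_LATENCY_MICROS = 1_000.0; // assumed prior when no sample exists

  static final class Replica {
    final String address;
    final double ewmaLatencyMicros; // Double.NaN when no latency has been recorded
    final int inflight;

    Replica(String address, double ewmaLatencyMicros, int inflight) {
      this.address = address;
      this.ewmaLatencyMicros = ewmaLatencyMicros;
      this.inflight = inflight;
    }
  }

  // Cost grows with latency and queue depth; inflight still counts without latency data.
  static double cost(Replica r) {
    double latency =
        Double.isNaN(r.ewmaLatencyMicros) ? DEFAULT_LATENCY_MICROS : r.ewmaLatencyMicros;
    return latency * (r.inflight + 1);
  }

  static Replica pick(List<Replica> replicas, Random random) {
    int n = replicas.size();
    if (n == 0) {
      throw new IllegalArgumentException("no replicas");
    }
    if (n == 1) {
      return replicas.get(0);
    }
    int i = random.nextInt(n);
    int j = random.nextInt(n - 1);
    if (j >= i) {
      j++; // shift so the second index is always distinct from the first
    }
    Replica a = replicas.get(i);
    Replica b = replicas.get(j);
    return cost(a) <= cost(b) ? a : b;
  }
}
```

With only two replicas both are always compared, so the idle one wins deterministically; with more replicas the random sampling spreads load while still avoiding the worse of each sampled pair.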

@rahul2393 rahul2393 force-pushed the endpoint-cooldown-re branch from a5a6665 to 4912eac Compare April 20, 2026 06:21