feat(spanner): add shared endpoint cooldowns for location-aware rerouting#12845
feat(spanner): add shared endpoint cooldowns for location-aware rerouting#12845rahul2393 wants to merge 12 commits intogoogleapis:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces an endpoint cooldown mechanism to handle RESOURCE_EXHAUSTED errors and refactors the KeyRangeCache to use immutable snapshots, replacing per-group locking to improve read performance. The new EndpointOverloadCooldownTracker manages short-lived cooldowns with exponential backoff and jitter, while KeyAwareChannel is updated to exclude endpoints on both RESOURCE_EXHAUSTED and UNAVAILABLE status codes. Feedback is provided to optimize the GroupSnapshot constructor by removing a redundant list copy.
| private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) { | ||
| this.generation = generation; | ||
| this.leaderIndex = leaderIndex; | ||
| this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets)); | ||
| } |
There was a problem hiding this comment.
The GroupSnapshot constructor performs a redundant copy of the tablets list. Since the only caller (CachedGroup.update) already creates a new ArrayList, we can wrap it directly in an unmodifiable list to avoid unnecessary allocations.
| private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) { | |
| this.generation = generation; | |
| this.leaderIndex = leaderIndex; | |
| this.tablets = Collections.unmodifiableList(new ArrayList<>(tablets)); | |
| } | |
| private GroupSnapshot(ByteString generation, int leaderIndex, List<TabletSnapshot> tablets) { | |
| this.generation = generation; | |
| this.leaderIndex = leaderIndex; | |
| this.tablets = Collections.unmodifiableList(tablets); | |
| } |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements latency-aware routing for Spanner endpoints by introducing a score-based replica selection mechanism using time-decayed EWMA. Key additions include registries for tracking endpoint latency and inflight requests, a cooldown tracker for overloaded endpoints, and updates to the KeyRangeCache to support score-aware selection. Feedback identifies several high-priority issues in the new static registries, including a memory leak in the latency tracker map due to accumulating operation identifiers, potential key collisions between different client instances sharing a JVM, and a race condition when updating inflight request counts. There is also a recommendation to reduce the maximum size of the request ID cache to prevent excessive memory consumption.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements a score-aware load balancing and rerouting system for Spanner routed endpoints, introducing components like EndpointLatencyRegistry for tracking latencies and inflight requests, and EndpointOverloadCooldownTracker for managing overloaded replicas. The KeyRangeCache is updated to utilize a "Power of Two" selection strategy based on these metrics. Feedback identifies several improvement opportunities: refining cost calculations to use inflight counts even when latency data is absent, ensuring consistency by adding RESOURCE_EXHAUSTED to retryable codes for streaming SQL requests, and adjusting the EWMA decay logic to prevent score resets during near-simultaneous updates.
a5a6665 to
4912eac
Compare
Summary
This PR improves Java Spanner's location-aware bypass routing when routed replicas are overloaded or unavailable, and extends score-based replica selection
The client now:
operation_uid is available
It also keeps the location-aware read path lock-free via immutable group snapshots.
What changed
returning to the same failed endpoint.