[AMORO-4253] AIP-5 Phase 1: dynamic allocation configuration and validation#4254
Open
j1wonpark wants to merge 2 commits into
Open
[AMORO-4253] AIP-5 Phase 1: dynamic allocation configuration and validation#4254j1wonpark wants to merge 2 commits into
j1wonpark wants to merge 2 commits into
Conversation
…dation First implementation phase of AIP-5 (Dynamic Resource Allocation for Optimizer, apache#4191). Configuration foundation only; no runtime scaling behavior. Dynamic allocation is opt-in and disabled by default, so existing groups are unaffected. - OptimizerProperties: declare the dynamic-allocation.* constants (with @SInCE); deprecate the flat min-parallelism in favor of the namespaced dynamic-allocation.min-parallelism, still honored as a fallback. - DynamicAllocationConfig: parse(ResourceGroup) + validate() enforcing the AIP-5 constraints only when enabled (max-parallelism mandatory, min <= max <= 1024, executor-idle-timeout >= 30s, positive durations, reject the external container). resolveMinParallelism centralizes the namespaced -> legacy -> 0 resolution. - OptimizerGroupController: reject an invalid DRA config with HTTP 400 on create/update before persisting. - DefaultOptimizingService: startup load is fail-safe (invalid config logs a warning and falls back to DRA-disabled instead of crashing AMS); the keeper min-parallelism auto-reset is skipped when DRA is effectively enabled, and otherwise writes the key the group actually reads (fixing an endless no-op loop for namespaced groups). Signed-off-by: Jiwon Park <jpark92@outlook.kr>
Contributor
Author
|
Hi @xxubai @czy006 @zhoujinsong — first PR for AIP-5 (#4191), config-only and disabled by default; would appreciate a review on whether the dynamic-allocation.* config surface and validation look right before later phases build on them. Thanks! |
Contributor
|
Thanks for your contribution @j1wonpark ! Will take a look soon |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why are the changes needed?
Close #4253.
This is the first implementation phase of AIP-5: Dynamic Resource Allocation for Optimizer (#4191). It lays the configuration foundation that the later scaling phases build on. There is no runtime scaling behavior in this PR — it is purely opt-in configuration plumbing, so it is safe to merge independently and has no effect on existing groups (
dynamic-allocation.enableddefaults tofalse).Splitting AIP-5 into self-contained PRs follows the design's five linear phases; demand-driven scale-up, idle tracking, scale-down + graceful drain, and the new metrics land in subsequent PRs. The full set of
dynamic-allocation.*properties is declared here (rather than drip-fed per phase) so the public configuration surface and its validation are settled in one place: a group created between phases cannot carry an unvalidated DRA property. A few getters (the timeout durations,max-parallelism) therefore have no caller until the phase that consumes them — this is intentional, not dead code.Brief change log
OptimizerProperties: declared thedynamic-allocation.*constants (each with a@sincetag) —enabled,min-parallelism,max-parallelism(+ the1024hard-cap constant),scheduler-backlog-timeout,sustained-backlog-timeout,executor-idle-timeout(+ its30sminimum),scale-down-cooldown,drain-timeout. Deprecated the flatmin-parallelismin favor of the namespaceddynamic-allocation.min-parallelism; it is still honored as a fallback.DynamicAllocationConfig(new):parse(ResourceGroup)builds a typed, immutable config from the group's properties;validate()enforces the AIP-5 constraints only when DRA is enabled —max-parallelismis mandatory,min-parallelism ≤ max-parallelism ≤ 1024(an over-limit value is rejected, not silently clamped),executor-idle-timeout ≥ 30s, all durations positive, and DRA is rejected on an externally-registered optimizer (conservative check this phase: theexternalcontainer, which AMS cannot scale).resolveMinParallelism()centralizes the namespaced → legacy →0resolution; a one-off deprecation warning is emitted at config-entry points, not on the keeper hot path.OptimizerGroupController:createResourceGroup/updateResourceGroupnow reject an invalid DRA config with HTTP 400 before persisting. (The update path previously had no property-value validation at all.)DefaultOptimizingService:optimizer_group_config_invalidgauge is wired in the observability phase.)getMinParallelism()now delegates toresolveMinParallelism(), preserving the existing lenient behavior while honoring the namespaced key.min-parallelismauto-reset (afteroptimizer-group.max-keeping-attemptsfailed creations) is adjusted for the new property model, per the AIP's Compatibility section:min-parallelismfloor and keeps retrying instead. An invalid config counts as disabled, matching the startup fail-safe.How was this patch tested?
Add some test cases that check the changes thoroughly including negative and positive cases if possible
TestDynamicAllocationConfig: rejection of enabled-without-max-parallelism,max < min, over-1024, sub-30sidle timeout, unparsable duration, zero/non-positive duration, and theexternalcontainer; correct parsing of a valid config; the namespaced → legacy →0resolution order;isEffectivelyEnabled(valid+enabled → true, disabled → false, enabled+invalid → false) andeffectiveMinParallelismKey.TestOptimizerGroupController: invalid create/update return 400, valid create succeeds.TestOptimizerGroupKeeper: auto-reset is skipped when DRA is enabled (the floor is preserved); a group using the namespaced key has the reset written to that key and no deprecated flat key is introduced; the existing legacy-group auto-reset scenarios still pass (no regression).Add screenshots for manual tests if appropriate (not applicable — no UI change)
Run test locally before making a pull request (affected
amoro-common+amoro-amstests pass — 35 tests green; spotless and checkstyle clean)Documentation
@sincetags andDynamicAllocationConfigdoc comments; the full design is in AIP-5 / [Feature]: Dynamic Resource Allocation for Optimizer #4191, and user-facing configuration docs land with the phase that first exposes runtime behavior)