Skip to content

[AMORO-4253] AIP-5 Phase 1: dynamic allocation configuration and validation#4254

Open
j1wonpark wants to merge 2 commits into
apache:masterfrom
j1wonpark:dra-phase1-config
Open

[AMORO-4253] AIP-5 Phase 1: dynamic allocation configuration and validation#4254
j1wonpark wants to merge 2 commits into
apache:masterfrom
j1wonpark:dra-phase1-config

Conversation

@j1wonpark

Copy link
Copy Markdown
Contributor

Why are the changes needed?

Close #4253.

This is the first implementation phase of AIP-5: Dynamic Resource Allocation for Optimizer (#4191). It lays the configuration foundation that the later scaling phases build on. There is no runtime scaling behavior in this PR — it is purely opt-in configuration plumbing, so it is safe to merge independently and has no effect on existing groups (dynamic-allocation.enabled defaults to false).

Splitting AIP-5 into self-contained PRs follows the design's five linear phases; demand-driven scale-up, idle tracking, scale-down + graceful drain, and the new metrics land in subsequent PRs. The full set of dynamic-allocation.* properties is declared here (rather than drip-fed per phase) so the public configuration surface and its validation are settled in one place: a group created between phases cannot carry an unvalidated DRA property. A few getters (the timeout durations, max-parallelism) therefore have no caller until the phase that consumes them — this is intentional, not dead code.

Brief change log

  • OptimizerProperties: declared the dynamic-allocation.* constants (each with a @since tag) — enabled, min-parallelism, max-parallelism (+ the 1024 hard-cap constant), scheduler-backlog-timeout, sustained-backlog-timeout, executor-idle-timeout (+ its 30s minimum), scale-down-cooldown, drain-timeout. Deprecated the flat min-parallelism in favor of the namespaced dynamic-allocation.min-parallelism; it is still honored as a fallback.

  • DynamicAllocationConfig (new): parse(ResourceGroup) builds a typed, immutable config from the group's properties; validate() enforces the AIP-5 constraints only when DRA is enabledmax-parallelism is mandatory, min-parallelism ≤ max-parallelism ≤ 1024 (an over-limit value is rejected, not silently clamped), executor-idle-timeout ≥ 30s, all durations positive, and DRA is rejected on an externally-registered optimizer (conservative check this phase: the external container, which AMS cannot scale). resolveMinParallelism() centralizes the namespaced → legacy → 0 resolution; a one-off deprecation warning is emitted at config-entry points, not on the keeper hot path.

  • OptimizerGroupController: createResourceGroup / updateResourceGroup now reject an invalid DRA config with HTTP 400 before persisting. (The update path previously had no property-value validation at all.)

  • DefaultOptimizingService:

    • Startup load is fail-safe, never silent: a persisted group with an invalid DRA config no longer crashes AMS — it logs a warning and falls back to DRA-disabled behavior. (The optimizer_group_config_invalid gauge is wired in the observability phase.)
    • getMinParallelism() now delegates to resolveMinParallelism(), preserving the existing lenient behavior while honoring the namespaced key.
    • The keeper's existing min-parallelism auto-reset (after optimizer-group.max-keeping-attempts failed creations) is adjusted for the new property model, per the AIP's Compatibility section:
      • It is skipped when DRA is effectively enabled (opted-in and valid) — DRA owns the group's scale decisions, so the keeper no longer erodes its min-parallelism floor and keeps retrying instead. An invalid config counts as disabled, matching the startup fail-safe.
      • When it does apply, it now writes whichever key the group actually reads (namespaced if present, else the legacy flat key). Previously it always wrote the flat key; for a group using the namespaced property that write was shadowed by resolution order, turning auto-reset into an endless no-op loop (repeated warning + DB update with no effect). This is fixed here.

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

    • TestDynamicAllocationConfig: rejection of enabled-without-max-parallelism, max < min, over-1024, sub-30s idle timeout, unparsable duration, zero/non-positive duration, and the external container; correct parsing of a valid config; the namespaced → legacy → 0 resolution order; isEffectivelyEnabled (valid+enabled → true, disabled → false, enabled+invalid → false) and effectiveMinParallelismKey.
    • TestOptimizerGroupController: invalid create/update return 400, valid create succeeds.
    • TestOptimizerGroupKeeper: auto-reset is skipped when DRA is enabled (the floor is preserved); a group using the namespaced key has the reset written to that key and no deprecated flat key is introduced; the existing legacy-group auto-reset scenarios still pass (no regression).
  • Add screenshots for manual tests if appropriate (not applicable — no UI change)

  • Run test locally before making a pull request (affected amoro-common + amoro-ams tests pass — 35 tests green; spotless and checkstyle clean)

Documentation

  • Does this pull request introduce a new feature? (yes — AIP-5 foundation, opt-in and disabled by default)
  • If yes, how is the feature documented? (JavaDocs — @since tags and DynamicAllocationConfig doc comments; the full design is in AIP-5 / [Feature]: Dynamic Resource Allocation for Optimizer #4191, and user-facing configuration docs land with the phase that first exposes runtime behavior)

…dation

First implementation phase of AIP-5 (Dynamic Resource Allocation for
Optimizer, apache#4191). Configuration foundation only; no runtime scaling
behavior. Dynamic allocation is opt-in and disabled by default, so
existing groups are unaffected.

- OptimizerProperties: declare the dynamic-allocation.* constants (with
  @SInCE); deprecate the flat min-parallelism in favor of the namespaced
  dynamic-allocation.min-parallelism, still honored as a fallback.
- DynamicAllocationConfig: parse(ResourceGroup) + validate() enforcing
  the AIP-5 constraints only when enabled (max-parallelism mandatory,
  min <= max <= 1024, executor-idle-timeout >= 30s, positive durations,
  reject the external container). resolveMinParallelism centralizes the
  namespaced -> legacy -> 0 resolution.
- OptimizerGroupController: reject an invalid DRA config with HTTP 400 on
  create/update before persisting.
- DefaultOptimizingService: startup load is fail-safe (invalid config
  logs a warning and falls back to DRA-disabled instead of crashing AMS);
  the keeper min-parallelism auto-reset is skipped when DRA is
  effectively enabled, and otherwise writes the key the group actually
  reads (fixing an endless no-op loop for namespaced groups).

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
@j1wonpark

Copy link
Copy Markdown
Contributor Author

Hi @xxubai @czy006 @zhoujinsong — first PR for AIP-5 (#4191), config-only and disabled by default; would appreciate a review on whether the dynamic-allocation.* config surface and validation look right before later phases build on them. Thanks!

@xxubai

xxubai commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Thanks for your contribution @j1wonpark ! Will take a look soon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Subtask]: AIP-5 Phase 1 — Dynamic allocation configuration, validation, and min-parallelism deprecation

2 participants