Skip to content

[fix](streaming-job) propagate ALTER source/target properties to runtime and persistence#62553

Open
JNSimba wants to merge 5 commits intoapache:masterfrom
JNSimba:fix/source_target_props_restart
Open

[fix](streaming-job) propagate ALTER source/target properties to runtime and persistence#62553
JNSimba wants to merge 5 commits intoapache:masterfrom
JNSimba:fix/source_target_props_restart

Conversation

@JNSimba
Copy link
Copy Markdown
Member

@JNSimba JNSimba commented Apr 16, 2026

What problem does this PR solve?

ALTER JOB on a non-TVF streaming CDC job could silently fail in two ways:

  1. Persistence: replayOnUpdated didn't sync sourceProperties / targetProperties, so the values were lost after FE checkpoint/restart or on follower FE.
  2. Runtime: JdbcSourceOffsetProvider held an independent copy of sourceProperties seeded at init time. ALTER on credentials (password, driver_url, etc.) never reached the provider's BE RPCs — the job would keep trying the old credentials.

Fix

  1. replayOnUpdated now copies sourceProperties / targetProperties from the replay job.
  2. Job caches a convertedSourceProperties (transient) that's refreshed on ALTER; all provider and task entry points (fetchMeta / cleanup / createStreamingMultiTblTask) push it to the provider via a new ensureInitialized override.

Tests

  • test_streaming_mysql_job_alter_props_restart_fe.groovy — ALTER props survive FE restart.
  • test_streaming_mysql_job_alter_cred.groovy — ALTER password with wrong credentials pauses the job; ALTER back recovers without FE restart.

Release note

Fix streaming job ALTER JOB source/target properties not taking effect at runtime and being lost after FE checkpoint/restart.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test
    • No need to test
  • Behavior changed:

    • No.
  • Does this need documentation?

    • No.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 16, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/4) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/664) 🎉
Increment coverage report
Complete coverage report

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 20, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • Blocking: StreamingInsertJob.replayOnUpdated() now copies the top-level sourceProperties, but it still leaves the already-instantiated JdbcSourceOffsetProvider with its old cached sourceProperties / snapshotParallelism. JobManager.replayUpdateJob() replays onto the existing in-memory job object, so after follower promotion the source-side RPCs (fetchRemoteMeta, compareOffset, requestTableSplits, initSourceReader) still run with stale JDBC config. ALTERing credentials or snapshot parallelism is therefore still lost on the real CDC path even though SHOW JOBS now prints the new map.

Critical Checkpoints

  • Goal of current task: Partially met. The PR fixes persisted ExecuteSql / top-level property maps, but it does not fully restore altered source-side behavior after replay/failover. The added regression only proves the ExecuteSql representation, not the source RPC path.
  • Change size/focus: Small and focused, but too small to cover the cached provider state that replay also needs to refresh.
  • Concurrency: No new concurrent path is introduced here.
  • Lifecycle / persistence: Not fully correct. Replay is still not equivalent to the master-side state transition because one cached runtime object keeps pre-ALTER source config.
  • Parallel code paths: The TVF path already refreshes ALTER-able source fields in JdbcTvfSourceOffsetProvider.ensureInitialized(), but the non-TVF JDBC path still does not.
  • Test coverage: Incomplete for source-property replay. The regression alters snapshot_split_size only after the snapshot phase has already finished and then mostly asserts ExecuteSql, so it would not catch the stale-provider failover path described above.
  • Observability / performance: No additional concerns noted beyond the correctness issue above.

Requesting changes until the replay path refreshes the JDBC provider state as well.

@JNSimba JNSimba changed the title [fix](streaming-job) sync sourceProperties and targetProperties in replayOnUpdated [fix](streaming-job) propagate ALTER source/target properties to runtime and persistence Apr 20, 2026
@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 20, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found 2 blocking issues and 1 minor test issue.

Critical checkpoints:

  • Goal of current task: Partially achieved. Persistence replay of sourceProperties/targetProperties is fixed, but runtime ALTER propagation is still incomplete because the BE cdc_client keeps reader/deserializer state cached across tasks. The added tests do not cover that path.
  • Modification size/focus: Reasonably focused, but the runtime fix stops at the FE/provider layer and misses the BE reader lifecycle.
  • Concurrency: No new lock-order or heavy-under-lock problem stood out in the changed paths.
  • Lifecycle/state management: Not fully correct. ALTER JOB can leave sourceProperties partially updated on FE when cert conversion fails, and the BE SourceReader keeps old source-property-derived caches across resume.
  • Configuration items: No new configs.
  • Compatibility: No incompatible protocol or storage-format change identified.
  • Parallel code paths: The non-TVF runtime path is only partially updated; some allowed source-property ALTERs still do not take effect.
  • Special conditional checks: No additional concern beyond the non-atomic source-property update.
  • Test coverage: The new regression tests cover restart persistence and password rotation, but they miss reader-cached source props (for example table.<tbl>.exclude_columns) and failed-ALTER atomicity. One new assertion also compares task counts lexicographically.
  • Test result files: The new .out file looks consistent.
  • Observability: Existing logs look sufficient for these paths.
  • Transaction/persistence: The replay change for sourceProperties/targetProperties looks correct.
  • Data writes/modifications: Runtime write behavior can still use stale source-property-derived caches on the BE side.
  • FE-BE variable passing: FE now sends refreshed props, but the cached BE reader does not fully re-consume them after first initialization.
  • Performance: snapshot_parallelism ALTER is still not fully honored at runtime because the cached BE reader keeps the old poll executor size.

Please address the inline comments before merge.

@JNSimba JNSimba removed the dev/4.0.x label Apr 20, 2026
@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 20, 2026

/review

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 20, 2026

run buildall

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found 2 blocking issues.

  1. StreamingInsertJob.replayOnUpdated() now persists the updated sourceProperties, but replay still does not invalidate the transient convertedSourceProperties cache. On any FE that has already initialized this job once (for example a former master that later replays this ALTER as a follower), later promotion without restart will still drive fetchMeta() and createStreamingMultiTblTask() with the pre-ALTER converted map.
  2. ALTER JOB ... FROM MYSQL ("snapshot_split_size" = ...) is still a silent runtime no-op. Snapshot splits are materialized once at job creation and then restored from streaming_job_meta; resume/restart never rebuilds them from the altered value, so the new property only changes the displayed/persisted metadata.

Critical checkpoint conclusions:

  • Goal / correctness: The PR fixes restart persistence of sourceProperties / targetProperties and improves runtime propagation on the active FE, but it still misses replay/failover cache invalidation and still accepts at least one source-property ALTER that does not affect runtime.
  • Scope / minimality: The change is focused, but the new FE-side rejection list is still incomplete for source properties that cannot actually take effect on an existing job.
  • Concurrency: I did not find a new lock-order or blocking-under-lock issue in the touched FE path; the main remaining risk is replay/lifecycle state rather than thread safety.
  • Lifecycle / persistence: EditLog write/replay coverage exists for the new maps, but replay is not yet equivalent to master-side runtime state because the transient converted-source cache is left stale.
  • Parallel paths: The non-TVF runtime path is only partially covered; the precomputed snapshot-split path still ignores some ALTERed source properties.
  • Tests: The new regression tests cover FE restart persistence and credential rotation, but there is still no coverage for replay on an already-initialized FE object / failover, and the restart test currently uses a source property whose runtime semantics are still unchanged.
  • Observability: Existing logs are sufficient for the touched path; I do not see a new metrics gap here.

I did not run the full FE / regression test suite in this review environment.

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 0.00% (0/47) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 10.20% (5/49) 🎉
Increment coverage report
Complete coverage report

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 21, 2026

/review

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 21, 2026

run buildall

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • Blocking: the new "reject silent no-op source keys" validation only covers non-TVF FROM MYSQL/POSTGRES ... TO DATABASE jobs. The parallel cdc_stream(...) SQL path still allows keys such as snapshot_parallelism, snapshot_split_size, and table.<tbl>.exclude_columns, even though those values are also cached only at BE SourceReader / deserializer initialization time and are not refreshed by SQL ALTER. Because the job-driven TVF path reuses the same BE reader in fs_brokers/cdc_client/common/Env.jobContexts and fetchRecordStream() does not close it for numeric job IDs, these ALTERs remain silent no-ops.
  • Medium: replayOnUpdated() now invalidates baseCommand / originTvfProps, but it still leaves the existing encryptedSql cache untouched. Since JobManager.replayUpdateJob() mutates the existing in-memory job object instead of replacing it, a follower/new master can keep showing the pre-ALTER SQL in SHOW JOBS even though executeSql has been replayed.

Critical Checkpoints

  • Goal of current task: Partially met. The PR fixes non-TVF source/target property replay and the runtime credential-refresh path, but it still leaves a parallel TVF ALTER path where first-initialize-only source properties are accepted yet ineffective.
  • Change size/focus: Small and focused.
  • Concurrency: No new lock-order or deadlock issue found in the modified flow.
  • Lifecycle / persistence: Not fully equivalent on replay because the SQL-derived encryptedSql cache is not refreshed with the replayed executeSql.
  • Parallel code paths: Not fully covered. The new no-op-property rejection is implemented only for the non-TVF path; the cdc_stream SQL path still misses the same protection.
  • Test coverage: Improved for non-TVF runtime/persistence, but still missing coverage for SQL/TVF ALTER of first-initialize-only properties and for follower/failover SQL display after replay.
  • Observability / performance: No additional concerns beyond the correctness items above.

}

// Reject keys that the runtime reads only at first initialize and never refreshes,
// so ALTER would be a silent no-op. See JdbcSourceOffsetProvider / DebeziumJsonDeserializer.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new guard only covers the non-TVF FROM ... TO DATABASE path. The parallel cdc_stream(...) SQL path in checkUnmodifiableProperties() still allows snapshot_parallelism, snapshot_split_size, and table.<tbl>.exclude_columns, even though those are also read only during BE reader/deserializer initialization. The job-driven TVF path reuses the same SourceReader in fs_brokers/cdc_client/common/Env.jobContexts, and fetchRecordStream() does not close it for numeric job IDs, so a SQL ALTER of those keys is still a silent no-op there.

}
setExecuteSql(replayJob.getExecuteSql());
// SQL-derived caches must be invalidated together so the next parse uses the replayed SQL.
this.baseCommand = null;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

baseCommand and originTvfProps are invalidated here, but encryptedSql is another SQL-derived cache on the current in-memory job. Because JobManager.replayUpdateJob() mutates the existing object instead of replacing it, leaving encryptedSql untouched means a follower/new master can keep returning the pre-ALTER SQL from getShowSQL() after replay/failover. This cache should be cleared or copied from replayJob together with executeSql.

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 21, 2026

run external

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 21, 2026

run vault_p0

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 21, 2026

run external

JNSimba and others added 5 commits April 22, 2026 10:49
…sistence after FE restart

Regression test to verify that sourceProperties and targetProperties
altered via ALTER JOB survive FE restart. Expected to fail until
StreamingInsertJob.replayOnUpdated() is fixed to sync both property maps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…playOnUpdated

After ALTER JOB changes source/target properties, replayOnUpdated()
did not propagate sprops/tprops into the replaying job. The checkpoint
thread replays journals on its own Env, so the generated image also
lost the ALTERed values — any subsequent FE restart (from image or
EditLog) would revert to the original CREATE-time properties.

For CDC jobs this can cause follower failover to use stale JDBC
credentials or miss updated load options such as max_filter_ratio.

Also tighten the regression test's ALTER JOB to only specify the
changed properties.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ffsetProvider

Non-TVF JDBC jobs kept an independent sourceProperties copy on the
offset provider seeded by initSourceJob, so ALTER JOB credential changes
(e.g. password rotation) never reached the provider's BE RPCs.

Cache the converted form of sourceProperties on the job and push it to
the provider via ensureInitialized on every fetchMeta / cleanup cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rived caches on replay

- AlterJobCommand: reject ALTER on snapshot_split_size (only read by the initial
  splitChunks pass; subsequent restart paths restore persisted splits via
  replayIfNeed and never re-split, so the ALTER would only change the persisted
  metadata, not the running split layout).

- StreamingInsertJob.replayOnUpdated: invalidate the transient
  convertedSourceProperties, baseCommand, and originTvfProps caches so that a
  formerly-promoted FE that is replaying this ALTER as a follower does not keep
  returning the pre-ALTER values if it is promoted back without restart.

- Regression: swap test_streaming_mysql_job_alter_props_restart_fe from
  snapshot_split_size to altering `user`; ALTER the user back to root before
  RESUME so the job can authenticate upstream MySQL post-restart.

- Regression: add snapshot_split_size reject case in
  test_streaming_mysql_job_create_alter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JNSimba JNSimba force-pushed the fix/source_target_props_restart branch from 7438a81 to ba3c76d Compare April 22, 2026 02:49
@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 22, 2026

run buildall

@JNSimba
Copy link
Copy Markdown
Member Author

JNSimba commented Apr 22, 2026

run external

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 11.48% (7/61) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants