[fix](streaming-job) propagate ALTER source/target properties to runtime and persistence#62553
[fix](streaming-job) propagate ALTER source/target properties to runtime and persistence#62553JNSimba wants to merge 5 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
Findings
- Blocking:
StreamingInsertJob.replayOnUpdated()now copies the top-levelsourceProperties, but it still leaves the already-instantiatedJdbcSourceOffsetProviderwith its old cachedsourceProperties/snapshotParallelism.JobManager.replayUpdateJob()replays onto the existing in-memory job object, so after follower promotion the source-side RPCs (fetchRemoteMeta,compareOffset,requestTableSplits,initSourceReader) still run with stale JDBC config. ALTERing credentials or snapshot parallelism is therefore still lost on the real CDC path even thoughSHOW JOBSnow prints the new map.
Critical Checkpoints
- Goal of current task: Partially met. The PR fixes persisted
ExecuteSql/ top-level property maps, but it does not fully restore altered source-side behavior after replay/failover. The added regression only proves theExecuteSqlrepresentation, not the source RPC path. - Change size/focus: Small and focused, but too small to cover the cached provider state that replay also needs to refresh.
- Concurrency: No new concurrent path is introduced here.
- Lifecycle / persistence: Not fully correct. Replay is still not equivalent to the master-side state transition because one cached runtime object keeps pre-ALTER source config.
- Parallel code paths: The TVF path already refreshes ALTER-able source fields in
JdbcTvfSourceOffsetProvider.ensureInitialized(), but the non-TVF JDBC path still does not. - Test coverage: Incomplete for source-property replay. The regression alters
snapshot_split_sizeonly after the snapshot phase has already finished and then mostly assertsExecuteSql, so it would not catch the stale-provider failover path described above. - Observability / performance: No additional concerns noted beyond the correctness issue above.
Requesting changes until the replay path refreshes the JDBC provider state as well.
|
/review |
There was a problem hiding this comment.
I found 2 blocking issues and 1 minor test issue.
Critical checkpoints:
- Goal of current task: Partially achieved. Persistence replay of
sourceProperties/targetPropertiesis fixed, but runtime ALTER propagation is still incomplete because the BE cdc_client keeps reader/deserializer state cached across tasks. The added tests do not cover that path. - Modification size/focus: Reasonably focused, but the runtime fix stops at the FE/provider layer and misses the BE reader lifecycle.
- Concurrency: No new lock-order or heavy-under-lock problem stood out in the changed paths.
- Lifecycle/state management: Not fully correct.
ALTER JOBcan leavesourcePropertiespartially updated on FE when cert conversion fails, and the BESourceReaderkeeps old source-property-derived caches across resume. - Configuration items: No new configs.
- Compatibility: No incompatible protocol or storage-format change identified.
- Parallel code paths: The non-TVF runtime path is only partially updated; some allowed source-property ALTERs still do not take effect.
- Special conditional checks: No additional concern beyond the non-atomic source-property update.
- Test coverage: The new regression tests cover restart persistence and password rotation, but they miss reader-cached source props (for example
table.<tbl>.exclude_columns) and failed-ALTER atomicity. One new assertion also compares task counts lexicographically. - Test result files: The new
.outfile looks consistent. - Observability: Existing logs look sufficient for these paths.
- Transaction/persistence: The replay change for
sourceProperties/targetPropertieslooks correct. - Data writes/modifications: Runtime write behavior can still use stale source-property-derived caches on the BE side.
- FE-BE variable passing: FE now sends refreshed props, but the cached BE reader does not fully re-consume them after first initialization.
- Performance:
snapshot_parallelismALTER is still not fully honored at runtime because the cached BE reader keeps the old poll executor size.
Please address the inline comments before merge.
|
/review |
|
run buildall |
There was a problem hiding this comment.
I found 2 blocking issues.
StreamingInsertJob.replayOnUpdated()now persists the updatedsourceProperties, but replay still does not invalidate the transientconvertedSourcePropertiescache. On any FE that has already initialized this job once (for example a former master that later replays this ALTER as a follower), later promotion without restart will still drivefetchMeta()andcreateStreamingMultiTblTask()with the pre-ALTER converted map.ALTER JOB ... FROM MYSQL ("snapshot_split_size" = ...)is still a silent runtime no-op. Snapshot splits are materialized once at job creation and then restored fromstreaming_job_meta; resume/restart never rebuilds them from the altered value, so the new property only changes the displayed/persisted metadata.
Critical checkpoint conclusions:
- Goal / correctness: The PR fixes restart persistence of
sourceProperties/targetPropertiesand improves runtime propagation on the active FE, but it still misses replay/failover cache invalidation and still accepts at least one source-property ALTER that does not affect runtime. - Scope / minimality: The change is focused, but the new FE-side rejection list is still incomplete for source properties that cannot actually take effect on an existing job.
- Concurrency: I did not find a new lock-order or blocking-under-lock issue in the touched FE path; the main remaining risk is replay/lifecycle state rather than thread safety.
- Lifecycle / persistence: EditLog write/replay coverage exists for the new maps, but replay is not yet equivalent to master-side runtime state because the transient converted-source cache is left stale.
- Parallel paths: The non-TVF runtime path is only partially covered; the precomputed snapshot-split path still ignores some ALTERed source properties.
- Tests: The new regression tests cover FE restart persistence and credential rotation, but there is still no coverage for replay on an already-initialized FE object / failover, and the restart test currently uses a source property whose runtime semantics are still unchanged.
- Observability: Existing logs are sufficient for the touched path; I do not see a new metrics gap here.
I did not run the full FE / regression test suite in this review environment.
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
|
/review |
|
run buildall |
There was a problem hiding this comment.
Findings
- Blocking: the new "reject silent no-op source keys" validation only covers non-TVF
FROM MYSQL/POSTGRES ... TO DATABASEjobs. The parallelcdc_stream(...)SQL path still allows keys such assnapshot_parallelism,snapshot_split_size, andtable.<tbl>.exclude_columns, even though those values are also cached only at BESourceReader/ deserializer initialization time and are not refreshed by SQL ALTER. Because the job-driven TVF path reuses the same BE reader infs_brokers/cdc_client/common/Env.jobContextsandfetchRecordStream()does not close it for numeric job IDs, these ALTERs remain silent no-ops. - Medium:
replayOnUpdated()now invalidatesbaseCommand/originTvfProps, but it still leaves the existingencryptedSqlcache untouched. SinceJobManager.replayUpdateJob()mutates the existing in-memory job object instead of replacing it, a follower/new master can keep showing the pre-ALTER SQL inSHOW JOBSeven thoughexecuteSqlhas been replayed.
Critical Checkpoints
- Goal of current task: Partially met. The PR fixes non-TVF source/target property replay and the runtime credential-refresh path, but it still leaves a parallel TVF ALTER path where first-initialize-only source properties are accepted yet ineffective.
- Change size/focus: Small and focused.
- Concurrency: No new lock-order or deadlock issue found in the modified flow.
- Lifecycle / persistence: Not fully equivalent on replay because the SQL-derived
encryptedSqlcache is not refreshed with the replayedexecuteSql. - Parallel code paths: Not fully covered. The new no-op-property rejection is implemented only for the non-TVF path; the
cdc_streamSQL path still misses the same protection. - Test coverage: Improved for non-TVF runtime/persistence, but still missing coverage for SQL/TVF ALTER of first-initialize-only properties and for follower/failover SQL display after replay.
- Observability / performance: No additional concerns beyond the correctness items above.
| } | ||
|
|
||
| // Reject keys that the runtime reads only at first initialize and never refreshes, | ||
| // so ALTER would be a silent no-op. See JdbcSourceOffsetProvider / DebeziumJsonDeserializer. |
There was a problem hiding this comment.
This new guard only covers the non-TVF FROM ... TO DATABASE path. The parallel cdc_stream(...) SQL path in checkUnmodifiableProperties() still allows snapshot_parallelism, snapshot_split_size, and table.<tbl>.exclude_columns, even though those are also read only during BE reader/deserializer initialization. The job-driven TVF path reuses the same SourceReader in fs_brokers/cdc_client/common/Env.jobContexts, and fetchRecordStream() does not close it for numeric job IDs, so a SQL ALTER of those keys is still a silent no-op there.
| } | ||
| setExecuteSql(replayJob.getExecuteSql()); | ||
| // SQL-derived caches must be invalidated together so the next parse uses the replayed SQL. | ||
| this.baseCommand = null; |
There was a problem hiding this comment.
baseCommand and originTvfProps are invalidated here, but encryptedSql is another SQL-derived cache on the current in-memory job. Because JobManager.replayUpdateJob() mutates the existing object instead of replacing it, leaving encryptedSql untouched means a follower/new master can keep returning the pre-ALTER SQL from getShowSQL() after replay/failover. This cache should be cleared or copied from replayJob together with executeSql.
|
run external |
|
run vault_p0 |
|
run external |
…sistence after FE restart Regression test to verify that sourceProperties and targetProperties altered via ALTER JOB survive FE restart. Expected to fail until StreamingInsertJob.replayOnUpdated() is fixed to sync both property maps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…playOnUpdated After ALTER JOB changes source/target properties, replayOnUpdated() did not propagate sprops/tprops into the replaying job. The checkpoint thread replays journals on its own Env, so the generated image also lost the ALTERed values — any subsequent FE restart (from image or EditLog) would revert to the original CREATE-time properties. For CDC jobs this can cause follower failover to use stale JDBC credentials or miss updated load options such as max_filter_ratio. Also tighten the regression test's ALTER JOB to only specify the changed properties. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ffsetProvider Non-TVF JDBC jobs kept an independent sourceProperties copy on the offset provider seeded by initSourceJob, so ALTER JOB credential changes (e.g. password rotation) never reached the provider's BE RPCs. Cache the converted form of sourceProperties on the job and push it to the provider via ensureInitialized on every fetchMeta / cleanup cycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rived caches on replay - AlterJobCommand: reject ALTER on snapshot_split_size (only read by the initial splitChunks pass; subsequent restart paths restore persisted splits via replayIfNeed and never re-split, so the ALTER would only change the persisted metadata, not the running split layout). - StreamingInsertJob.replayOnUpdated: invalidate the transient convertedSourceProperties, baseCommand, and originTvfProps caches so that a formerly-promoted FE that is replaying this ALTER as a follower does not keep returning the pre-ALTER values if it is promoted back without restart. - Regression: swap test_streaming_mysql_job_alter_props_restart_fe from snapshot_split_size to altering `user`; ALTER the user back to root before RESUME so the job can authenticate upstream MySQL post-restart. - Regression: add snapshot_split_size reject case in test_streaming_mysql_job_create_alter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7438a81 to
ba3c76d
Compare
|
run buildall |
|
run external |
FE Regression Coverage ReportIncrement line coverage |
What problem does this PR solve?
ALTER JOB on a non-TVF streaming CDC job could silently fail in two ways:
replayOnUpdateddidn't syncsourceProperties/targetProperties, so the values were lost after FE checkpoint/restart or on follower FE.JdbcSourceOffsetProviderheld an independent copy ofsourcePropertiesseeded at init time. ALTER on credentials (password, driver_url, etc.) never reached the provider's BE RPCs — the job would keep trying the old credentials.Fix
replayOnUpdatednow copiessourceProperties/targetPropertiesfrom the replay job.convertedSourceProperties(transient) that's refreshed on ALTER; all provider and task entry points (fetchMeta/cleanup/createStreamingMultiTblTask) push it to the provider via a newensureInitializedoverride.Tests
test_streaming_mysql_job_alter_props_restart_fe.groovy— ALTER props survive FE restart.test_streaming_mysql_job_alter_cred.groovy— ALTER password with wrong credentials pauses the job; ALTER back recovers without FE restart.Release note
Fix streaming job ALTER JOB source/target properties not taking effect at runtime and being lost after FE checkpoint/restart.
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)