Recover stuck service operations after transient DB failures#5011
Recover stuck service operations after transient DB failures#5011serdarozerr wants to merge 13 commits into
Conversation
…e operations Introduces a new periodic recovery job that scans permanently failed delayed_jobs and re-enqueues polling for service operations still in progress at the broker. Recovers cases where a transient DB connection error caused the polling job to fail permanently (max_attempts=1) while the broker operation was still running, leaving the service instance stuck in 'in progress' with no active poller.
The previous implementation queried dead delayed_jobs then performed separate lookups per row to find the pollable job, entity, and last operation state. Replace with a single 4-table join across service_instance_operations, service_instances, jobs, and delayed_jobs, filtering all conditions in one query
a3a10fe to
f37a14c
Compare
| @logger ||= Steno.logger('cc.background') | ||
| end | ||
|
|
||
| def batch_size |
There was a problem hiding this comment.
This could be a constant: BATCH_SIZE.
| module Runtime | ||
| class DelayedJobsRecover < VCAP::CloudController::Jobs::CCJob | ||
| def perform | ||
| logger.info('Recover halted delayed jobs') |
There was a problem hiding this comment.
I would prefer: "Recovering stuck delayed jobs"
| join(:delayed_jobs, guid: Sequel[:jobs][:delayed_job_guid]). | ||
| where(Sequel[:service_instance_operations][:state] => 'in progress'). | ||
| where(Sequel[:service_instance_operations][:type] => 'create'). | ||
| where { Sequel[:service_instance_operations][:created_at] > cutoff_time }. |
There was a problem hiding this comment.
You can use something similar to FailedJobsCleanup (only DB time is used):
where(Sequel.lit("#{Sequel[:service_instance_operations][:created_at]} > CURRENT_TIMESTAMP - INTERVAL '?' SECOND", default_maximum_duration_seconds.to_i))
| def max_attempts | ||
| 1 | ||
| end | ||
|
|
There was a problem hiding this comment.
Please also add a job_name_in_configuration method.
| # unwrap the serialized handler and re-enqueue via the reoccurring job's enqueue_next_job method | ||
| inner_job = Jobs::Enqueuer.unwrap_job(delayed.payload_object) | ||
| inner_job.send(:enqueue_next_job, pjob) | ||
| end |
There was a problem hiding this comment.
I think it would be good to add an explicit log for every re-enqueued job.
| end | ||
|
|
||
| def logger | ||
| @logger ||= Steno.logger('cc.background') |
There was a problem hiding this comment.
The logger should be more specific, e.g. cc.background.delayed-jobs-recover.
| PollableJobModel.db.transaction do | ||
| pjob = PollableJobModel.where(guid: pollable_guid, | ||
| delayed_job_guid: delayed.guid). | ||
| for_update.first |
There was a problem hiding this comment.
Maybe add skip_locked - so if two processes try to lock the same row, the first one succeeds and the second one does nothing.
| where(Sequel[:service_instance_operations][:state] => 'in progress'). | ||
| where(Sequel[:service_instance_operations][:type] => 'create'). | ||
| where { Sequel[:service_instance_operations][:created_at] > cutoff_time }. | ||
| where(Sequel[:jobs][:state] => [PollableJobModel::POLLING_STATE, PollableJobModel::FAILED_STATE]). |
There was a problem hiding this comment.
After giving this some additional thoughts, I think that we should not take failed jobs into account here. This is a terminal state that we already exposed to the user - bringing a "dead job back to life" would violate user expectations.
What we might want to have in addition is triggering orphan mitigation for the combination pollable job is failed and service instance operation is in progress.
|
When focusing on failed jobs where the pollable job is still |
| # Test up migration | ||
| expect(db.indexes(:jobs)).not_to include(:jobs_operation_state_index) | ||
| expect { Sequel::Migrator.run(db, migrations_path, target: current_migration_index, allow_missing_migration_files: true) }.not_to raise_error | ||
| expect(db.indexes(:jobs)).to include(:jobs_operation_state_index) |
There was a problem hiding this comment.
This should also work for postgres. Then you could get rid of the if.
| # pjob is FAILED with operation=service_instance.create, delayed_job has failed_at set. | ||
| # Override individual parameters to break a single filter and test exclusion. | ||
| def make_stuck_scenario( | ||
| sio_state: 'in progress', |
There was a problem hiding this comment.
Very Minor: I prefer explicit names like service_instance_state.
| module VCAP::CloudController | ||
| module Jobs | ||
| module Runtime | ||
| class DelayedJobsRecover < VCAP::CloudController::Jobs::CCJob |
There was a problem hiding this comment.
How about calling it DelayedServiceJobsRecover? Something to make clear that this is only for service related jobs.
| # All filter conditions are satisfied: sio is in progress/create/within cutoff, | ||
| # pjob is FAILED with operation=service_instance.create, delayed_job has failed_at set. | ||
| # Override individual parameters to break a single filter and test exclusion. | ||
| def make_stuck_scenario( |
There was a problem hiding this comment.
Minor: maybe prepare_stuck_service_instance fits better?
…failed polling jobs When a CC polling job permanently fails due to a transient error (e.g. a brief DB connection flip), the client is left with no path to a final state: the pollable job shows FAILED while the service instance operation remains stuck 'in progress' indefinitely. Previously this was addressed by reenqueuing the delayed job, but that approach was fragile and incomplete. This cleanup job detects stuck create operations whose polling job has permanently failed (delayed_jobs.failed_at IS NOT NULL) and resolves them by marking the operation and pollable job as failed and triggering OrphanMitigator to deprovision any broker-side resource, giving clients a definitive final state. Extends coverage to service bindings and service keys. Renames the class and file from DelayedJobsRecover to ServiceOperationsCreateInProgressCleanup to reflect the correct scope.
Problem
When CCDB becomes temporarily unreachable during a service broker polling cycle, the polling job fails permanently even though the broker is still processing the operation. This leaves the service instance stuck:
last_operation.state = 'in progress'(broker still working)pollable_job.state = FAILED(CC gave up)422 "operation in progress"Solution
Adds a new periodic scheduled job
DelayedJobsRecoverthat scansdelayed_jobsfor permanently failed rows and re-enqueues polling for any broker operation still in progress.Detection chain:
delayed_jobs.failed_at IS NOT NULL→ linkedpollable_jobin POLLING/FAILED state → service instancelast_operation.state = 'in progress'Recovery:
Deserializes the original handler from the dead
delayed_job(preserving@start_time,@maximum_duration,@first_time = false) and callsenqueue_next_job, the same path used by normal poll cycles. A row-levelFOR UPDATElock withdelayed_job_guidre-verification prevents double recovery when multiple CC instances run concurrently.I have reviewed the contributing guide
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the
mainbranchI have run all the unit tests using
bundle exec rakeI have run CF Acceptance Tests