fix(fully-async): respect partial_rollout=False when requeuing ABORTED groups#2092
Open
Kagura-0001 wants to merge 2 commits into
Open
fix(fully-async): respect partial_rollout=False when requeuing ABORTED groups#2092Kagura-0001 wants to merge 2 commits into
Kagura-0001 wants to merge 2 commits into
Conversation
…D groups In the standard path, sglang_rollout.py::abort() discards ABORTED samples when partial_rollout=False. _AsyncRolloutWorker._make_done_cb previously requeued ABORTED groups unconditionally, bypassing this contract and causing tokens generated under stale weights to enter training without off-policy masking. Guard the requeue with the same partial_rollout flag used in abort().
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When using
fully_async_rolloutwithpartial_rollout=False, ABORTED sample groups are unconditionally requeued back todata_buffer, bypassing thepartial_rolloutflag entirely.In the standard (non-async) path,
sglang_rollout.py::abort()already enforces this contract:_AsyncRolloutWorker._make_done_cbdoes not apply the same check, so fully-async effectively forcespartial_rollout=Truebehavior regardless of the user's configuration.Root Cause
When an ABORTED group is requeued with existing
sample.tokens(prompt + partial response), the next call togeneratereceives a non-emptysample.tokensand treats it as a prefix for continuation. Withpartial_rollout=Falseandmask_offpolicy_in_partial_rollout=False(the defaults), tokens generated under previous weight versions are appended without masking and participate in the loss.Observed symptom — a single sample accumulating tokens across three weight versions:
Fix
Guard the requeue with the same
partial_rolloutflag used inabort():Behavior after fix:
partial_rollout=False(default): ABORTED groups are dropped; the worker tops up from the buffer with a fresh sample, matching standard-path semantics.partial_rollout=True: unchanged — ABORTED groups are still requeued for continuation.Verification
Run any fully-async training job with
partial_rolloutunset (defaultFalse) and frequent enough weight updates to produce ABORTED samples. Inspect the sample dump:weight_versionsandfinish_reasonsstarting with"abort"weight_versionsentry