Skip to content

fix(fully-async): respect partial_rollout=False when requeuing ABORTED groups#2092

Open
Kagura-0001 wants to merge 2 commits into
THUDM:mainfrom
Kagura-0001:fix/fully-async-respect-partial-rollout
Open

fix(fully-async): respect partial_rollout=False when requeuing ABORTED groups#2092
Kagura-0001 wants to merge 2 commits into
THUDM:mainfrom
Kagura-0001:fix/fully-async-respect-partial-rollout

Conversation

@Kagura-0001

Copy link
Copy Markdown

Problem

When using fully_async_rollout with partial_rollout=False, ABORTED sample groups are unconditionally requeued back to data_buffer, bypassing the partial_rollout flag entirely.

In the standard (non-async) path, sglang_rollout.py::abort() already enforces this contract:

# sglang_rollout.py:376
if not args.partial_rollout:
    continue  # ABORTED samples are discarded, not collected

_AsyncRolloutWorker._make_done_cb does not apply the same check, so fully-async effectively forces partial_rollout=True behavior regardless of the user's configuration.

Root Cause

When an ABORTED group is requeued with existing sample.tokens (prompt + partial response), the next call to generate receives a non-empty sample.tokens and treats it as a prefix for continuation. With partial_rollout=False and mask_offpolicy_in_partial_rollout=False (the defaults), tokens generated under previous weight versions are appended without masking and participate in the loss.

Observed symptom — a single sample accumulating tokens across three weight versions:

weight_versions     = ["2", "4", "6"]
finish_reasons      = ["abort", "abort", "length"]
per-segment tokens  = [2, 2, 8]

Fix

Guard the requeue with the same partial_rollout flag used in abort():

if any(getattr(s, "status", None) == Sample.Status.ABORTED for s in result):
    if self.args.partial_rollout:  # align with sglang_rollout.py::abort()
        try:
            self.data_buffer.add_samples([result])
        except Exception:  # noqa: BLE001
            logger.exception("fully-async: failed to requeue aborted group")
    return

Behavior after fix:

  • partial_rollout=False (default): ABORTED groups are dropped; the worker tops up from the buffer with a fresh sample, matching standard-path semantics.
  • partial_rollout=True: unchanged — ABORTED groups are still requeued for continuation.

Verification

Run any fully-async training job with partial_rollout unset (default False) and frequent enough weight updates to produce ABORTED samples. Inspect the sample dump:

  • Before fix: samples show multiple entries in weight_versions and finish_reasons starting with "abort"
  • After fix: every completed sample has a single weight_versions entry

liuweisong and others added 2 commits June 16, 2026 15:40
…D groups

In the standard path, sglang_rollout.py::abort() discards ABORTED samples
when partial_rollout=False. _AsyncRolloutWorker._make_done_cb previously
requeued ABORTED groups unconditionally, bypassing this contract and causing
tokens generated under stale weights to enter training without off-policy
masking.

Guard the requeue with the same partial_rollout flag used in abort().
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant