[PD] Fix PD interaction and error response #7500
Thanks for your contribution!
Pull request overview
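This PR aims to improve PD interaction and error responses in Splitwise (Prefill/Decode disaggregated) mode: when the decode side runs out of resources or is preempted, a "PD Error" is propagated upstream more consistently, and the Router attempts a retry and returns a friendlier error response.
Changes:
- The Router gains preempt-retry capability in splitwise mode, with corresponding new CLI arguments.
- Error propagation along the PD path is unified and strengthened: the decode->prefill cache_sync sending logic, plus error codes and error messages on the engine and output sides.
- The OpenAI protocol layer extends finish_reason; on error responses the API layer returns already-generated content where possible and marks it pd_reschedule.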
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
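| fastdeploy/splitwise/splitwise_connector.py | Adjusts the aggregation/sending logic for cache_sync from decode to prefill, widening error-notification coverage. |
| fastdeploy/router/router.py | Adds preempt-retry parameters and the main retry flow (optionally excluding the previous decode instance). |
| fastdeploy/output/token_processor.py | Adjusts the error code and error message when prefill fails to send cache (introduces the "PD Error" wording). |
| fastdeploy/input/base_processor.py | Skips token decoding on error responses and passes them straight upstream for handling. |
| fastdeploy/entrypoints/openai/serving_chat.py | Fills in outputs and returns already-generated text in non-streaming error cases; adds the pd_reschedule finish_reason check. |
| fastdeploy/entrypoints/openai/protocol.py | Extends the allowed OpenAI protocol finish_reason values with pd_reschedule. |
| fastdeploy/engine/common_engine.py | Adjusts PD-related error logs, error-response messages, and error codes (including the preempted case). |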
Codecov Report

| Metric | develop | #7500 |
|---|---|---|
| Coverage | ? | 72.23% |
| Files | ? | 419 |
| Lines | ? | 57845 |
| Branches | ? | 9072 |
| Hits | ? | 41785 |
| Misses | ? | 13210 |
| Partials | ? | 2850 |

    raise ValueError("{}".format(data["error_msg"]))
    idx = int(data["request_id"].split("_")[-1])
    # api_server_logger.debug(f"Client {request_id} received: {data}")
    if data.get("error_code", 200) != 200:

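- Decoding directly to text here is risky. The normal decoding path handles garbled output (a single token may decode to garbage on its own and only decodes correctly as part of a continuous sequence).
- Only serving_chat was changed; the completion interface has not been adapted.

Suggestion: do not decode here; return directly instead.
- The request should carry an identifier showing that it is being rescheduled; finish_reason can serve that purpose.
- Return without decoding, i.e. text="", and add completion_token_ids to the response.
- When the Router module receives such a response:
  - Generate a new request (same structure and content as the original, whether chat or completion).
  - Add a generated_token_ids field to the request (set to the received completion_token_ids; all multimodal and internal models already support this, and @ liyukun will submit open-source model support shortly).

This keeps both interfaces compatible while reusing the existing internal logic.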
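
The Router-side flow proposed above could look roughly like the sketch below. It is only an illustration under stated assumptions: `build_retry_request` is a hypothetical helper, and the response layout (`choices[0]` carrying `finish_reason` and `completion_token_ids`) is assumed. Only the field names `completion_token_ids` and `generated_token_ids` and the `pd_reschedule` finish reason come from this discussion.

```python
import copy
from typing import Any, Dict, Optional


def build_retry_request(
    original_request: Dict[str, Any], response: Dict[str, Any]
) -> Optional[Dict[str, Any]]:
    """Return a new request for rescheduling, or None if no retry is needed.

    Hypothetical helper: the response layout used here is an assumption.
    """
    choices = response.get("choices") or []
    if not choices or choices[0].get("finish_reason") != "pd_reschedule":
        return None

    # Same structure and content as the original request (chat or completion).
    retry_request = copy.deepcopy(original_request)
    # Carry over the already-generated tokens as IDs, without decoding them.
    retry_request["generated_token_ids"] = list(
        choices[0].get("completion_token_ids") or []
    )
    return retry_request
```

Because the helper copies the original request and only adds one field, the same code path would serve chat and completion requests alike, which is the compatibility argument made in the comment.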
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
fastdeploy/engine/sched/resource_manager_v1.py:1520
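- preallocate_resource_in_d no longer writes the request into tasks_list; only add_prefilled_request does. As a result, while the decode side has allocated blocks but has not yet received the first token from prefill, statistics such as update_metrics/available_gpu_block_num miss these occupied blocks, because the current implementation collects block_tables only from tasks_list. Monitoring metrics may therefore look significantly too optimistic and mislead anyone troubleshooting resource problems. Suggestion: at minimum, aggregate block_tables from self.requests (or another collection that covers preallocated requests) when computing metrics, or record the occupancy during preallocation so that the metrics stay accurate.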
request.block_tables = self._allocate_gpu_blocks(request, need_prealloc_prefill_blocks)
request.num_computed_tokens = request.need_prefill_tokens
request.disaggregate_info["block_tables"] = request.block_tables
allocated_position = self.get_available_position()
request.idx = allocated_position
self.stop_flags[request.idx] = False
self.requests[request.request_id] = request
self.req_dict[request.request_id] = allocated_position
return True
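
To make the metrics concern concrete, here is a simplified sketch that is not FastDeploy code. `Request` and `used_block_count` are hypothetical stand-ins; the only point is that counting blocks from a `tasks_list`-like collection misses a request that so far exists only in a `requests`-like map.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Request:
    # Hypothetical stand-in for the real request object.
    request_id: str
    block_tables: List[int] = field(default_factory=list)


def used_block_count(requests: Iterable[Optional[Request]]) -> int:
    """Count distinct GPU blocks held by the given requests (None slots are skipped)."""
    used = set()
    for req in requests:
        if req is not None:
            used.update(req.block_tables)
    return len(used)


# A preallocated request: tracked in `requests`, not yet placed in `tasks_list`.
preallocated = Request("req_0", block_tables=[0, 1, 2])
running = Request("req_1", block_tables=[3, 4])

requests: Dict[str, Request] = {"req_0": preallocated, "req_1": running}
tasks_list: List[Optional[Request]] = [None, running]

blocks_from_tasks = used_block_count(tasks_list)             # misses req_0's blocks
blocks_from_requests = used_block_count(requests.values())   # includes req_0's blocks
```

In this toy setup the two counts differ by exactly the blocks held by the preallocated request, which is the gap the comment warns about.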
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-04-24 15:32:20
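📋 Review Summary
PR overview: Fixes several issues in PD-disaggregated inference, including the timing of tasks_list registration on the decode side, unified error messages, automatic retry of non-streaming requests on preemption, and related serving/connector logic fixes.
Scope of changes: fastdeploy/engine/, fastdeploy/entrypoints/openai/, fastdeploy/router/, fastdeploy/splitwise/, fastdeploy/input/, fastdeploy/output/
Impact tags: [PD Disaggregation] [Engine] [APIServer]
📝 PR Convention Check
The title uses the unofficial tag [PD]; the closest official tags are [BugFix] (the PR is mainly bug fixes) or [PD Disaggregation]. The PR description is missing the Usage or Command, Accuracy Tests, and Checklist sections.
Suggested title (can be copied as-is):
[BugFix] Fix PD interaction race condition and error response handling
Suggested PR description (can be copied as-is):
## Motivation
Fix several issues in PD-disaggregated inference:
1. tasks_list is registered too early on the decode side, so a request can be handled by batch output processing before prefill completes, triggering a null-pointer (None) error;
2. Error messages are inconsistent, making it hard for the Router layer to identify retryable errors;
3. Non-streaming requests lack an automatic retry mechanism when decode is preempted;
4. send_cache_info_to_prefill in splitwise_connector has a logic error: the P instance is not notified promptly after resources run out.
## Modifications
1. `resource_manager_v1.py`: defer `tasks_list` registration from `preallocate_resource_in_d` to `add_prefilled_request`, and add a None check in `_process_batch_output`;
2. `common_engine.py` / `token_processor.py`: unify the error-message prefix as `PD Error`; the Router and Serving layers use it to identify PD errors and set `finish_reason=pd_reschedule`;
3. `router.py`: add the `preempt_retry_count` / `preempt_retry_exclude_decode` parameters; non-streaming requests are retried automatically when decode is preempted;
4. `serving_chat.py` / `serving_completion.py`: the error path keeps already-generated tokens and returns a partial result instead of raising an exception;
5. `splitwise_connector.py`: fix the `send_cache_info_to_prefill` logic so that the P instance is notified promptly even when resources run out.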
## Usage or Command
N/A
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
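- [x] Add unit tests. Please write the reason in this PR if no unit tests.

Issues
| Severity | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/entrypoints/openai/serving_chat.py:661 | On the error path, completion_tokens is an empty string, inconsistent with its type on the normal path |
| 🟡 Suggestion | fastdeploy/entrypoints/openai/serving_completion.py:331 | On the error path, completion_tokens is an empty string, inconsistent with its type on the normal path |

Overall Assessment
This PR fixes several key issues in the PD-disaggregated scenario (tasks_list registration timing, None check, unified error messages, connector notification logic). The overall design is sound, the retry mechanism and the completion_token_ids propagation path are well designed, test coverage is fairly complete, and there are no blocking issues. Two details about completion_tokens type consistency are worth improving.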

    # Error response - include already-generated tokens in the response
    data["outputs"] = {
        "text": "",
        "completion_tokens": "",

🟡 Suggestion: completion_tokens is the empty string ""; it should be None or 0.
When finish_reason == "pd_reschedule", the response returns completion_token_ids (the token list) while completion_tokens (the token count) is "". The two are semantically inconsistent and may mislead API callers.
Suggested change:

    "completion_tokens": None,

    raise ValueError("{}".format(data["error_msg"]))
    data["outputs"] = {
        "text": "",
        "completion_tokens": "",

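🟡 Suggestion: completion_tokens is the empty string ""; it should be None or 0.
On the error path, completion_tokens in the returned outputs has an inconsistent type (the normal path uses a number or None). Suggest standardizing on None:

    "completion_tokens": None,
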
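
One way to keep the error-path payload type-consistent is sketched below. `build_error_outputs` is a hypothetical helper rather than a function from this PR, and placing `completion_token_ids` inside `outputs` is an assumption drawn from the discussion above. The sketch only shows `completion_tokens` staying `None` instead of `""`.

```python
from typing import Any, Dict, List, Optional


def build_error_outputs(
    generated_token_ids: Optional[List[int]] = None,
) -> Dict[str, Any]:
    """Build the `outputs` payload for an error response (hypothetical helper)."""
    return {
        "text": "",  # no decoding on the error path
        "completion_tokens": None,  # None rather than "", matching the normal path's type
        "completion_token_ids": list(generated_token_ids or []),
    }
```

Sharing one such helper between the chat and completion handlers would also avoid the two call sites drifting apart.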
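Motivation
Fix several issues in PD-disaggregated inference scenarios.
Modifications
- Defer `tasks_list` registration from `preallocate_resource_in_d` to `add_prefilled_request`, together with a None check in `_process_batch_output`, so that useless output generated by the worker does not cause errors;
- Unify the error-message prefix as `PD Error`; the Router and Serving layers use it to identify PD errors and set `finish_reason=pd_reschedule`, which lets the router support rescheduling;
- Add the `preempt_retry_count` / `preempt_retry_exclude_decode` parameters; non-streaming requests are retried automatically when decode is preempted;
- The `serving_chat` error path keeps already-generated tokens and returns a partial result instead of raising an exception;
- `splitwise_connector` fixes a logic error in `send_cache_info_to_prefill` that kept the P instance from being notified promptly after resources ran out.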
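
The retry behavior described above might be sketched as follows. This is not the PR's `router.py`: `send_with_preempt_retry`, `pick_decode`, `send`, and the `error_msg` response field are hypothetical names chosen for illustration. Only the `PD Error` prefix and the two parameter names come from the description.

```python
from typing import Any, Callable, Dict, Optional, Set

PD_ERROR_PREFIX = "PD Error"


def is_pd_error(response: Dict[str, Any]) -> bool:
    """Treat a response as retryable when its error message starts with the PD prefix."""
    return str(response.get("error_msg") or "").startswith(PD_ERROR_PREFIX)


def send_with_preempt_retry(
    request: Dict[str, Any],
    pick_decode: Callable[[Set[str]], Optional[str]],
    send: Callable[[Dict[str, Any], str], Dict[str, Any]],
    preempt_retry_count: int = 1,
    preempt_retry_exclude_decode: bool = True,
) -> Dict[str, Any]:
    """Send a non-streaming request, retrying on PD errors up to preempt_retry_count times."""
    excluded: Set[str] = set()
    # Placeholder result used only if no decode instance can be picked at all.
    response: Dict[str, Any] = {"error_msg": "no decode instance available"}
    for _ in range(preempt_retry_count + 1):  # first attempt plus retries
        decode = pick_decode(excluded)
        if decode is None:
            break
        response = send(request, decode)
        if not is_pd_error(response):
            break
        if preempt_retry_exclude_decode:
            excluded = {decode}  # skip the instance that just failed on the next pick
    return response
```

Matching on a message prefix keeps the Router decoupled from engine internals, at the cost of depending on the exact error wording staying stable.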