[XPU] improce attn precision#7515
Conversation
|
Thanks for your contribution! |
d48b804 to
8b1055f
Compare
|
lizan1999 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
8b1055f to
1ecb2db
Compare
PaddlePaddle-bot
left a comment
There was a problem hiding this comment.
🤖 AI Code Review |
2026-04-24 14:24:48
📋 Review 摘要
PR 概述:针对 XPU 后端提升注意力计算精度,将 speculative_attention_decoder 的 TGEMM 类型从 tfloat32 改为 float,并将 spliced 路径下的 NEOX RoPE 替换为 infer_ops::rotary_embedding 新接口。
变更范围:custom_ops/xpu_ops/src/ops/、fastdeploy/model_executor/layers/
影响面 Tag:[XPU] [OP] [Optimization]
📝 PR 规范检查
标题存在拼写错误,improce 应为 improve;建议同时补充具体优化类型的 Tag;Motivation 章节未填写,Accuracy Tests 章节为空——本 PR 涉及精度类型变更(tfloat32 → float),必须提供精度对比数据。
标题建议(可直接复制):
[XPU][Optimization] Improve XPU attention precision (TGEMM tfloat32 -> float)
描述建议(Motivation 可直接复制):
The speculative_attention_decoder uses
tfloat32for TGEMM accumulation whenXPU_XType == XPU_CType, which leads to precision loss. This PR changes TGEMM tofloatto improve numerical accuracy. Additionally, the NEOX RoPE in the spliced kvcache path is replaced with the newerinfer_ops::rotary_embeddingAPI for better compatibility with vLLM-dedicated implementations.
问题
| 级别 | 文件 | 概述 |
|---|---|---|
| 🟡 建议 | block_attn_spliced.cc:305-319 |
encoder neox 路径:do_host2device 和两个 copy 调用均缺少 PD_CHECK |
| 🟡 建议 | block_attn_spliced.cc:645-658 |
decoder neox 路径:copy×2 和 cast 调用均缺少 PD_CHECK |
| 🟡 建议 | block_attn_spliced.cc:297,346,660 |
调试用 set_debug_level 注释代码遗留,应在合并前清理 |
| 🟡 建议 | block_attn_spliced.cc:320-327 |
旧 xpu_memcpy 注释代码遗留,应删除 |
| 🟡 建议 | block_attn_spliced_bf16.txt |
将 C++ 源码以 .txt 格式提交,疑似调试产物,应删除或改为正式 .cc 文件 |
| ❓ 疑问 | block_attn_spliced.cc:299-308 |
encoder spliced 路径生成全局顺序 positions [0..token_num-1],多 batch 场景下 positions 是否仍正确? |
| ❓ 疑问 | block_attn_spliced.cc:1055,1170,1211 |
TR 模板参数由硬编码 float 改为 XPU_XType,需确认调用侧传入的 rotary_embs 张量类型与 XPU_XType 一致 |
总体评价
精度改进方向正确,TGEMM 从 tfloat32 换为 float 有明确收益。但新的 NEOX RoPE 路径中存在多处 API 返回值未检查、调试代码/注释未清理,以及一个疑似误提交的 .txt 源文件,建议在合并前一并处理,并补充必要的精度对比测试结果。
| PD_CHECK(ret == api::SUCCESS, "vsl_rotary_embedding_gptj failed."); | ||
| } else { | ||
| ret = infer_ops::vsl_rotary_embedding_neox<TQKV, TR, TID>( | ||
| // xpu_ctx->set_debug_level(0xa1); |
There was a problem hiding this comment.
🟡 建议 调试代码遗留
// xpu_ctx->set_debug_level(0xa1); 是调试专用代码,同文件中还有另外两处相同注释(encoder 末尾和 decoder 内)。建议在合并前统一删除,避免误导后续维护者。
| ret = api::do_host2device(xpu_ctx, | ||
| positions_host.data(), | ||
| positions_tensor.data<int64_t>(), | ||
| token_num * sizeof(int64_t)); |
There was a problem hiding this comment.
🟡 建议 api::do_host2device 返回值未检查
此处 ret 被直接覆盖,若 H2D 传输失败会静默忽略错误,导致后续 rotary_embedding 使用未初始化的 positions 数据。请参考文件其他位置的写法,补充:
PD_CHECK(ret == api::SUCCESS, "api::do_host2device failed.");| xpu_ctx, | ||
| reinterpret_cast<const TQKV*>(q_split.data()), | ||
| const_cast<TQKV*>(reinterpret_cast<const TQKV*>(q.data())), | ||
| token_num * q_num_heads * head_dim); |
There was a problem hiding this comment.
🟡 建议 api::copy<TQKV>(Q)返回值未检查
连续两次 api::copy 调用(Q 和 K)均未加 PD_CHECK,若拷贝失败会导致 rotary_embedding 使用错误数据。建议在每次调用后补充:
PD_CHECK(ret == api::SUCCESS, "api::copy q failed.");| reinterpret_cast<const TQKV*>(k_split.data()), | ||
| reinterpret_cast<const float*>(rotary_embs.data<float>()), | ||
| const_cast<TQKV*>(reinterpret_cast<const TQKV*>(k.data())), | ||
| token_num * real_kv_num_heads * head_dim); |
There was a problem hiding this comment.
🟡 建议 api::copy<TQKV>(K)返回值未检查
同上,补充:
PD_CHECK(ret == api::SUCCESS, "api::copy k failed.");| reinterpret_cast<const float*>(rotary_embs.data<float>()), | ||
| const_cast<TQKV*>(reinterpret_cast<const TQKV*>(k.data())), | ||
| token_num * real_kv_num_heads * head_dim); | ||
| // xpu_memcpy(const_cast<void*>(q.data()), |
There was a problem hiding this comment.
🟡 建议 注释掉的 xpu_memcpy 代码应删除
共 8 行 // xpu_memcpy(...) 注释代码是实现过程中的遗留物,已被上方的 api::copy 取代。请在合并前删除,保持代码整洁。
| paddle::empty({token_num}, paddle::DataType::INT64, qkv.place()); | ||
| ret = api::copy<TQKV>( | ||
| xpu_ctx, | ||
| reinterpret_cast<const TQKV*>(q_split.data()), |
There was a problem hiding this comment.
🟡 建议 decoder 路径 api::copy<TQKV> 返回值未检查
decoder neox 路径中连续两次 api::copy(Q/K)和一次 api::cast<int, int64_t> 均未加 PD_CHECK,与文件其余代码风格不一致。建议每次调用后添加错误检查。
| const_cast<TQKV*>(reinterpret_cast<const TQKV*>(k.data())), | ||
| token_num * real_kv_num_heads * head_dim); | ||
| ret = api::cast<int, int64_t>(xpu_ctx, | ||
| start_tokens.xpu, |
There was a problem hiding this comment.
🟡 建议 api::cast<int, int64_t> 返回值未检查
start_tokens.xpu → int64_t 的 cast 失败同样会导致 positions 数据错误,建议补充:
PD_CHECK(ret == api::SUCCESS, "api::cast start_tokens failed.");| @@ -0,0 +1,1913 @@ | |||
| // Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. | |||
There was a problem hiding this comment.
🟡 建议 疑似调试产物,不应提交
此文件包含 1913 行 C++ 源码,但扩展名为 .txt,无法参与编译,疑似开发过程中的备份或草稿文件。请确认:
- 若是正式代码,应使用
.cc扩展名并纳入 CMakeLists; - 若是临时调试产物,请在合并前删除。
CI报告基于以下代码生成(30分钟更新一次): 1 任务总览所有 required 任务通过(当前无 required 任务配置),可选任务存在 1 个失败,不阻塞合并。
2 任务状态汇总2.1 Required任务 : 0/0 通过
2.2 可选任务 — 1/2 通过
3 失败详情(仅 required)无 required 失败任务。 |
Motivation
Modifications
improve speculative_attention_decoder TGEMM from tfloat32 to float
As for flash_attention_context_vllm and paged_attention_xft, use the dedicated version of vLLM directly.
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.