Skip to content

[XPU] improce attn precision#7515

Open
lizan1999 wants to merge 1 commit into
PaddlePaddle:developfrom
lizan1999:imporve_attn_precision
Open

[XPU] improce attn precision#7515
lizan1999 wants to merge 1 commit into
PaddlePaddle:developfrom
lizan1999:imporve_attn_precision

Conversation

@lizan1999
Copy link
Copy Markdown
Contributor

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

💡 如若此PR是Cherry Pick,PR标题需遵循格式,在最开始加上[Cherry-Pick]标签,以及最后面加上原PR ID,例如[Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

improve speculative_attention_decoder TGEMM from tfloat32 to float
As for flash_attention_context_vllm and paged_attention_xft, use the dedicated version of vLLM directly.

Usage or Command

Accuracy Tests

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot
Copy link
Copy Markdown

paddle-bot Bot commented Apr 20, 2026

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the XPU label Apr 20, 2026
PaddlePaddle-bot

This comment was marked as outdated.

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


lizan1999 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

PaddlePaddle-bot

This comment was marked as outdated.

Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review | 2026-04-24 14:24:48

📋 Review 摘要

PR 概述:针对 XPU 后端提升注意力计算精度,将 speculative_attention_decoder 的 TGEMM 类型从 tfloat32 改为 float,并将 spliced 路径下的 NEOX RoPE 替换为 infer_ops::rotary_embedding 新接口。
变更范围custom_ops/xpu_ops/src/ops/fastdeploy/model_executor/layers/
影响面 Tag[XPU] [OP] [Optimization]


📝 PR 规范检查

标题存在拼写错误improce 应为 improve;建议同时补充具体优化类型的 Tag;Motivation 章节未填写Accuracy Tests 章节为空——本 PR 涉及精度类型变更(tfloat32float),必须提供精度对比数据。

标题建议(可直接复制):

  • [XPU][Optimization] Improve XPU attention precision (TGEMM tfloat32 -> float)

描述建议(Motivation 可直接复制):

The speculative_attention_decoder uses tfloat32 for TGEMM accumulation when XPU_XType == XPU_CType, which leads to precision loss. This PR changes TGEMM to float to improve numerical accuracy. Additionally, the NEOX RoPE in the spliced kvcache path is replaced with the newer infer_ops::rotary_embedding API for better compatibility with vLLM-dedicated implementations.


问题

级别 文件 概述
🟡 建议 block_attn_spliced.cc:305-319 encoder neox 路径:do_host2device 和两个 copy 调用均缺少 PD_CHECK
🟡 建议 block_attn_spliced.cc:645-658 decoder neox 路径:copy×2 和 cast 调用均缺少 PD_CHECK
🟡 建议 block_attn_spliced.cc:297,346,660 调试用 set_debug_level 注释代码遗留,应在合并前清理
🟡 建议 block_attn_spliced.cc:320-327 xpu_memcpy 注释代码遗留,应删除
🟡 建议 block_attn_spliced_bf16.txt 将 C++ 源码以 .txt 格式提交,疑似调试产物,应删除或改为正式 .cc 文件
❓ 疑问 block_attn_spliced.cc:299-308 encoder spliced 路径生成全局顺序 positions [0..token_num-1],多 batch 场景下 positions 是否仍正确?
❓ 疑问 block_attn_spliced.cc:1055,1170,1211 TR 模板参数由硬编码 float 改为 XPU_XType,需确认调用侧传入的 rotary_embs 张量类型与 XPU_XType 一致

总体评价

精度改进方向正确,TGEMM 从 tfloat32 换为 float 有明确收益。但新的 NEOX RoPE 路径中存在多处 API 返回值未检查、调试代码/注释未清理,以及一个疑似误提交的 .txt 源文件,建议在合并前一并处理,并补充必要的精度对比测试结果。

PD_CHECK(ret == api::SUCCESS, "vsl_rotary_embedding_gptj failed.");
} else {
ret = infer_ops::vsl_rotary_embedding_neox<TQKV, TR, TID>(
// xpu_ctx->set_debug_level(0xa1);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 调试代码遗留

// xpu_ctx->set_debug_level(0xa1); 是调试专用代码,同文件中还有另外两处相同注释(encoder 末尾和 decoder 内)。建议在合并前统一删除,避免误导后续维护者。

ret = api::do_host2device(xpu_ctx,
positions_host.data(),
positions_tensor.data<int64_t>(),
token_num * sizeof(int64_t));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 api::do_host2device 返回值未检查

此处 ret 被直接覆盖,若 H2D 传输失败会静默忽略错误,导致后续 rotary_embedding 使用未初始化的 positions 数据。请参考文件其他位置的写法,补充:

PD_CHECK(ret == api::SUCCESS, "api::do_host2device failed.");

xpu_ctx,
reinterpret_cast<const TQKV*>(q_split.data()),
const_cast<TQKV*>(reinterpret_cast<const TQKV*>(q.data())),
token_num * q_num_heads * head_dim);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 api::copy<TQKV>(Q)返回值未检查

连续两次 api::copy 调用(Q 和 K)均未加 PD_CHECK,若拷贝失败会导致 rotary_embedding 使用错误数据。建议在每次调用后补充:

PD_CHECK(ret == api::SUCCESS, "api::copy q failed.");

reinterpret_cast<const TQKV*>(k_split.data()),
reinterpret_cast<const float*>(rotary_embs.data<float>()),
const_cast<TQKV*>(reinterpret_cast<const TQKV*>(k.data())),
token_num * real_kv_num_heads * head_dim);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 api::copy<TQKV>(K)返回值未检查

同上,补充:

PD_CHECK(ret == api::SUCCESS, "api::copy k failed.");

reinterpret_cast<const float*>(rotary_embs.data<float>()),
const_cast<TQKV*>(reinterpret_cast<const TQKV*>(k.data())),
token_num * real_kv_num_heads * head_dim);
// xpu_memcpy(const_cast<void*>(q.data()),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 注释掉的 xpu_memcpy 代码应删除

共 8 行 // xpu_memcpy(...) 注释代码是实现过程中的遗留物,已被上方的 api::copy 取代。请在合并前删除,保持代码整洁。

paddle::empty({token_num}, paddle::DataType::INT64, qkv.place());
ret = api::copy<TQKV>(
xpu_ctx,
reinterpret_cast<const TQKV*>(q_split.data()),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 decoder 路径 api::copy<TQKV> 返回值未检查

decoder neox 路径中连续两次 api::copy(Q/K)和一次 api::cast<int, int64_t> 均未加 PD_CHECK,与文件其余代码风格不一致。建议每次调用后添加错误检查。

const_cast<TQKV*>(reinterpret_cast<const TQKV*>(k.data())),
token_num * real_kv_num_heads * head_dim);
ret = api::cast<int, int64_t>(xpu_ctx,
start_tokens.xpu,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 api::cast<int, int64_t> 返回值未检查

start_tokens.xpuint64_t 的 cast 失败同样会导致 positions 数据错误,建议补充:

PD_CHECK(ret == api::SUCCESS, "api::cast start_tokens failed.");

@@ -0,0 +1,1913 @@
// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 建议 疑似调试产物,不应提交

此文件包含 1913 行 C++ 源码,但扩展名为 .txt,无法参与编译,疑似开发过程中的备份或草稿文件。请确认:

  1. 若是正式代码,应使用 .cc 扩展名并纳入 CMakeLists;
  2. 若是临时调试产物,请在合并前删除。

@PaddlePaddle-bot
Copy link
Copy Markdown

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-04-28 17:28:27

CI报告基于以下代码生成(30分钟更新一次):


1 任务总览

所有 required 任务通过(当前无 required 任务配置),可选任务存在 1 个失败,不阻塞合并。

总执行(rerun次数) 总任务 ✅ 通过 ❌ 失败 ⏳ 运行中 ⏸️ 等待中 跳过
2(0) 2 1 1 0 0 0

2 任务状态汇总

2.1 Required任务 : 0/0 通过

当前 PR 无必选任务配置(Branch Protection Rules 未设置 required checks),无阻塞合并的必选任务。

2.2 可选任务 — 1/2 通过

可选任务不阻塞合并,失败仅供参考。

状态 任务 耗时 日志 重跑
Trigger Jenkins for PR 53s Job -
其余 1 个可选任务通过 - - -

3 失败详情(仅 required)

无 required 失败任务。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants