[XPU] improce attn precision by lizan1999 · Pull Request #7515 · PaddlePaddle/FastDeploy

lizan1999 · 2026-04-20T07:49:17Z

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Change the speculative_attention_decoder TGEMM type from tfloat32 to float to improve precision.
For flash_attention_context_vllm and paged_attention_xft, use the vLLM-dedicated versions directly.

Usage or Command

Accuracy Tests

Checklist

Add at least one tag in the PR title.
- Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- You can add new tags based on the PR content, but the semantics must be clear.
Format your code, run pre-commit before commit.
Add unit tests. Please write the reason in this PR if no unit tests.
Provide accuracy results.
If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot · 2026-04-20T07:49:23Z

Thanks for your contribution!

CLAassistant · 2026-04-23T12:12:18Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

lizan1999 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

PaddlePaddle-bot

🤖 AI Code Review | 2026-04-24 14:24:48

📋 Review Summary

PR overview: improves attention precision on the XPU backend by changing the TGEMM type of speculative_attention_decoder from tfloat32 to float, and replaces the NEOX RoPE in the spliced path with the new infer_ops::rotary_embedding interface.
Scope of changes: custom_ops/xpu_ops/src/ops/, fastdeploy/model_executor/layers/
Affected tags: [XPU] [OP] [Optimization]

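📝 PR Convention Check

The title contains a typo: improce should be improve. Consider also adding a tag for the specific optimization type. The Motivation section is not filled in and the Accuracy Tests section is empty; because this PR changes a precision type (tfloat32 → float), accuracy comparison data must be provided.

Suggested title (can be copied as-is):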

[XPU][Optimization] Improve XPU attention precision (TGEMM tfloat32 -> float)

Suggested description (the Motivation text can be copied as-is):

The speculative_attention_decoder uses tfloat32 for TGEMM accumulation when XPU_XType == XPU_CType, which leads to precision loss. This PR changes TGEMM to float to improve numerical accuracy. Additionally, the NEOX RoPE in the spliced kvcache path is replaced with the newer infer_ops::rotary_embedding API for better compatibility with vLLM-dedicated implementations.
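
To make the precision argument concrete, here is a minimal host-side sketch. It assumes that tfloat32 behaves like the common TF32 layout (8 exponent bits, 10 explicit mantissa bits) and that the loss comes from rounding the operands; the actual XPU TGEMM semantics are not described in this PR. The helper names (`to_tf32_like`, `dot_fp32`, `dot_tf32_like`, `dot_ref`) are hypothetical and not part of any XPU or FastDeploy API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Round a finite float to 10 explicit mantissa bits (assumed TF32-like layout)
// using round-to-nearest-even on the 13 dropped bits. NaN/Inf are not handled.
float to_tf32_like(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  const uint32_t lsb = (bits >> 13) & 1u;  // lowest bit that is kept
  bits += 0x0FFFu + lsb;                   // round half to even
  bits &= ~0x1FFFu;                        // clear the 13 dropped bits
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}

// Reference dot product accumulated in double.
double dot_ref(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  double acc = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    acc += static_cast<double>(a[i]) * static_cast<double>(b[i]);
  }
  return acc;
}

// Dot product with full float operands and a float accumulator.
float dot_fp32(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  float acc = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) acc += a[i] * b[i];
  return acc;
}

// Dot product with operands rounded to the TF32-like format first.
float dot_tf32_like(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  float acc = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    acc += to_tf32_like(a[i]) * to_tf32_like(b[i]);
  }
  return acc;
}
```

For example, with 128 elements where every `a[i]` is 1 + 2^-12 and every `b[i]` is 1, the float path matches the double reference, while the reduced-mantissa path rounds each `a[i]` to 1 and loses the small component entirely. This is only an illustration of the mechanism; it does not replace the accuracy comparison requested above.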

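Issues

| Level | File | Summary |
| --- | --- | --- |
| 🟡 Suggestion | `block_attn_spliced.cc:305-319` | encoder neox path: `do_host2device` and both `copy` calls lack `PD_CHECK` |
| 🟡 Suggestion | `block_attn_spliced.cc:645-658` | decoder neox path: the two `copy` calls and the `cast` call lack `PD_CHECK` |
| 🟡 Suggestion | `block_attn_spliced.cc:297,346,660` | Commented-out `set_debug_level` debug code is left over and should be cleaned up before merging |
| 🟡 Suggestion | `block_attn_spliced.cc:320-327` | Old commented-out `xpu_memcpy` code is left over and should be deleted |
| 🟡 Suggestion | `block_attn_spliced_bf16.txt` | C++ source committed as a `.txt` file; it looks like a debug artifact and should be deleted or converted to a proper `.cc` file |
| ❓ Question | `block_attn_spliced.cc:299-308` | The encoder spliced path generates globally sequential positions `[0..token_num-1]`; are the positions still correct with multiple batches? |
| ❓ Question | `block_attn_spliced.cc:1055,1170,1211` | The TR template parameter changed from a hard-coded `float` to `XPU_XType`; confirm that the `rotary_embs` tensor passed by callers has the same type as `XPU_XType` |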
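
The multi-batch question can be illustrated with a small host-side sketch. It assumes tokens from several sequences are packed back to back in one buffer; the function names and the `seq_lens` / `start_offsets` inputs are hypothetical and do not correspond to FastDeploy data structures. Whether the PR handles this case elsewhere is not visible from the diff excerpt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Global indices 0..token_num-1 over the packed token buffer.
std::vector<int64_t> global_positions(int64_t token_num) {
  std::vector<int64_t> pos(static_cast<std::size_t>(token_num));
  std::iota(pos.begin(), pos.end(), int64_t{0});
  return pos;
}

// Per-sequence positions: each sequence counts from its own start offset
// (for example, the number of tokens already processed for that sequence).
std::vector<int64_t> per_sequence_positions(
    const std::vector<int64_t>& seq_lens,
    const std::vector<int64_t>& start_offsets) {
  assert(seq_lens.size() == start_offsets.size());
  std::vector<int64_t> pos;
  for (std::size_t b = 0; b < seq_lens.size(); ++b) {
    for (int64_t t = 0; t < seq_lens[b]; ++t) {
      pos.push_back(start_offsets[b] + t);
    }
  }
  return pos;
}
```

With a single sequence starting at offset 0 the two functions agree. With two sequences of lengths 3 and 2, the per-sequence result restarts at 0 for the second sequence, whereas the global indices keep counting, so a rotary embedding driven by global indices would rotate the second sequence by the wrong angles.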

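Overall Assessment

The direction of the precision improvement is sound, and switching TGEMM from tfloat32 to float has a clear benefit. However, the new NEOX RoPE path has several unchecked API return values, leftover debug code and comments, and a .txt source file that appears to have been committed by mistake. These should be addressed before merging, together with the necessary accuracy comparison results.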

PaddlePaddle-bot · 2026-04-24T06:31:39Z

      PD_CHECK(ret == api::SUCCESS, "vsl_rotary_embedding_gptj failed.");
    } else {
-      ret = infer_ops::vsl_rotary_embedding_neox<TQKV, TR, TID>(
+      //   xpu_ctx->set_debug_level(0xa1);


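🟡 Suggestion: leftover debug code

// xpu_ctx->set_debug_level(0xa1); is debug-only code, and the same comment appears in two more places in this file (at the end of the encoder and inside the decoder). Remove all of them before merging so they do not mislead future maintainers.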

PaddlePaddle-bot · 2026-04-24T06:31:39Z

+      ret = api::do_host2device(xpu_ctx,
+                                positions_host.data(),
+                                positions_tensor.data<int64_t>(),
+                                token_num * sizeof(int64_t));


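🟡 Suggestion: api::do_host2device return value is not checked

Here ret is overwritten by the next call without being checked, so a failed H2D transfer is silently ignored and the subsequent rotary_embedding would use uninitialized positions data. Following the pattern used elsewhere in the file, add: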

PD_CHECK(ret == api::SUCCESS, "api::do_host2device failed.");
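
Why the missing check matters can be shown with a self-contained sketch. Everything here is a stand-in: `Status`, `check_ok`, and the two stub steps are hypothetical and only mimic the shape of the `ret = ...; PD_CHECK(...)` pattern, not the real XPU API or `PD_CHECK` itself.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical status codes; stand-ins for the real API's return values.
enum Status { SUCCESS = 0, FAILURE = 1 };

// Hypothetical stand-in for PD_CHECK: throws when a step did not succeed.
inline void check_ok(int ret, const char* what) {
  if (ret != SUCCESS) {
    throw std::runtime_error(std::string(what) + " failed.");
  }
}

// Stub steps standing in for calls such as a host-to-device transfer or a copy.
inline int step_that_fails() { return FAILURE; }
inline int step_that_succeeds() { return SUCCESS; }

// Unchecked: the failure code is overwritten by the next assignment.
inline int run_unchecked() {
  int ret = step_that_fails();
  ret = step_that_succeeds();
  return ret;  // reports SUCCESS even though the first step failed
}

// Checked: every result is verified before ret is reused.
inline int run_checked() {
  int ret = step_that_fails();
  check_ok(ret, "step_that_fails");
  ret = step_that_succeeds();
  check_ok(ret, "step_that_succeeds");
  return ret;
}
```

In this sketch `run_unchecked()` returns `SUCCESS` and hides the failed first step, while `run_checked()` throws at the first failing call, which is the behavior the review asks for at each unchecked call site.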

PaddlePaddle-bot · 2026-04-24T06:31:39Z

          xpu_ctx,
          reinterpret_cast<const TQKV*>(q_split.data()),
+          const_cast<TQKV*>(reinterpret_cast<const TQKV*>(q.data())),
+          token_num * q_num_heads * head_dim);


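🟡 Suggestion: api::copy<TQKV> (Q) return value is not checked

Neither of the two consecutive api::copy calls (Q and K) is followed by PD_CHECK; a failed copy would make rotary_embedding operate on wrong data. Add after each call: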

PD_CHECK(ret == api::SUCCESS, "api::copy q failed.");

PaddlePaddle-bot · 2026-04-24T06:31:39Z

          reinterpret_cast<const TQKV*>(k_split.data()),
-          reinterpret_cast<const float*>(rotary_embs.data<float>()),
+          const_cast<TQKV*>(reinterpret_cast<const TQKV*>(k.data())),
+          token_num * real_kv_num_heads * head_dim);


🟡 Suggestion: api::copy<TQKV> (K) return value is not checked

Same as above; add:

PD_CHECK(ret == api::SUCCESS, "api::copy k failed.");

PaddlePaddle-bot · 2026-04-24T06:31:39Z

-          reinterpret_cast<const float*>(rotary_embs.data<float>()),
+          const_cast<TQKV*>(reinterpret_cast<const TQKV*>(k.data())),
+          token_num * real_kv_num_heads * head_dim);
+      //   xpu_memcpy(const_cast<void*>(q.data()),


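🟡 Suggestion: commented-out xpu_memcpy code should be removed

The 8 lines of commented-out // xpu_memcpy(...) code are leftovers from development and have been superseded by the api::copy calls above. Delete them before merging to keep the code clean.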

PaddlePaddle-bot · 2026-04-24T06:31:39Z

+          paddle::empty({token_num}, paddle::DataType::INT64, qkv.place());
+      ret = api::copy<TQKV>(
          xpu_ctx,
          reinterpret_cast<const TQKV*>(q_split.data()),


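🟡 Suggestion: api::copy<TQKV> return value is not checked in the decoder path

In the decoder neox path, the two consecutive api::copy calls (Q/K) and the api::cast<int, int64_t> call all lack PD_CHECK, which is inconsistent with the rest of the file. Add an error check after each call.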

PaddlePaddle-bot · 2026-04-24T06:31:39Z

+          const_cast<TQKV*>(reinterpret_cast<const TQKV*>(k.data())),
+          token_num * real_kv_num_heads * head_dim);
+      ret = api::cast<int, int64_t>(xpu_ctx,
+                                    start_tokens.xpu,


🟡 Suggestion: api::cast<int, int64_t> return value is not checked

A failed cast of start_tokens.xpu to int64_t would likewise corrupt the positions data. Add:

PD_CHECK(ret == api::SUCCESS, "api::cast start_tokens failed.");

PaddlePaddle-bot · 2026-04-24T06:31:39Z

@@ -0,0 +1,1913 @@
+// Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved.


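🟡 Suggestion: suspected debug artifact, should not be committed

This file contains 1913 lines of C++ source but has a .txt extension, so it cannot take part in compilation. It looks like a backup or draft from development. Please confirm:

If it is production code, use the .cc extension and add it to CMakeLists;

If it is a temporary debug artifact, delete it before merging.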

PaddlePaddle-bot · 2026-04-28T09:29:18Z

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-04-28 17:28:27

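This CI report was generated from the following code (updated every 30 minutes):

PR commit: 1ecb2db
Merge base: a0c39cc (branch: develop)
View full diff
CI details

1 Task Overview

All required tasks passed (no required tasks are currently configured). 1 optional task failed, which does not block merging.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Pending | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 2(0) | 2 | 1 | 1 | 0 | 0 | 0 |

2 Task Status Summary

2.1 Required tasks: 0/0 passed

This PR has no required tasks configured (Branch Protection Rules define no required checks), so no required task blocks merging.

2.2 Optional tasks: 1/2 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ❌ | `Trigger Jenkins for PR` | 53s | Job | - |
| ✅ | The other 1 optional task passed | - | - | - |

3 Failure Details (required only)

No required tasks failed.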

lizan1999 temporarily deployed to Metax_ci April 20, 2026 07:49 — with GitHub Actions Inactive

paddle-bot Bot added the XPU label Apr 20, 2026

lizan1999 force-pushed the imporve_attn_precision branch from d48b804 to 8b1055f Compare April 23, 2026 12:12

lizan1999 had a problem deploying to Metax_ci April 23, 2026 12:12 — with GitHub Actions Failure

[XPU] improce attn precision

1ecb2db

lizan1999 force-pushed the imporve_attn_precision branch from 8b1055f to 1ecb2db Compare April 24, 2026 06:17

lizan1999 had a problem deploying to Metax_ci April 24, 2026 06:18 — with GitHub Actions Failure

PaddlePaddle-bot reviewed Apr 24, 2026

3 participants