Summary
Using cp.async from Rust-CUDA currently requires handwritten inline PTX plus manual shared-address conversion in user kernels.
In practice, the missing piece is an address-space-aware API/intrinsic for cp.async.cg.shared.global.
Problem
For an async copy path like:
- global source pointer
- shared-memory destination
- 16-byte copy (
4 x f32)
users currently end up writing something like:
unsafe fn shared_addr(ptr: *mut f32) -> u32 {
let mut addr: u64;
asm!(
"cvta.to.shared.u64 {dst}, {src};",
dst = out(reg64) addr,
src = in(reg64) ptr,
);
addr as u32
}
unsafe fn cp_async4(dst_shared: u32, src_global: *const f32) {
asm!(
"cp.async.cg.shared.global [{dst}], [{src}], 16;",
dst = in(reg32) dst_shared,
src = in(reg64) src_global,
);
}
and then precompute shared base addresses in the kernel and pass integer offsets into the helper.
If we instead pass generic/raw pointers through a helper, PTX tends to contain extra address-space conversion glue around the cp.async sites.
Why this matters
This is exactly the kind of operation where users want:
- a small, explicit intrinsic
- correct shared/global address-space handling
- no repeated manual inline PTX in every project
Right now, getting good code requires low-level PTX knowledge and manual control over shared address conversion.
Requested improvement
A Rust-CUDA intrinsic/helper for cp.async.cg.shared.global (or equivalent family) that:
- represents the destination as shared-memory address space explicitly
- avoids forcing users to manually convert
*mut T into a u32 shared address
- maps cleanly to the PTX async-copy instructions used in modern CUDA kernels
Even a low-level unsafe API would be useful if it preserves the right address-space semantics and avoids generic-pointer friction.
Summary
Using
cp.asyncfrom Rust-CUDA currently requires handwritten inline PTX plus manual shared-address conversion in user kernels.In practice, the missing piece is an address-space-aware API/intrinsic for
cp.async.cg.shared.global.Problem
For an async copy path like:
4 x f32)users currently end up writing something like:
and then precompute shared base addresses in the kernel and pass integer offsets into the helper.
If we instead pass generic/raw pointers through a helper, PTX tends to contain extra address-space conversion glue around the
cp.asyncsites.Why this matters
This is exactly the kind of operation where users want:
Right now, getting good code requires low-level PTX knowledge and manual control over shared address conversion.
Requested improvement
A Rust-CUDA intrinsic/helper for
cp.async.cg.shared.global(or equivalent family) that:*mut Tinto au32shared addressEven a low-level unsafe API would be useful if it preserves the right address-space semantics and avoids generic-pointer friction.