cp.async ergonomics: missing address-space-aware intrinsic forces inline PTX

## Summary

Using `cp.async` from Rust-CUDA currently requires handwritten inline PTX plus manual shared-address conversion in user kernels.

In practice, the missing piece is an address-space-aware API/intrinsic for `cp.async.cg.shared.global`.

## Problem

For an async copy path like:
- global source pointer
- shared-memory destination
- 16-byte copy (`4 x f32`)

users currently end up writing something like:

```rust
unsafe fn shared_addr(ptr: *mut f32) -> u32 {
    let mut addr: u64;
    asm!(
        "cvta.to.shared.u64 {dst}, {src};",
        dst = out(reg64) addr,
        src = in(reg64) ptr,
    );
    addr as u32
}

unsafe fn cp_async4(dst_shared: u32, src_global: *const f32) {
    asm!(
        "cp.async.cg.shared.global [{dst}], [{src}], 16;",
        dst = in(reg32) dst_shared,
        src = in(reg64) src_global,
    );
}
```

and then precompute shared base addresses in the kernel and pass integer offsets into the helper.

If we instead pass generic/raw pointers through a helper, PTX tends to contain extra address-space conversion glue around the `cp.async` sites.

## Why this matters

This is exactly the kind of operation where users want:
- a small, explicit intrinsic
- correct shared/global address-space handling
- no repeated manual inline PTX in every project

Right now, getting good code requires low-level PTX knowledge and manual control over shared address conversion.

## Requested improvement

A Rust-CUDA intrinsic/helper for `cp.async.cg.shared.global` (or equivalent family) that:
- represents the destination as shared-memory address space explicitly
- avoids forcing users to manually convert `*mut T` into a `u32` shared address
- maps cleanly to the PTX async-copy instructions used in modern CUDA kernels

Even a low-level unsafe API would be useful if it preserves the right address-space semantics and avoids generic-pointer friction.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cp.async ergonomics: missing address-space-aware intrinsic forces inline PTX #378

Summary

Problem

Why this matters

Requested improvement

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

cp.async ergonomics: missing address-space-aware intrinsic forces inline PTX #378

Description

Summary

Problem

Why this matters

Requested improvement

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions