# CUDA 13 TMA Descriptor Notes — CRITICAL REFERENCE

## Date: 2026-05-29

## Status: Verified on B200 (driver 580.126.20 = CUDA 13.0, toolkit 13.2, SM100)

## Three Breaking Changes from CUDA 12 → CUDA 13

### 1. `globalStrides` are now in BYTES, not elements

**CUDA 12:**
```c
uint64_t gs[] = {1, cols}; // element strides
```

**CUDA 13:**
```c
uint64_t gs[] = {cols * 2, cols * 2 * rows}; // byte strides (for BF16)
```

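This was the root cause of ALL `cuTensorMapEncodeTiled` failures returning
`CUDA_ERROR_INVALID_VALUE` (error=1). The old element-based strides, read as
bytes, give values such as (1, 64). The leading 1 is not a multiple of 16,
which violates the constraint that every `globalStrides[i]` must be a multiple
of 16 bytes. Note that `globalStrides` holds `tensorRank - 1` entries: the
stride of the innermost dimension is implicit, so the first entry is the byte
stride of the second dimension.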
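
As a minimal sketch of the conversion, assuming a packed row-major tensor whose innermost dimension is listed first, the byte strides can be derived from the dimensions and checked against the 16-byte rule before calling `cuTensorMapEncodeTiled`. The helper names `tma_byte_strides` and `tma_strides_aligned` are hypothetical and not part of the CUDA API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill byte strides for a packed row-major tensor.
 * dims[0] is the innermost (contiguous) dimension, as in globalDim.
 * The innermost stride is implicit, so only rank - 1 entries are written. */
static void tma_byte_strides(const uint64_t *dims, size_t rank,
                             size_t elem_bytes, uint64_t *strides_out)
{
    uint64_t stride = elem_bytes;
    for (size_t i = 0; i + 1 < rank; ++i) {
        stride *= dims[i];        /* bytes to step once along dim i + 1 */
        strides_out[i] = stride;
    }
}

/* Hypothetical helper: true when every stride is a multiple of 16 bytes. */
static bool tma_strides_aligned(const uint64_t *strides, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (strides[i] % 16 != 0) {
            return false;
        }
    }
    return true;
}
```

With `dims = {64, 128, 1}` and 2-byte BF16 elements this yields strides of 128 and 16384 bytes, matching `{cols * 2, cols * 2 * rows}`. The element-style pair (1, 64) fails the check because of the leading 1.
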
### 2. `tensorRank` should be at least 2 (1D still works but is limited)

CUDA 13 `cuTensorMapEncodeTiled` supports ranks 1 through 5. Rank 2 and above
work with byte strides. Rank 1 is unaffected by the stride change because a
rank-1 descriptor has no `globalStrides` entries to convert.

**For 2D+ descriptors, always use byte strides.**

### 3. `BFLOAT16` data type is now available

`CU_TENSOR_MAP_DATA_TYPE_BFLOAT16` exists in CUDA 13. Use it instead of
`CU_TENSOR_MAP_DATA_TYPE_UINT16` for BF16 tensors.

## TMA Descriptor Creation via Driver API — KNOWN ISSUE

On driver 580.126.20 (CUDA 13.0) + toolkit 13.2:
- `cuTensorMapEncodeTiled` succeeds for 2D/3D descriptors with byte strides ✅
- BUT the `cp.async.bulk.tensor.{2d,3d}` PTX instruction HANGS with these descriptors ❌
- The mbarrier never signals completion

Root cause (suspected, not confirmed): a descriptor format mismatch. With
toolkit 13.2, `cuTensorMapEncodeTiled` may produce descriptors that the TMA
hardware under driver 13.0 can't read. CUTLASS has driver-version-specific
workarounds (see `copy_traits_sm90_tma.hpp`, which clears bit 21 of desc[1]
for driver <= 13.1 with small tensors).
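
One way to surface this kind of driver/toolkit skew at startup is to compare the two version numbers. The sketch below assumes the integers come from `cuDriverGetVersion` and the `CUDA_VERSION` macro, both of which encode a version as 1000 * major + 10 * minor. The helper names are hypothetical, and the check only flags a mismatch; it does not confirm the root cause.

```c
#include <stdbool.h>

/* CUDA encodes versions as 1000 * major + 10 * minor (13.0 -> 13000). */
static int cuda_version_major(int version) { return version / 1000; }
static int cuda_version_minor(int version) { return (version % 1000) / 10; }

/* Hypothetical check: true when the installed driver reports an older
 * CUDA version than the toolkit the code was built against. */
static bool driver_older_than_toolkit(int driver_version, int toolkit_version)
{
    return driver_version < toolkit_version;
}
```

For the setup in these notes, the driver would report 13000 and the toolkit 13020, so the check returns true.
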

**Working path:** Use CuTeDSL's `tma_partition` to create descriptors.
CuTeDSL handles the driver version internally and produces descriptors
that the GPU TMA hardware accepts.

## 3D Descriptors (recommended for CUDA 13)

Use 3D descriptors with a degenerate 3rd dimension of 1:
```c
uint64_t gd[] = {cols, rows, 1};              // global dims in elements, innermost first
uint64_t gs[] = {cols * 2, cols * 2 * rows};  // byte strides for BF16 (rank - 1 entries)
uint32_t td[] = {tile_cols, tile_rows, 1};    // tile (box) dims in elements
uint32_t ts[] = {1, 1, 1};                    // element strides within tile
```

Kernel uses `cp.async.bulk.tensor.3d` with coordinates {x, y, 0}.
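
The arrays above can be assembled and sanity-checked in plain C before they are handed to `cuTensorMapEncodeTiled`. This is a sketch under assumptions: the struct and function names are hypothetical, the tensor is packed row-major, and the checks cover only a few of the documented constraints (16-byte stride multiples, box dimensions of at most 256, and an inner box extent that is a multiple of 16 bytes when no interleave is used).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the four arrays of a rank-3 tiled descriptor. */
typedef struct {
    uint64_t global_dim[3];    /* {cols, rows, 1}, in elements */
    uint64_t global_stride[2]; /* rank - 1 entries, in bytes */
    uint32_t box_dim[3];       /* tile extent, in elements */
    uint32_t elem_stride[3];   /* traversal stride within the tile */
} Tma3dParams;

/* Build the parameters for a 2D matrix padded to 3D with a unit dimension. */
static Tma3dParams tma3d_params(uint64_t cols, uint64_t rows,
                                uint32_t tile_cols, uint32_t tile_rows,
                                uint32_t elem_bytes)
{
    Tma3dParams p = {
        .global_dim    = {cols, rows, 1},
        .global_stride = {cols * elem_bytes, cols * elem_bytes * rows},
        .box_dim       = {tile_cols, tile_rows, 1},
        .elem_stride   = {1, 1, 1},
    };
    return p;
}

/* Partial pre-flight check; passing it does not guarantee a valid descriptor. */
static bool tma3d_params_plausible(const Tma3dParams *p, uint32_t elem_bytes)
{
    for (int i = 0; i < 2; ++i) {
        if (p->global_stride[i] % 16 != 0) return false;
    }
    for (int i = 0; i < 3; ++i) {
        if (p->box_dim[i] == 0 || p->box_dim[i] > 256) return false;
    }
    /* Inner box extent in bytes must be a multiple of 16 (no interleave). */
    return ((uint64_t)p->box_dim[0] * elem_bytes) % 16 == 0;
}
```

The resulting `global_dim`, `global_stride`, `box_dim`, and `elem_stride` arrays correspond to `gd`, `gs`, `td`, and `ts` above, with `tensorRank` set to 3.
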
## mbarrier for TMA
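
For `complete_tx::bytes` mode, two initializations were tried:
- `mbarrier.init` with expected count = number of bytes to transfer
  (e.g., 128 * 16 * 2 = 4096 for a (128,16) BF16 tile)
- OR count = 1 (some implementations use this)

Both have been tested; the hang is NOT caused by the mbarrier count.
For reference, PTX treats the `mbarrier.init` count as the number of expected
arrivals and tracks the byte total separately through
`mbarrier.arrive.expect_tx` (or `mbarrier.expect_tx`), so count = 1 with an
`expect_tx` of 4096 bytes is the conventional form.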
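
The byte count in the example above is the tile's footprint in shared memory. A small sketch, with hypothetical helper names, that computes it for one tile and sums it across several tiles that signal the same barrier:

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes that one TMA tile load deposits in shared memory. */
static uint32_t tma_tile_bytes(uint32_t tile_rows, uint32_t tile_cols,
                               uint32_t elem_bytes)
{
    return tile_rows * tile_cols * elem_bytes;
}

/* Total transaction bytes when several tile loads complete on one mbarrier. */
static uint32_t tma_total_tx_bytes(const uint32_t *tile_bytes, size_t count)
{
    uint32_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        total += tile_bytes[i];
    }
    return total;
}
```

For the (128,16) BF16 tile, `tma_tile_bytes(128, 16, 2)` gives 4096, the value used in the notes.
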
## Files Archived

The driver-API TMA implementation is archived at:
- `dsv4/kernels/attention/archive/fmha_tma_driver_api.cuh` — descriptor helpers
- `dsv4/kernels/attention/archive/fmha_6warp_tma_driver_api.cuh` — TMA kernel
- `tests/unit/archive/test_fmha_tma_driver_api.cu` — test

These files work correctly EXCEPT for the descriptor format issue described
above: encoding succeeds, but the TMA load hangs.
When the B200 driver is updated to 13.2+, they may work directly.