docs/cuda13_tma_notes.md

# CUDA 13 TMA Descriptor Notes — CRITICAL REFERENCE

## Date: 2026-05-29
## Status: Verified on B200 (driver 580.126.20 = CUDA 13.0, toolkit 13.2, SM100)

## Three Breaking Changes from CUDA 12 → CUDA 13

### 1. `globalStrides` are now in BYTES, not elements

**CUDA 12:**
```c
uint64_t gs[] = {1, cols};  // element strides
```

**CUDA 13:**
```c
uint64_t gs[] = {cols * 2, cols * 2 * rows};  // byte strides (for BF16)
```

This was the root cause of ALL `cuTensorMapEncodeTiled` failures returning
INVALID_VALUE (error=1). The old element-based strides produce byte values
(1, 64) which aren't multiples of 16 — violating the constraint that
`globalStrides[i]` must be a multiple of 16 bytes.

### 2. `tensorRank` minimum is 2 (1D still works but limited)

CUDA 13 `cuTensorMapEncodeTiled` supports rank 1-5. Rank 2+ works with byte
strides. Rank 1 works with element strides (no strides to convert).

**For 2D+ descriptors, always use byte strides.**

### 3. `BFLOAT16` data type is now available

`CU_TENSOR_MAP_DATA_TYPE_BFLOAT16` exists in CUDA 13. Use it instead of
`CU_TENSOR_MAP_DATA_TYPE_UINT16` for BF16 tensors.

## TMA Descriptor Creation via Driver API — KNOWN ISSUE

On driver 580.126.20 (CUDA 13.0) + toolkit 13.2:
- `cuTensorMapEncodeTiled` succeeds for 2D/3D descriptors with byte strides ✅
- BUT `cp.async.bulk.tensor.{2d,3d}` PTX instruction HANGS with these descriptors ❌
- mbarrier never signals completion

Root cause: likely a descriptor format mismatch. The toolkit 13.2
`cuTensorMapEncodeTiled` may produce descriptors that the driver 13.0
TMA hardware can't read. CUTLASS has driver-version-specific workarounds
(see `copy_traits_sm90_tma.hpp` — they clear bit 21 of desc[1] for
driver <= 13.1 with small tensors).

**Working path:** Use CuTeDSL's `tma_partition` to create descriptors.
CuTeDSL handles the driver version internally and produces descriptors
that the GPU TMA hardware accepts.

## 3D Descriptors (recommended for CUDA 13)

Use 3D descriptors with degenerate 3rd dimension = 1:
```c
uint64_t gd[] = {cols, rows, 1};
uint64_t gs[] = {cols * 2, cols * 2 * rows};  // byte strides
uint32_t td[] = {tile_cols, tile_rows, 1};
uint32_t ts[] = {1, 1, 1};  // element strides within tile
```

Kernel uses `cp.async.bulk.tensor.3d` with coordinates {x, y, 0}.

## mbarrier for TMA

For `complete_tx::bytes` mode:
- `mbarrier.init` expected count = number of bytes to transfer
  (e.g., 128 * 16 * 2 = 4096 for a (128,16) BF16 tile)
- OR count = 1 (some implementations use this)

Both have been tested; the hang is NOT caused by the mbarrier count.

## Files Archived

The driver-API TMA implementation is archived at:
- `dsv4/kernels/attention/archive/fmha_tma_driver_api.cuh` — descriptor helpers
- `dsv4/kernels/attention/archive/fmha_6warp_tma_driver_api.cuh` — TMA kernel
- `tests/unit/archive/test_fmha_tma_driver_api.cu` — test

These work correctly EXCEPT for the descriptor format issue.
When the B200 driver is updated to 13.2+, these may work directly.
archive: TMA driver-API files + CUDA 13 TMA discovery notes Key findings documented in docs/cuda13_tma_notes.md: - CUDA 13 globalStrides are in BYTES not elements (root cause of desc creation failures) - BFLOAT16 data type available in CUDA 13 - Driver API descriptors create OK but cp.async.bulk.tensor hangs on driver 13.0 + toolkit 13.2 - CuTeDSL tma_partition works (production path) Archived (not deleted): - fmha_tma_driver_api.cuh, fmha_6warp_tma_driver_api.cuh, test_fmha_tma_driver_api.cu - These will work once driver matches toolkit version 2026-05-29 06:52:39 +00:00			`# CUDA 13 TMA Descriptor Notes — CRITICAL REFERENCE`

			`## Date: 2026-05-29`
			`## Status: Verified on B200 (driver 580.126.20 = CUDA 13.0, toolkit 13.2, SM100)`

			`## Three Breaking Changes from CUDA 12 → CUDA 13`

			### 1. `globalStrides` are now in BYTES, not elements

			`CUDA 12:`
			```c
			`uint64_t gs[] = {1, cols}; // element strides`
			```

			`CUDA 13:`
			```c
			`uint64_t gs[] = {cols * 2, cols * 2 * rows}; // byte strides (for BF16)`
			```

			This was the root cause of ALL `cuTensorMapEncodeTiled` failures returning
			`INVALID_VALUE (error=1). The old element-based strides produce byte values`
			`(1, 64) which aren't multiples of 16 — violating the constraint that`
			`globalStrides[i]` must be a multiple of 16 bytes.

			### 2. `tensorRank` minimum is 2 (1D still works but limited)

			CUDA 13 `cuTensorMapEncodeTiled` supports rank 1-5. Rank 2+ works with byte
			`strides. Rank 1 works with element strides (no strides to convert).`

			`For 2D+ descriptors, always use byte strides.`

			### 3. `BFLOAT16` data type is now available

			`CU_TENSOR_MAP_DATA_TYPE_BFLOAT16` exists in CUDA 13. Use it instead of
			`CU_TENSOR_MAP_DATA_TYPE_UINT16` for BF16 tensors.

			`## TMA Descriptor Creation via Driver API — KNOWN ISSUE`

			`On driver 580.126.20 (CUDA 13.0) + toolkit 13.2:`
			- `cuTensorMapEncodeTiled` succeeds for 2D/3D descriptors with byte strides ✅
			- BUT `cp.async.bulk.tensor.{2d,3d}` PTX instruction HANGS with these descriptors ❌
			`- mbarrier never signals completion`

			`Root cause: likely a descriptor format mismatch. The toolkit 13.2`
			`cuTensorMapEncodeTiled` may produce descriptors that the driver 13.0
			`TMA hardware can't read. CUTLASS has driver-version-specific workarounds`
			(see `copy_traits_sm90_tma.hpp` — they clear bit 21 of desc[1] for
			`driver <= 13.1 with small tensors).`

			Working path: Use CuTeDSL's `tma_partition` to create descriptors.
			`CuTeDSL handles the driver version internally and produces descriptors`
			`that the GPU TMA hardware accepts.`

			`## 3D Descriptors (recommended for CUDA 13)`

			`Use 3D descriptors with degenerate 3rd dimension = 1:`
			```c
			`uint64_t gd[] = {cols, rows, 1};`
			`uint64_t gs[] = {cols * 2, cols * 2 * rows}; // byte strides`
			`uint32_t td[] = {tile_cols, tile_rows, 1};`
			`uint32_t ts[] = {1, 1, 1}; // element strides within tile`
			```

			Kernel uses `cp.async.bulk.tensor.3d` with coordinates {x, y, 0}.

			`## mbarrier for TMA`

			For `complete_tx::bytes` mode:
			- `mbarrier.init` expected count = number of bytes to transfer
			`(e.g., 128 * 16 * 2 = 4096 for a (128,16) BF16 tile)`
			`- OR count = 1 (some implementations use this)`

			`Both have been tested; the hang is NOT caused by the mbarrier count.`

			`## Files Archived`

			`The driver-API TMA implementation is archived at:`
			- `dsv4/kernels/attention/archive/fmha_tma_driver_api.cuh` — descriptor helpers
			- `dsv4/kernels/attention/archive/fmha_6warp_tma_driver_api.cuh` — TMA kernel
			- `tests/unit/archive/test_fmha_tma_driver_api.cu` — test

			`These work correctly EXCEPT for the descriptor format issue.`
			`When the B200 driver is updated to 13.2+, these may work directly.`