Files

biondizzle a40c05f3f2 archive: TMA driver-API files + CUDA 13 TMA discovery notes

Key findings documented in docs/cuda13_tma_notes.md:
- CUDA 13 globalStrides are in BYTES not elements (root cause of desc creation failures)
- BFLOAT16 data type available in CUDA 13
- Driver API descriptors create OK but cp.async.bulk.tensor hangs on driver 13.0 + toolkit 13.2
- CuTeDSL tma_partition works (production path)

Archived (not deleted):
- fmha_tma_driver_api.cuh, fmha_6warp_tma_driver_api.cuh, test_fmha_tma_driver_api.cu
- These will work once driver matches toolkit version

2026-05-29 06:52:39 +00:00

2.9 KiB

Raw Blame History

CUDA 13 TMA Descriptor Notes — CRITICAL REFERENCE

Date: 2026-05-29

Status: Verified on B200 (driver 580.126.20 = CUDA 13.0, toolkit 13.2, SM100)

Three Breaking Changes from CUDA 12 → CUDA 13

1. `globalStrides` are now in BYTES, not elements

CUDA 12:

uint64_t gs[] = {1, cols};  // element strides

CUDA 13:

uint64_t gs[] = {cols * 2, cols * 2 * rows};  // byte strides (for BF16)

This was the root cause of ALL cuTensorMapEncodeTiled failures returning INVALID_VALUE (error=1). The old element-based strides produce byte values (1, 64) which aren't multiples of 16 — violating the constraint that globalStrides[i] must be a multiple of 16 bytes.

2. `tensorRank` minimum is 2 (1D still works but limited)

CUDA 13 cuTensorMapEncodeTiled supports rank 1-5. Rank 2+ works with byte strides. Rank 1 works with element strides (no strides to convert).

For 2D+ descriptors, always use byte strides.

3. `BFLOAT16` data type is now available

CU_TENSOR_MAP_DATA_TYPE_BFLOAT16 exists in CUDA 13. Use it instead of CU_TENSOR_MAP_DATA_TYPE_UINT16 for BF16 tensors.

TMA Descriptor Creation via Driver API — KNOWN ISSUE

On driver 580.126.20 (CUDA 13.0) + toolkit 13.2:

cuTensorMapEncodeTiled succeeds for 2D/3D descriptors with byte strides ✅
BUT cp.async.bulk.tensor.{2d,3d} PTX instruction HANGS with these descriptors ❌
mbarrier never signals completion

Root cause: likely a descriptor format mismatch. The toolkit 13.2 cuTensorMapEncodeTiled may produce descriptors that the driver 13.0 TMA hardware can't read. CUTLASS has driver-version-specific workarounds (see copy_traits_sm90_tma.hpp — they clear bit 21 of desc[1] for driver <= 13.1 with small tensors).

Working path: Use CuTeDSL's tma_partition to create descriptors. CuTeDSL handles the driver version internally and produces descriptors that the GPU TMA hardware accepts.

3D Descriptors (recommended for CUDA 13)

Use 3D descriptors with degenerate 3rd dimension = 1:

uint64_t gd[] = {cols, rows, 1};
uint64_t gs[] = {cols * 2, cols * 2 * rows};  // byte strides
uint32_t td[] = {tile_cols, tile_rows, 1};
uint32_t ts[] = {1, 1, 1};  // element strides within tile

Kernel uses cp.async.bulk.tensor.3d with coordinates {x, y, 0}.

mbarrier for TMA

For complete_tx::bytes mode:

mbarrier.init expected count = number of bytes to transfer (e.g., 128 * 16 * 2 = 4096 for a (128,16) BF16 tile)
OR count = 1 (some implementations use this)

Both have been tested; the hang is NOT caused by the mbarrier count.

Files Archived

The driver-API TMA implementation is archived at:

dsv4/kernels/attention/archive/fmha_tma_driver_api.cuh — descriptor helpers
dsv4/kernels/attention/archive/fmha_6warp_tma_driver_api.cuh — TMA kernel
tests/unit/archive/test_fmha_tma_driver_api.cu — test

These work correctly EXCEPT for the descriptor format issue. When the B200 driver is updated to 13.2+, these may work directly.

2.9 KiB Raw Blame History