Key findings documented in docs/cuda13_tma_notes.md: - CUDA 13 globalStrides are in BYTES not elements (root cause of desc creation failures) - BFLOAT16 data type available in CUDA 13 - Driver API descriptors create OK but cp.async.bulk.tensor hangs on driver 13.0 + toolkit 13.2 - CuTeDSL tma_partition works (production path) Archived (not deleted): - fmha_tma_driver_api.cuh, fmha_6warp_tma_driver_api.cuh, test_fmha_tma_driver_api.cu - These will work once driver matches toolkit version
2.9 KiB
CUDA 13 TMA Descriptor Notes — CRITICAL REFERENCE
Date: 2026-05-29
Status: Verified on B200 (driver 580.126.20 = CUDA 13.0, toolkit 13.2, SM100)
Three Breaking Changes from CUDA 12 → CUDA 13
1. globalStrides are now in BYTES, not elements
CUDA 12:
uint64_t gs[] = {1, cols}; // element strides
CUDA 13:
uint64_t gs[] = {cols * 2, cols * 2 * rows}; // byte strides (for BF16)
This was the root cause of ALL cuTensorMapEncodeTiled failures returning
INVALID_VALUE (error=1). The old element-based strides produce byte values
(1, 64) which aren't multiples of 16 — violating the constraint that
globalStrides[i] must be a multiple of 16 bytes.
2. tensorRank minimum is 2 (1D still works but limited)
CUDA 13 cuTensorMapEncodeTiled supports rank 1-5. Rank 2+ works with byte
strides. Rank 1 works with element strides (no strides to convert).
For 2D+ descriptors, always use byte strides.
3. BFLOAT16 data type is now available
CU_TENSOR_MAP_DATA_TYPE_BFLOAT16 exists in CUDA 13. Use it instead of
CU_TENSOR_MAP_DATA_TYPE_UINT16 for BF16 tensors.
TMA Descriptor Creation via Driver API — KNOWN ISSUE
On driver 580.126.20 (CUDA 13.0) + toolkit 13.2:
cuTensorMapEncodeTiledsucceeds for 2D/3D descriptors with byte strides ✅- BUT
cp.async.bulk.tensor.{2d,3d}PTX instruction HANGS with these descriptors ❌ - mbarrier never signals completion
Root cause: likely a descriptor format mismatch. The toolkit 13.2
cuTensorMapEncodeTiled may produce descriptors that the driver 13.0
TMA hardware can't read. CUTLASS has driver-version-specific workarounds
(see copy_traits_sm90_tma.hpp — they clear bit 21 of desc[1] for
driver <= 13.1 with small tensors).
Working path: Use CuTeDSL's tma_partition to create descriptors.
CuTeDSL handles the driver version internally and produces descriptors
that the GPU TMA hardware accepts.
3D Descriptors (recommended for CUDA 13)
Use 3D descriptors with degenerate 3rd dimension = 1:
uint64_t gd[] = {cols, rows, 1};
uint64_t gs[] = {cols * 2, cols * 2 * rows}; // byte strides
uint32_t td[] = {tile_cols, tile_rows, 1};
uint32_t ts[] = {1, 1, 1}; // element strides within tile
Kernel uses cp.async.bulk.tensor.3d with coordinates {x, y, 0}.
mbarrier for TMA
For complete_tx::bytes mode:
mbarrier.initexpected count = number of bytes to transfer (e.g., 128 * 16 * 2 = 4096 for a (128,16) BF16 tile)- OR count = 1 (some implementations use this)
Both have been tested; the hang is NOT caused by the mbarrier count.
Files Archived
The driver-API TMA implementation is archived at:
dsv4/kernels/attention/archive/fmha_tma_driver_api.cuh— descriptor helpersdsv4/kernels/attention/archive/fmha_6warp_tma_driver_api.cuh— TMA kerneltests/unit/archive/test_fmha_tma_driver_api.cu— test
These work correctly EXCEPT for the descriptor format issue. When the B200 driver is updated to 13.2+, these may work directly.