Files
nvfp4-megamoe-kernel/docs/cuda13_tma_notes.md
biondizzle a40c05f3f2 archive: TMA driver-API files + CUDA 13 TMA discovery notes
Key findings documented in docs/cuda13_tma_notes.md:
- CUDA 13 globalStrides are in BYTES not elements (root cause of desc creation failures)
- BFLOAT16 data type available in CUDA 13
- Driver API descriptors create OK but cp.async.bulk.tensor hangs on driver 13.0 + toolkit 13.2
- CuTeDSL tma_partition works (production path)

Archived (not deleted):
- fmha_tma_driver_api.cuh, fmha_6warp_tma_driver_api.cuh, test_fmha_tma_driver_api.cu
- These will work once driver matches toolkit version
2026-05-29 06:52:39 +00:00

2.9 KiB

CUDA 13 TMA Descriptor Notes — CRITICAL REFERENCE

Date: 2026-05-29

Status: Verified on B200 (driver 580.126.20 = CUDA 13.0, toolkit 13.2, SM100)

Three Breaking Changes from CUDA 12 → CUDA 13

1. globalStrides are now in BYTES, not elements

CUDA 12:

uint64_t gs[] = {1, cols};  // element strides

CUDA 13:

uint64_t gs[] = {cols * 2, cols * 2 * rows};  // byte strides (for BF16)

This was the root cause of ALL cuTensorMapEncodeTiled failures returning INVALID_VALUE (error=1). The old element-based strides produce byte values (1, 64) which aren't multiples of 16 — violating the constraint that globalStrides[i] must be a multiple of 16 bytes.

2. tensorRank minimum is 2 (1D still works but limited)

CUDA 13 cuTensorMapEncodeTiled supports rank 1-5. Rank 2+ works with byte strides. Rank 1 works with element strides (no strides to convert).

For 2D+ descriptors, always use byte strides.

3. BFLOAT16 data type is now available

CU_TENSOR_MAP_DATA_TYPE_BFLOAT16 exists in CUDA 13. Use it instead of CU_TENSOR_MAP_DATA_TYPE_UINT16 for BF16 tensors.

TMA Descriptor Creation via Driver API — KNOWN ISSUE

On driver 580.126.20 (CUDA 13.0) + toolkit 13.2:

  • cuTensorMapEncodeTiled succeeds for 2D/3D descriptors with byte strides
  • BUT cp.async.bulk.tensor.{2d,3d} PTX instruction HANGS with these descriptors
  • mbarrier never signals completion

Root cause: likely a descriptor format mismatch. The toolkit 13.2 cuTensorMapEncodeTiled may produce descriptors that the driver 13.0 TMA hardware can't read. CUTLASS has driver-version-specific workarounds (see copy_traits_sm90_tma.hpp — they clear bit 21 of desc[1] for driver <= 13.1 with small tensors).

Working path: Use CuTeDSL's tma_partition to create descriptors. CuTeDSL handles the driver version internally and produces descriptors that the GPU TMA hardware accepts.

Use 3D descriptors with degenerate 3rd dimension = 1:

uint64_t gd[] = {cols, rows, 1};
uint64_t gs[] = {cols * 2, cols * 2 * rows};  // byte strides
uint32_t td[] = {tile_cols, tile_rows, 1};
uint32_t ts[] = {1, 1, 1};  // element strides within tile

Kernel uses cp.async.bulk.tensor.3d with coordinates {x, y, 0}.

mbarrier for TMA

For complete_tx::bytes mode:

  • mbarrier.init expected count = number of bytes to transfer (e.g., 128 * 16 * 2 = 4096 for a (128,16) BF16 tile)
  • OR count = 1 (some implementations use this)

Both have been tested; the hang is NOT caused by the mbarrier count.

Files Archived

The driver-API TMA implementation is archived at:

  • dsv4/kernels/attention/archive/fmha_tma_driver_api.cuh — descriptor helpers
  • dsv4/kernels/attention/archive/fmha_6warp_tma_driver_api.cuh — TMA kernel
  • tests/unit/archive/test_fmha_tma_driver_api.cu — test

These work correctly EXCEPT for the descriptor format issue. When the B200 driver is updated to 13.2+, these may work directly.