Evidence: TMA loads succeed with 128B-aligned GMEM on all descriptor configs. The bit-21 workaround was NOT needed. The 'misaligned address' crashes were caused by passing non-128B-aligned GMEM pointers to cp.async.bulk.tensor. Added docs/p4_tma_hang_resolution.md with root cause and fix. Cleaned up stale P4 test files.
32 lines
1.4 KiB
Markdown
32 lines
1.4 KiB
Markdown
# P4 Resolution: TMA Hang Root Cause
|
|
|
|
## Date: 2026-05-30
|
|
## Status: RESOLVED
|
|
|
|
## Root Cause
|
|
|
|
The TMA `cp.async.bulk.tensor.2d` hang was caused by **GMEM pointer misalignment**, NOT by descriptor format, swizzle mode, driver version, or bit-21 issues.
|
|
|
|
## Evidence
|
|
|
|
1. With 128-byte aligned GMEM and SMEM, TMA loads succeed with ALL descriptor configs (NO swizzle, SWIZZLE_128B, OOB_FILL_ZERO).
|
|
2. With `cudaMalloc` (which guarantees only 256-byte alignment for the allocation base, but the actual tensor data pointer may not be 128B-aligned if offset), TMA loads crash with "misaligned address" or hang (mbarrier never signals).
|
|
3. The bit-21 workaround was NOT needed — both original and bit-21-cleared descriptors work identically.
|
|
|
|
## Fix
|
|
|
|
For TMA loads in raw CUDA kernels:
|
|
1. Ensure GMEM tensor base address is 128-byte aligned (`posix_memalign` or `cudaMalloc` + offset alignment check)
|
|
2. Ensure SMEM buffer is 128-byte aligned (`__attribute__((aligned(128)))`)
|
|
3. Use `cuTensorMapEncodeTiled` with outermost-first dims, byte strides
|
|
|
|
## Descriptor bytes (128, 16) BF16, NO swizzle:
|
|
```
|
|
[0-7]: 00 00 00 f7 c3 71 00 00 (GMEM address)
|
|
[8-15]: 10 05 00 00 02 00 00 00 (dims, strides, flags)
|
|
```
|
|
|
|
Bit 21 of word[1] is 0 in this case (address already has bit 21 clear).
|
|
The bit-21 issue from CUTLASS only affects very small tensors where the
|
|
address itself has bit 21 set.
|