Files
nvfp4-megamoe-kernel/docs/p4_tma_hang_resolution.md
biondizzle 6421f7c3f3 P4 RESOLVED: TMA hang was GMEM misalignment, not descriptor/driver issue
Evidence: TMA loads succeed with 128B-aligned GMEM on all descriptor configs.
The bit-21 workaround was NOT needed. The 'misaligned address' crashes were
caused by passing non-128B-aligned GMEM pointers to cp.async.bulk.tensor.

Added docs/p4_tma_hang_resolution.md with root cause and fix.
Cleaned up stale P4 test files.
2026-05-30 08:42:18 +00:00

1.4 KiB

P4 Resolution: TMA Hang Root Cause

Date: 2026-05-30

Status: RESOLVED

Root Cause

The TMA cp.async.bulk.tensor.2d hang was caused by GMEM pointer misalignment, NOT by descriptor format, swizzle mode, driver version, or bit-21 issues.

Evidence

  1. With 128-byte aligned GMEM and SMEM, TMA loads succeed with ALL descriptor configs (NO swizzle, SWIZZLE_128B, OOB_FILL_ZERO).
  2. With cudaMalloc (which guarantees only 256-byte alignment for the allocation base, but the actual tensor data pointer may not be 128B-aligned if offset), TMA loads crash with "misaligned address" or hang (mbarrier never signals).
  3. The bit-21 workaround was NOT needed — both original and bit-21-cleared descriptors work identically.

Fix

For TMA loads in raw CUDA kernels:

  1. Ensure GMEM tensor base address is 128-byte aligned (posix_memalign or cudaMalloc + offset alignment check)
  2. Ensure SMEM buffer is 128-byte aligned (__attribute__((aligned(128))))
  3. Use cuTensorMapEncodeTiled with outermost-first dims, byte strides

Descriptor bytes (128, 16) BF16, NO swizzle:

[0-7]:  00 00 00 f7 c3 71 00 00   (GMEM address)
[8-15]: 10 05 00 00 02 00 00 00   (dims, strides, flags)

Bit 21 of word[1] is 0 in this case (address already has bit 21 clear). The bit-21 issue from CUTLASS only affects very small tensors where the address itself has bit 21 set.