Both indexer files now use a constexpr LUT matching Python's
E2M1_MAGNITUDES = [0, 0.5, 1, 1.5, 2, 3, 4, 6].
This is cleaner and more auditable than bit-manipulation.
The dequant_fp4_scalar function was treating the three magnitude bits as
a raw integer (0-7) instead of decoding them as E2M1 floating point:
Old (WRONG): val = (int)(nibble & 0x07) * scale
New (CORRECT): proper E2M1 decode with exponent + mantissa
E2M1 encoding (bias=1):
exp=0 subnormal: 0b000=0, 0b001=0.5
exp=1: 0b010=1, 0b011=1.5
exp=2: 0b100=2, 0b101=3
exp=3: 0b110=4, 0b111=6
Bug found by an outside consultant. Affects indexer top-k selection
correctness: wrong FP4 key decoding would select the wrong CSA blocks.
Fixed in both:
- dsv4/kernels/indexer/indexer_score_topk.cu
- dsv4/kernels/cuda/indexer_score_topk.cu
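
The difference between the old and new decode can be sketched in host-side C++ as below. This is a minimal illustration, not the code in the two indexer files: the names `kE2M1Magnitudes`, `decode_e2m1`, `decode_as_raw_int`, and `dequant_fp4` are hypothetical, and treating bit 3 of the nibble as the sign bit is an assumption that the message above does not state.

```cpp
#include <cstdint>

// Magnitudes for the 3-bit E2M1 field, in the order given in the
// encoding table above (index = nibble & 0x07).
constexpr float kE2M1Magnitudes[8] = {0.0f, 0.5f, 1.0f, 1.5f,
                                      2.0f, 3.0f, 4.0f, 6.0f};

// LUT-based decode. Assumption: bit 3 of the nibble is the sign bit.
constexpr float decode_e2m1(std::uint8_t nibble) {
    const float magnitude = kE2M1Magnitudes[nibble & 0x07];
    return (nibble & 0x08) ? -magnitude : magnitude;
}

// The old behaviour, kept only for comparison: the magnitude bits are
// read as a plain integer, so 0b111 becomes 7 instead of 6.
constexpr float decode_as_raw_int(std::uint8_t nibble) {
    return static_cast<float>(nibble & 0x07);
}

// Dequantize one value with a per-block scale (scale handling is simplified).
constexpr float dequant_fp4(std::uint8_t nibble, float scale) {
    return decode_e2m1(nibble) * scale;
}
```

Only the codes 0b000 (0) and 0b010... do not all agree between the two decoders: the raw-integer reading matches E2M1 just for 0, and differs for every other code, which is why the relative ordering of keys (and hence top-k) could change.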
Production FMHA kernel template for Blackwell SM100:
- FmhaSm100Kernel<HD>::launch(q, k, v, o, s_k, scale, stream)
- QK: SS MMA N=128, one K-tile at a time
- PV: SS MMA N=16 sub-tiles (HD/16 calls per K-tile)
- Epilogue: TMEM → regs → BF16 → GMEM
- ~25KB SMEM for all HD values
- All HD=16/64/128/256 pass with cos 0.999997+
- LBO = BLOCK_MN * 16 (bytes), SBO = 128 (bytes) for K-major NONE
- Canonical SMEM layout: column-major interleaving of core matrices
- idesc is a SEPARATE 32-bit value (was using desc_a>>32 = WRONG)
- idesc encodes dtype/atype/btype/MMA_M/MMA_N
- Passing desc_a>>32 as idesc was the root cause of the 'misaligned address' errors
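
One way to read the K-major SWIZZLE_NONE strides quoted above (LBO = BLOCK_MN * 16 bytes, SBO = 128 bytes) is as an address function over 8-row by 16-byte core matrices. The sketch below is an interpretation, not code from the kernel: `kmajor_none_offset` is a hypothetical helper, and it assumes SBO steps between 8-row groups while LBO steps between 16-byte chunks along K.

```cpp
#include <cstddef>

constexpr std::size_t kCoreRows = 8;       // rows per core matrix
constexpr std::size_t kCoreRowBytes = 16;  // bytes per core-matrix row

// Byte offset of element (row, k) of a K-major, non-swizzled operand.
// Assumptions: sbo_bytes is the step between consecutive 8-row groups,
// lbo_bytes is the step between consecutive 16-byte chunks along K.
constexpr std::size_t kmajor_none_offset(std::size_t row, std::size_t k,
                                         std::size_t elem_bytes,
                                         std::size_t lbo_bytes,
                                         std::size_t sbo_bytes) {
    const std::size_t k_byte = k * elem_bytes;
    return (row / kCoreRows) * sbo_bytes          // which 8-row group
         + (k_byte / kCoreRowBytes) * lbo_bytes   // which 16-byte K chunk
         + (row % kCoreRows) * kCoreRowBytes      // row inside the core matrix
         + (k_byte % kCoreRowBytes);              // byte inside the row
}
```

With BLOCK_MN = 128 and BF16 elements, this gives lbo_bytes = 2048 and sbo_bytes = 128: all 128 rows of one 16-byte K chunk are stored back to back, and the next K chunk starts 2048 bytes later. Under that reading a 128 x 64 BF16 tile fills exactly 16384 bytes with no gaps.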
Canonical UMMA layout for SWIZZLE_NONE:
- MN-major (128, 64): LBO=16, SBO=128 (from logical_divide Tile(1,8))
- K-major (128, 64): LBO=16, SBO=32 (from logical_divide Tile(8,2))
Using simple row-major SMEM layout (no swizzle, no permutation).
Data is written directly to SMEM in row-major order.
The descriptor strides describe the canonical layout.
Proper implementation of the SMEM layout that tcgen05.mma expects:
- SWIZZLE_128B (layout_type=2) for both MN-major A and K-major B
- Swizzle<3,4,3> applied to element offsets before SMEM write
- MN_SW128 atom: (1024, 8) BF16, stride (1, 1024)
- K_SW128 atom: (8, 1024) BF16, stride (1, 8)
- umma_smem_write/read functions for both MN and K major
- Descriptor with correct leading_byte_offset and stride_byte_offset
This is the RIGHT WAY. No shortcuts.
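
For readers unfamiliar with the Swizzle<3,4,3> notation, the XOR it denotes can be sketched as follows. This mirrors the usual CuTe definition of Swizzle<B,M,S> for non-negative S (take B bits starting at bit M+S and XOR them into the B bits starting at bit M); the free function `swizzle` is a hypothetical stand-in, and whether the offset is counted in bytes or in elements is left to the caller, since the message above only says "element offsets".

```cpp
#include <cstdint>

// Swizzle<B, M, S> for S >= 0: XOR bits [M+S, M+S+B) into bits [M, M+B).
// The source bits are left untouched, so applying it twice is the identity.
template <int B, int M, int S>
constexpr std::uint32_t swizzle(std::uint32_t offset) {
    static_assert(B >= 0 && M >= 0 && S >= B,
                  "source and destination bit ranges must not overlap");
    constexpr std::uint32_t src_mask = ((1u << B) - 1u) << (M + S);
    return offset ^ ((offset & src_mask) >> S);
}
```

For Swizzle<3,4,3> the source bits are 7..9 and the destination bits are 4..6, so offsets below 128 are unchanged and the same function serves for both the write and the read path.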
The descriptor bitfield is completely different from what I assumed:
- [0,14) start_address (smem_ptr >> 4)
- [16,30) leading_byte_offset (row stride bytes >> 4)
- [32,46) stride_byte_offset
- [46,48) version = 1 (Blackwell)
- [61,64) layout_type (0=NONE, 2=128B, 4=64B, 6=32B)
- idescE = desc >> 32, passed as separate arg to MMA PTX
The 64-bit descriptor uses byte offsets (not log2 or element counts).
The upper 32 bits are reinterpreted by the MMA hardware as idescE.
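
Packing the fields listed above into a 64-bit value might look like the sketch below. It follows the bit ranges exactly as written in this message; `make_umma_smem_desc` and `desc_field` are hypothetical names, and shifting stride_byte_offset right by 4 (like the other two byte fields) is an assumption, since the list does not say so explicitly.

```cpp
#include <cstdint>

// Build a shared-memory matrix descriptor from the bit ranges listed above.
// Assumption: stride_byte_offset is stored >> 4, like start_address and
// leading_byte_offset. All three fields are 14 bits wide.
constexpr std::uint64_t make_umma_smem_desc(std::uint32_t smem_addr,
                                            std::uint32_t lbo_bytes,
                                            std::uint32_t sbo_bytes,
                                            std::uint32_t layout_type) {
    return  static_cast<std::uint64_t>((smem_addr >> 4) & 0x3FFFu)          // [0,14)
         | (static_cast<std::uint64_t>((lbo_bytes >> 4) & 0x3FFFu) << 16)   // [16,30)
         | (static_cast<std::uint64_t>((sbo_bytes >> 4) & 0x3FFFu) << 32)   // [32,46)
         | (std::uint64_t{1} << 46)                                         // [46,48) version = 1
         | (static_cast<std::uint64_t>(layout_type & 0x7u) << 61);          // [61,64)
}

// Extract `width` bits starting at bit `lo`; handy for checking a descriptor.
constexpr std::uint32_t desc_field(std::uint64_t desc, int lo, int width) {
    return static_cast<std::uint32_t>((desc >> lo) & ((std::uint64_t{1} << width) - 1));
}
```

Reading the fields back with `desc_field` is a cheap host-side check before a descriptor is handed to the MMA instruction.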
Step 1 of tensor-core acceleration:
- fmha_umma_desc.cuh: UMMA SMEM descriptor construction (raw bitfield)
- fmha_qk_verify.cuh: QK GEMM using tcgen05.mma SS (SMEM A, SMEM B → TMEM C)
- test_qk_mma.cu: standalone test comparing MMA output vs CPU reference
Key design decisions:
- UMMA descriptors built from raw bitfield (no CuTe dependency)
- tcgen05.mma called by one lane per warp (elect_one_sync pattern)
- Q: (128, HD) MN-major, K: (128, HD) K-major (transposed via descriptor)
- S: (128, 128) in TMEM, row 0 read back via tcgen05.ld
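
Elsewhere in this log, agreement with a reference is reported as a cosine similarity (for example "cos 0.999997+"). A host-side comparison of that kind could be sketched as below; `cosine_similarity` is a hypothetical helper, not the function in test_qk_mma.cu, and accumulating in double is a choice made here for stability.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between a kernel output and a CPU reference.
// Returns 0.0 for mismatched sizes, empty inputs, or an all-zero vector.
inline double cosine_similarity(const std::vector<float>& got,
                                const std::vector<float>& ref) {
    if (got.size() != ref.size() || got.empty()) return 0.0;
    double dot = 0.0, norm_got = 0.0, norm_ref = 0.0;
    for (std::size_t i = 0; i < got.size(); ++i) {
        const double g = got[i];
        const double r = ref[i];
        dot += g * r;
        norm_got += g * g;
        norm_ref += r * r;
    }
    if (norm_got == 0.0 || norm_ref == 0.0) return 0.0;
    return dot / (std::sqrt(norm_got) * std::sqrt(norm_ref));
}
```

Cosine similarity ignores a uniform scale error, so a check like this is usually paired with a maximum absolute or relative error bound.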
Updated fmha_common.cuh, fmha_sm100.cuh, fmha_epilogue_sm100.cuh,
and fmha_sm100_launch.cuh with comprehensive heredoc headers explaining:
1. The 4 CuTeDSL gaps that forced us to fall back to raw CUDA C++:
- TMEM round-trip broken (Ld32x32bOp/St32x32bOp column mismatch)
- Float→int impossible (arith.fptosi not lowerable)
- epilogue_tma_store blocks multi-CTA
- hd=512 MLIR optimizer hangs
2. TMEM lane mapping (verified on B200):
- Lane i → positions i*4+0..3, 128 FP32 per column
- Warp-collective: ALL 32 lanes must call ld/st or HANG
- Column address = tmem_base + column_index
3. Key insight for NVIDIA: float→int gap is the single most
impactful limitation, blocking ALL quantization-epilogue fusion
Removed all dead code from the first (broken) attention loop approach.
Clean pipeline: SMEM attention → TMEM write → TMEM read → normalize → GMEM.
Also renamed sPvBuf to sO for clarity (same as reference kernel).
CRITICAL FIX: tcgen05.st 16x256b.x1.b32 is warp-collective where:
- Lane i writes to positions i*4+0..i*4+3 within the column
- 32 lanes × 4 FP32 = 128 FP32 per column
- For row 0: lane 0 = positions 0-3, lane 1 = 4-7, ..., lane 31 = 124-127
Old code iterated col = lane; col < N; col += 32, treating each lane
as owning a separate column. That was WRONG — all 32 lanes share each
column, each owning 4 positions within it.
New code: HD values need ceil(HD/128) columns. Lane i writes
sPvBuf[i*4+0..3] to column 0 and, for HD > 128, sPvBuf[128+i*4+0..3] to column 1.
Verified via test_tmem_lane_mapping.cu on B200.
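
The index arithmetic described above can be written down as a small host-side sketch. The names `tmem_columns_for`, `TmemSlot`, and `tmem_slot_for` are hypothetical; the mapping itself (4 FP32 per lane, 32 lanes, 128 FP32 per column) is taken from this message, and placing values 128 and up in the next column is an assumption about how the HD > 128 case is laid out.

```cpp
constexpr int kLanesPerWarp = 32;
constexpr int kFp32PerLane = 4;
constexpr int kFp32PerColumn = kLanesPerWarp * kFp32PerLane;  // 128

// Number of TMEM columns needed for one row of HD values: ceil(HD / 128).
constexpr int tmem_columns_for(int hd) {
    return (hd + kFp32PerColumn - 1) / kFp32PerColumn;
}

// Where element `index` of the row lives.
struct TmemSlot {
    int column;  // which TMEM column
    int lane;    // which lane of the warp owns it
    int slot;    // which of that lane's 4 values (0..3)
};

constexpr TmemSlot tmem_slot_for(int index) {
    return TmemSlot{index / kFp32PerColumn,
                    (index % kFp32PerColumn) / kFp32PerLane,
                    index % kFp32PerLane};
}
```

For example, element 127 maps to column 0, lane 31, slot 3, which matches "lane 31 = 124-127" above.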
Root cause of TMEM epilogue hang: tmem_store/tmem_load are
warp-collective operations requiring ALL 32 lanes to participate.
The loop 'for (col = lane; col < TMEM_O_COLS; col += WARP)' with
TMEM_O_COLS=16 and WARP=32 means only lanes 0-15 execute the op.
Lanes 16-31 skip it = warp divergence on collective = HANG.
Fix: loop over TMEM_N (>= 32, power of 2) so all 32 lanes
participate. Iterations whose column index is at or beyond TMEM_O_COLS
write don't-care data to allocated-but-unused TMEM columns.
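
The participation argument can be checked on the host by counting loop iterations per lane. This is a model of the loop shape quoted above, not kernel code; `trip_count`, `warp_uniform`, and `padded_tmem_n` are hypothetical helpers.

```cpp
constexpr int kWarp = 32;

// Iterations lane `lane` executes in: for (col = lane; col < n_cols; col += kWarp)
constexpr int trip_count(int lane, int n_cols) {
    return lane < n_cols ? (n_cols - lane + kWarp - 1) / kWarp : 0;
}

// True when every lane runs the same non-zero number of iterations, i.e. the
// warp-collective op inside the loop is always reached by all 32 lanes.
constexpr bool warp_uniform(int n_cols) {
    const int first = trip_count(0, n_cols);
    if (first == 0) return false;
    for (int lane = 1; lane < kWarp; ++lane) {
        if (trip_count(lane, n_cols) != first) return false;
    }
    return true;
}

// Smallest power of two that is >= 32 and >= n_cols.
constexpr int padded_tmem_n(int n_cols) {
    int n = kWarp;
    while (n < n_cols) n *= 2;
    return n;
}
```

With n_cols = 16, lanes 16-31 get a trip count of 0, which is the divergence described above; after padding to 32 every lane runs exactly once.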
Key fixes for fmha_epilogue_sm100.cuh hang:
- tcgen05.ld/st are WARP-COLLECTIVE: ALL 32 lanes must execute
- Old code guarded TMEM ops with if(tid==0) = warp divergence = HANG
- tmem_dealloc now uses tmem_base (value from alloc), not SMEM pointer
- Compute attention in SMEM, then do one-way TMEM pipeline:
SMEM → TMEM (warp-collective store) → regs (warp-collective load)
→ normalize in regs → BF16 cast → GMEM
- This demonstrates that the MoE-style one-way correction epilogue works on FMHA
Also: enable TMEM kernel test + hd=128 in standalone test
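
The last two pipeline stages (normalize in registers, then cast to BF16) can be mimicked on the host for reference checks. This is a sketch under assumptions: "normalize" is taken to mean dividing each accumulator by the softmax row sum, the cast uses round-to-nearest-even, and both function names are hypothetical rather than taken from fmha_epilogue_sm100.cuh.

```cpp
#include <cstdint>
#include <cstring>

// Convert an FP32 value to BF16 bits with round-to-nearest-even.
inline std::uint16_t float_to_bf16_bits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    // NaN: keep it a NaN by forcing a mantissa bit that survives truncation.
    if ((bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0u) {
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    // Add 0x7FFF plus the lowest kept bit so exact ties round to even.
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// Assumed meaning of "normalize": scale one output row by 1 / row_sum,
// then cast each value to BF16.
inline void normalize_row_to_bf16(const float* acc, float row_sum,
                                  std::uint16_t* out, int hd) {
    const float inv = 1.0f / row_sum;
    for (int i = 0; i < hd; ++i) {
        out[i] = float_to_bf16_bits(acc[i] * inv);
    }
}
```

A host reference like this makes it possible to compare the epilogue output bit for bit, provided the device cast uses the same rounding mode.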
What changed:
- Moved fmha_backup_pre_epilog.py, fmha_backup_v2.py, fmha_smem_acc.py to archive/
- Deleted fmha.py.backup (git has history)
- Added detailed heredoc headers to ALL files documenting:
* WHAT WORKS and WHAT'S BROKEN
* WHY each limitation exists (CuTeDSL toolchain gaps)
* KEY INSIGHTS FOR NVIDIA (what CuTeDSL is missing)
* What each file unblocks if fixed
File status:
fmha.py — CuTeDSL FMHA, cos 0.999998, D1.5 workaround
fmha_common.cuh — Raw CUDA shared defs (BF16, TMEM ops)
fmha_sm100.cuh — Raw CUDA reference, cos 0.999999
fmha_epilogue_sm100.cuh — Raw CUDA TMEM epilogue, HANGS (needs debug)
fmha_sm100_launch.cu — PyTorch binding (JIT broken, nvcc works)
production.py — CuTeDSL production wrapper (partial)
archive/ — Historical backups with explanation headers