Commit Graph

388 Commits

Author SHA1 Message Date
0f6907b001 UMMA: fix descriptor + idesc — use gau-nernst tutorial values
- LBO = BLOCK_MN * 16 (bytes), SBO = 128 (bytes) for K-major NONE
- Canonical SMEM layout: column-major interleaving of core matrices
- idesc is SEPARATE 32-bit value (was using desc_a>>32 = WRONG)
- idesc encodes dtype/atype/btype/MMA_M/MMA_N
- This was the root cause of 'misaligned address' errors
2026-05-28 09:18:45 +00:00
427410d94a UMMA: Rewrite fmha_umma_desc.cuh with correct K-major core-matrix layout + minimal QK GEMM test
- Core-matrix layout: each 8x8 BF16 tile (128B) contiguous in SMEM
- K-major NONE descriptor: LBO=1 (16B), SBO=block_k/8, lbo_mode=0
- MMA K-tiling: tcgen05.mma uses K=16 per call, tile for hd>16
- write_smem_kmajor: converts row-major to core-matrix layout
- write_smem_ktile: extracts single K-tile in core-matrix layout
- test_umma_qk.cu: minimal hd=16, sk=128 test (single MMA call)
- Previous UMMA descriptors were wrong (row-major SMEM, wrong LBO/SBO)
2026-05-28 09:15:40 +00:00
e5ba0ca119 debug: clean QK verify with scalar sanity + MMA result 2026-05-28 08:53:35 +00:00
a04d794979 debug: skip TMEM alloc — test SMEM loads only 2026-05-28 08:49:37 +00:00
72c97f2546 debug: minimal UMMA descriptor (just start_addr + version) 2026-05-28 08:48:01 +00:00
9a51bfa578 fix: align SMEM layout properly (128B aligned tmem + Q) 2026-05-28 08:46:56 +00:00
c64bd7b875 debug: read Q/K directly from SMEM 2026-05-28 08:43:39 +00:00
58b610c96c fix: proper early return for SMEM load test 2026-05-28 08:41:30 +00:00
82bc2c4a49 debug: verify SMEM loads + scalar QK sanity check 2026-05-28 08:40:16 +00:00
53139d24bf debug: verify TMEM r/w works before MMA 2026-05-28 08:39:12 +00:00
a9d71ff6ab debug: print TMEM values after MMA 2026-05-28 08:38:08 +00:00
bfb1e177ce debug: try all-lane MMA + print tmem_base 2026-05-28 08:37:02 +00:00
d3510980e4 feat: SWIZZLE_NONE UMMA descriptors with row-major SMEM
Canonical UMMA layout for SWIZZLE_NONE:
- MN-major (128, 64): LBO=16, SBO=128 (from logical_divide Tile(1,8))
- K-major (128, 64): LBO=16, SBO=32 (from logical_divide Tile(8,2))

Using simple row-major SMEM layout (no swizzle, no permutation).
Data is written directly to SMEM in row-major order.
The descriptor strides describe the canonical layout.
2026-05-28 08:35:30 +00:00
ab84ad0f86 feat: implement canonical UMMA SMEM layout with SWIZZLE_128B
Proper implementation of the SMEM layout that tcgen05.mma expects:
- SWIZZLE_128B (layout_type=2) for both MN-major A and K-major B
- Swizzle<3,4,3> applied to element offsets before SMEM write
- MN_SW128 atom: (1024, 8) BF16, stride (1, 1024)
- K_SW128 atom: (8, 1024) BF16, stride (1, 8)
- umma_smem_write/read functions for both MN and K major
- Descriptor with correct leading_byte_offset and stride_byte_offset

This is the RIGHT WAY. No shortcuts.
2026-05-28 08:18:47 +00:00
ecbc75255c fix: correct UMMA descriptor format from CUTLASS source
The descriptor bitfield is completely different from what I assumed:
- [0,14) start_address (smem_ptr >> 4)
- [16,30) leading_byte_offset (row stride bytes >> 4)
- [32,46) stride_byte_offset
- [46,48) version = 1 (Blackwell)
- [61,64) layout_type (0=NONE, 2=128B, 4=64B, 6=32B)
- idescE = desc >> 32, passed as separate arg to MMA PTX

The 64-bit descriptor uses byte offsets (not log2 or element counts).
The upper 32 bits are reinterpreted by the MMA hardware as idescE.
2026-05-28 08:07:52 +00:00
fe7d561143 debug: print UMMA descriptor values for diagnosis 2026-05-28 08:03:53 +00:00
c5f7a9a15c fix: align SMEM buffers to 16 bytes for UMMA descriptors 2026-05-28 08:02:53 +00:00
7436315309 feat: add tcgen05.mma QK GEMM verification kernel + test
Step 1 of tensor-core acceleration:
- fmha_umma_desc.cuh: UMMA SMEM descriptor construction (raw bitfield)
- fmha_qk_verify.cuh: QK GEMM using tcgen05.mma SS (SMEM A, SMEM B → TMEM C)
- test_qk_mma.cu: standalone test comparing MMA output vs CPU reference

Key design decisions:
- UMMA descriptors built from raw bitfield (no CuTe dependency)
- tcgen05.mma called by one lane per warp (elect_one_sync pattern)
- Q: (128, HD) MN-major, K: (128, HD) K-major (transposed via descriptor)
- S: (128, 128) in TMEM, row 0 read back via tcgen05.ld
2026-05-28 08:00:42 +00:00
6fb3d54c02 docs: update here-docs with CuTeDSL rationale for NVIDIA
Updated fmha_common.cuh, fmha_sm100.cuh, fmha_epilogue_sm100.cuh,
and fmha_sm100_launch.cuh with comprehensive here-docs explaining:

1. The 4 CuTeDSL gaps that forced us to raw CUDA C++:
   - TMEM round-trip broken (Ld32x32bOp/St32x32bOp column mismatch)
   - Float→int impossible (arith.fptosi not lowerable)
   - epilogue_tma_store blocks multi-CTA
   - hd=512 MLIR optimizer hangs

2. TMEM lane mapping (verified on B200):
   - Lane i → positions i*4+0..3, 128 FP32 per column
   - Warp-collective: ALL 32 lanes must call ld/st or HANG
   - Column address = tmem_base + column_index

3. Key insight for NVIDIA: float→int gap is the single most
   impactful limitation, blocking ALL quantization-epilogue fusion
2026-05-28 07:54:01 +00:00
446a0ca9fd refactor(tmem): clean rewrite of TMEM epilogue kernel
Removed all dead code from the first (broken) attention loop approach.
Clean pipeline: SMEM attention → TMEM write → TMEM read → normalize → GMEM.

Also renamed sPvBuf to sO for clarity (same as reference kernel).
2026-05-28 07:49:03 +00:00
c989dc78d9 debug: print sPvBuf[32] value 2026-05-28 07:47:37 +00:00
b50f6a8512 debug: add TMEM read diagnostic 2026-05-28 07:46:15 +00:00
579dd061cd fix: remove duplicate TMEM_COLS_NEEDED declarations 2026-05-28 07:43:54 +00:00
278f1b34af fix(tmem): correct lane-to-position mapping for tcgen05.ld/st
CRITICAL FIX: tcgen05.st 16x256b.x1.b32 is warp-collective where:
- Lane i writes to positions i*4+0..i*4+3 within the column
- 32 lanes × 4 FP32 = 128 FP32 per column
- For row 0: lane 0 = positions 0-3, lane 1 = 4-7, ..., lane 31 = 124-127

Old code iterated col = lane; col < N; col += 32, treating each lane
as owning a separate column. That was WRONG — all 32 lanes share each
column, each owning 4 positions within it.

New code: HD values need ceil(HD/128) columns. Lane i writes
sPvBuf[i*4+0..3] to column 0 (or column 1 for HD > 128).

Verified via test_tmem_lane_mapping.cu on B200.
2026-05-28 07:43:40 +00:00
33cedbee0a fix(tmem): TMEM ld/st are warp-collective — ALL 32 lanes must call them
Root cause of TMEM epilogue hang: tmem_store/tmem_load are
warp-collective operations requiring ALL 32 lanes to participate.

The loop 'for (col = lane; col < TMEM_O_COLS; col += WARP)' with
TMEM_O_COLS=16 and WARP=32 means only lanes 0-15 execute the op.
Lanes 16-31 skip it = warp divergence on collective = HANG.

Fix: loop over TMEM_N (>= 32, power of 2) so all 32 lanes
participate. Columns beyond TMEM_O_COLS write don't-care data
to allocated-but-unused TMEM columns.
2026-05-28 07:41:16 +00:00
cea02fe407 fix: add cstdio for printf in TMEM debug 2026-05-28 07:40:04 +00:00
0ddcc6bafd debug: add printf to TMEM kernel to find hang point 2026-05-28 07:39:53 +00:00
2eb44a00bf fix(tmem): warp-collective TMEM ops + one-way correction epilogue
Key fixes for fmha_epilogue_sm100.cuh hang:
- tcgen05.ld/st are WARP-COLLECTIVE: ALL 32 lanes must execute
- Old code guarded TMEM ops with if(tid==0) = warp divergence = HANG
- tmem_dealloc now uses tmem_base (value from alloc), not SMEM pointer
- Compute attention in SMEM, then do one-way TMEM pipeline:
  SMEM → TMEM (warp-collective store) → regs (warp-collective load)
  → normalize in regs → BF16 cast → GMEM
- This proves the MoE-style one-way correction epilogue on FMHA

Also: enable TMEM kernel test + hd=128 in standalone test
2026-05-28 07:27:25 +00:00
4fe9bbab48 add back in the archived code 2026-05-28 07:04:59 +00:00
4336de9372 attention/: Clean up folder, archive backups, add detailed status headers
What changed:
- Moved fmha_backup_pre_epilog.py, fmha_backup_v2.py, fmha_smem_acc.py to archive/
- Deleted fmha.py.backup (git has history)
- Added detailed heredoc headers to ALL files documenting:
  * WHAT WORKS and WHAT'S BROKEN
  * WHY each limitation exists (CuTeDSL toolchain gaps)
  * KEY INSIGHTS FOR NVIDIA (what CuTeDSL is missing)
  * What each file unblocks if fixed

File status:
  fmha.py                 — CuTeDSL FMHA, cos 0.999998, D1.5 workaround
  fmha_common.cuh         — Raw CUDA shared defs (BF16, TMEM ops)
  fmha_sm100.cuh          — Raw CUDA reference, cos 0.999999
  fmha_epilogue_sm100.cuh — Raw CUDA TMEM epilogue, HANGS (needs debug)
  fmha_sm100_launch.cu    — PyTorch binding (JIT broken, nvcc works)
  production.py           — CuTeDSL production wrapper (partial)
  archive/                — Historical backups with explanation headers
2026-05-28 07:01:33 +00:00
a391615f60 fix: uint64_t for SMEM pointer 2026-05-28 06:39:19 +00:00
b4779e3f48 fix: cvta.to.shared.u64 for 64-bit SMEM pointers 2026-05-28 06:37:52 +00:00
cf264bd0e2 fix: cvta.shared.u32 (not cvta.to.shared) 2026-05-28 06:36:50 +00:00
771799e112 FMHA SM100: Fix TMEM operations — uint32_t registers, correct PTX syntax
TMEM load/store uses b32 (uint32_t) registers, NOT float.
Bitcast float↔uint32_t for FP32 TMEM values.
TMEM alloc takes SMEM pointer (not a return value).
TMEM column addressing: col + row_group * tmem_n.
2026-05-28 06:35:50 +00:00
e173295a3a FMHA SM100: Refactor into common + reference + TMEM epilogue headers
- fmha_common.cuh: BF16, TMEM ops, warp reductions (shared)
- fmha_sm100.cuh: Phase 1 reference (SMEM-based, cos 0.999999)
- fmha_epilogue_sm100.cuh: Phase 2 TMEM+correction epilogue (Priority 2)
- Test both kernels at hd=64 and hd=128
2026-05-28 06:31:05 +00:00
bcc5d0b6cb FMHA SM100: Add TMEM+correction epilogue kernel (Priority 2)
New file: fmha_epilogue_sm100.cuh
- TMEM alloc/dealloc/load/store via tcgen05 PTX
- One-way correction epilogue: TMEM→regs→normalize→BF16→GMEM
- D1.5 fix: O rescale in REGISTERS (TMEM→regs→multiply→TMEM)
- Same pattern as MoE epilogue but with normalize instead of SwiGLU
- Unblocks D2 multi-CTA and NVFP4-1.2 (register slot for FP4 pack)

Test: hd=64 + hd=128, reference vs TMEM kernels
2026-05-28 06:27:56 +00:00
8eb735618f fix: use expf for softmax (not exp2f with scale) 2026-05-28 05:34:03 +00:00
3cb339129b FMHA SM100: Fix Phase 1 — single-thread reference for correctness
Use thread 0 for all computation (slow but correct).
SMEM for Q and O sharing across threads.
Online softmax with O rescale — correct D1.5 approach.
D3 SWA mask implemented.
Target: cos ~0.999998 then parallelize.
2026-05-28 05:32:47 +00:00
77fa34a9a6 fix: update launch wrapper for fmha_decode_ref 2026-05-28 05:28:49 +00:00
00ac46c9d3 FMHA SM100: Phase 1 — reference scalar implementation
Simpler approach first: scalar Q@K^T, softmax, P@V in registers.
No TMEM/MMA yet — verify correctness first, then replace with tcgen05.

- 192-thread CTA, all threads cooperate on one (batch, head)
- Online softmax with O rescale (correct D1.5 approach)
- D3 SWA mask, D4 causal (TODO), D5c sink (TODO)
- KV loaded in blocks of 128 for SMEM efficiency
- Correctness target: cos ~0.999998 against PyTorch reference
2026-05-28 05:27:36 +00:00
6f7449ce71 FMHA SM100: Fix tcgen05.mma PTX syntax — correct register constraints
- tcgen05.mma.cta_group::1.kind::f16 [tmem_c], desc_a, desc_b, idescE_hi, scaleC, {mask0..3}, pred
- idescE is upper 32 bits of the E descriptor
- scaleC is a float (1.0 for accumulate)
- mask is 4 uint32 values (0xFFFFFFFF for no masking)
2026-05-28 05:25:59 +00:00
a11a245307 fix: use unsigned short for BF16 storage, inline PTX for conversions 2026-05-28 05:24:32 +00:00
373900fa08 FMHA SM100: Fix launch wrapper to match new kernel API 2026-05-28 05:20:31 +00:00
a30ebfb197 FMHA SM100: Full kernel with TMET PTX, UMMA descriptors, softmax loop
- TMEM alloc/dealloc/load/store via inline PTX (tcgen05.*)
- UMMA SMEM descriptor construction (make_umma_desc)
- QK GEMM via tcgen05.mma.kind::f16 inline asm
- Online softmax with D3/D4/D5c masks
- O rescale in REGISTERS (D1.5 fix — no TMEM round-trip!)
- FP4 quantize helpers (hs2e2m1, fp8_e4m3_encode)
- Still needs: PV GEMM, proper P staging, TMEM O load/store
2026-05-28 05:19:34 +00:00
09dfd4a41f fix: rename .cpp to .cu for CUDA compilation 2026-05-28 05:16:41 +00:00
48baea7728 FMHA SM100: Remove CUTLASS includes, write raw PTX inline asm
CUTLASS headers transitively include cuda_bf16.h which has a CUDA 13.2
in_place_from bug. Writing tcgen05 PTX directly via inline asm instead.
No dependencies on CUTLASS C++ — pure PTX + CUDA runtime.
2026-05-28 05:15:07 +00:00
88d5995ec9 fix: define bf16_t using __bf16 built-in, avoid cuda_bf16.h bug 2026-05-28 05:14:01 +00:00
6bd3356582 fix: include cuda_bf16.h unconditionally, add --expt-relaxed-constexpr 2026-05-28 05:13:01 +00:00
c1266b5275 fix: include cuda_bf16.h only in device code 2026-05-28 05:12:30 +00:00
a64e55665b fix: avoid cuda_bf16.h, use inline PTX for BF16 conversion 2026-05-28 05:12:08 +00:00