Commit Graph

1341 Commits

SHA1 Message Date
cec505ce14 add CUDA test runner script (screen-based, follows harness pattern) 2026-05-28 07:31:41 +00:00
2eb44a00bf fix(tmem): warp-collective TMEM ops + one-way correction epilogue
Key fixes for fmha_epilogue_sm100.cuh hang:
- tcgen05.ld/st are WARP-COLLECTIVE: ALL 32 lanes must execute
- Old code guarded TMEM ops with if(tid==0) = warp divergence = HANG
- tmem_dealloc now uses tmem_base (value from alloc), not SMEM pointer
- Compute attention in SMEM, then do one-way TMEM pipeline:
  SMEM → TMEM (warp-collective store) → regs (warp-collective load)
  → normalize in regs → BF16 cast → GMEM
- This proves the MoE-style one-way correction epilogue on FMHA

Also: enable TMEM kernel test + hd=128 in standalone test
2026-05-28 07:27:25 +00:00
bd16e8fa85 fix: use tcgen05.wait::st/ld instead of nonexistent tcgen05.fence
ROOT CAUSE of TMEM hang: tcgen05.fence.cta_group::1.sync.aligned is
NOT a valid PTX instruction. The correct TMEM ordering primitives are:
- tcgen05.wait::st.sync.aligned (wait for TMEM stores to complete)
- tcgen05.wait::ld.sync.aligned (wait for TMEM loads to complete)

Found in cutlass/arch/barrier.h fence_view_async_tmem_store/load.
2026-05-28 07:12:26 +00:00
ba1e81f2dc test: minimal TMEM isolation test (alloc, store, load, dealloc) 2026-05-28 07:09:06 +00:00
4fe9bbab48 add back in the archived code 2026-05-28 07:04:59 +00:00
4336de9372 attention/: Clean up folder, archive backups, add detailed status headers
What changed:
- Moved fmha_backup_pre_epilog.py, fmha_backup_v2.py, fmha_smem_acc.py to archive/
- Deleted fmha.py.backup (git has history)
- Added detailed heredoc headers to ALL files documenting:
  * WHAT WORKS and WHAT'S BROKEN
  * WHY each limitation exists (CuTeDSL toolchain gaps)
  * KEY INSIGHTS FOR NVIDIA (what CuTeDSL is missing)
  * What each file unblocks if fixed

File status:
  fmha.py                 — CuTeDSL FMHA, cos 0.999998, D1.5 workaround
  fmha_common.cuh         — Raw CUDA shared defs (BF16, TMEM ops)
  fmha_sm100.cuh          — Raw CUDA reference, cos 0.999999
  fmha_epilogue_sm100.cuh — Raw CUDA TMEM epilogue, HANGS (needs debug)
  fmha_sm100_launch.cu    — PyTorch binding (JIT broken, nvcc works)
  production.py           — CuTeDSL production wrapper (partial)
  archive/                — Historical backups with explanation headers
2026-05-28 07:01:33 +00:00
d46ae8b967 test: disable TMEM test (hanging), verify reference still works 2026-05-28 06:46:27 +00:00
e58980f80e fix: increase test timeout for TMEM kernel 2026-05-28 06:41:59 +00:00
a391615f60 fix: uint64_t for SMEM pointer 2026-05-28 06:39:19 +00:00
b4779e3f48 fix: cvta.to.shared.u64 for 64-bit SMEM pointers 2026-05-28 06:37:52 +00:00
cf264bd0e2 fix: cvta.shared.u32 (not cvta.to.shared) 2026-05-28 06:36:50 +00:00
771799e112 FMHA SM100: Fix TMEM operations — uint32_t registers, correct PTX syntax
TMEM load/store uses b32 (uint32_t) registers, NOT float.
Bitcast float↔uint32_t for FP32 TMEM values.
TMEM alloc takes SMEM pointer (not a return value).
TMEM column addressing: col + row_group * tmem_n.
2026-05-28 06:35:50 +00:00
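Note on 771799e112: the float↔uint32_t bitcast it describes can be illustrated on the host. This is a minimal sketch, not the repository's code. The helper names `float_to_bits` and `bits_to_float` are hypothetical, and device code would more likely use the CUDA intrinsics `__float_as_uint` / `__uint_as_float`.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers (names are not from the repository): reinterpret an
// FP32 value as a 32-bit pattern and back, without changing any bits.
static_assert(sizeof(float) == sizeof(std::uint32_t), "FP32 must be 32 bits");

inline std::uint32_t float_to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));  // well-defined, unlike pointer casts
    return u;
}

inline float bits_to_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```

The round trip is lossless, so an FP32 accumulator can pass through a b32 register and come back unchanged.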
73d1e38129 fix: last HD→HD_val 2026-05-28 06:32:55 +00:00
e940786fd5 fix: HD_val variable name in test 2026-05-28 06:32:01 +00:00
e173295a3a FMHA SM100: Refactor into common + reference + TMEM epilogue headers
- fmha_common.cuh: BF16, TMEM ops, warp reductions (shared)
- fmha_sm100.cuh: Phase 1 reference (SMEM-based, cos 0.999999)
- fmha_epilogue_sm100.cuh: Phase 2 TMEM+correction epilogue (Priority 2)
- Test both kernels at hd=64 and hd=128
2026-05-28 06:31:05 +00:00
a73fb689f9 fix: dispatch template HD at compile time 2026-05-28 06:29:10 +00:00
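Note on a73fb689f9: dispatching a runtime head dimension onto a template parameter usually looks like the sketch below. `run_fmha` and `dispatch_hd` are hypothetical stand-ins, and the supported values 64 and 128 are assumed from the test sizes mentioned elsewhere in this log.

```cpp
#include <stdexcept>

// Hypothetical stand-in for a kernel launcher templated on head dimension.
template <int HD>
int run_fmha() {
    return HD;  // a real launcher would start the kernel here
}

// Map the runtime head dimension onto a compile-time template argument.
inline int dispatch_hd(int hd) {
    switch (hd) {
        case 64:  return run_fmha<64>();
        case 128: return run_fmha<128>();
        default:  throw std::invalid_argument("unsupported head dimension");
    }
}
```

Each supported value gets its own instantiation, so loop bounds and array sizes that depend on HD are known to the compiler.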
bcc5d0b6cb FMHA SM100: Add TMEM+correction epilogue kernel (Priority 2)
New file: fmha_epilogue_sm100.cuh
- TMEM alloc/dealloc/load/store via tcgen05 PTX
- One-way correction epilogue: TMEM→regs→normalize→BF16→GMEM
- D1.5 fix: O rescale in REGISTERS (TMEM→regs→multiply→TMEM)
- Same pattern as MoE epilogue but with normalize instead of SwiGLU
- Unblocks D2 multi-CTA and NVFP4-1.2 (register slot for FP4 pack)

Test: hd=64 + hd=128, reference vs TMEM kernels
2026-05-28 06:27:56 +00:00
8eb735618f fix: use expf for softmax (not exp2f with scale) 2026-05-28 05:34:03 +00:00
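Note on 8eb735618f: the commit does not say what went wrong with the exp2f path. For reference, the two forms agree only when the factor log2(e) is folded into the scale, since exp(x) = exp2(x · log2(e)). The function names below are hypothetical, chosen for this sketch.

```cpp
#include <cmath>

// exp(scale * s), written with exp directly.
inline float softmax_weight_exp(float s, float scale) {
    return std::exp(scale * s);
}

// The same quantity via exp2, using exp(x) == exp2(x * log2(e)).
inline float softmax_weight_exp2(float s, float scale) {
    const float kLog2e = 1.4426950408889634f;  // log2(e)
    return std::exp2(s * (scale * kLog2e));
}
```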
3cb339129b FMHA SM100: Fix Phase 1 — single-thread reference for correctness
Use thread 0 for all computation (slow but correct).
SMEM for Q and O sharing across threads.
Online softmax with O rescale — correct D1.5 approach.
D3 SWA mask implemented.
Target: cos ~0.999998 then parallelize.
2026-05-28 05:32:47 +00:00
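Note on 3cb339129b: "online softmax with O rescale" can be sketched on the host for a single query as below. This is an illustration of the general technique, not the kernel's code. The function name and data layout are assumptions, and it assumes all scores are finite (a fully masked first key would need extra handling).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Single-query attention with online softmax (hypothetical host sketch).
// scores[j] is the already-scaled q.k_j logit; values[j] is row j of V.
inline std::vector<float> online_softmax_attention(
    const std::vector<float>& scores,
    const std::vector<std::vector<float>>& values) {
    assert(!scores.empty() && scores.size() == values.size());
    const std::size_t hd = values[0].size();
    std::vector<float> o(hd, 0.0f);  // unnormalized output accumulator
    float m = -std::numeric_limits<float>::infinity();  // running max
    float l = 0.0f;                                      // running sum

    for (std::size_t j = 0; j < scores.size(); ++j) {
        const float m_new = std::max(m, scores[j]);
        const float alpha = std::exp(m - m_new);  // rescale factor for old state
        const float p = std::exp(scores[j] - m_new);
        l = l * alpha + p;
        for (std::size_t d = 0; d < hd; ++d) {
            o[d] = o[d] * alpha + p * values[j][d];  // rescale O, then accumulate
        }
        m = m_new;
    }
    for (float& x : o) x /= l;  // final normalization by the running sum
    return o;
}
```

The result matches a two-pass softmax up to rounding, but each key and value is visited only once.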
7fb838913f fix: include path for standalone test 2026-05-28 05:31:39 +00:00
99b35eb2de test: standalone CUDA test for FMHA SM100 (no PyTorch needed) 2026-05-28 05:31:03 +00:00
77fa34a9a6 fix: update launch wrapper for fmha_decode_ref 2026-05-28 05:28:49 +00:00
00ac46c9d3 FMHA SM100: Phase 1 — reference scalar implementation
Simpler approach first: scalar Q@K^T, softmax, P@V in registers.
No TMEM/MMA yet — verify correctness first, then replace with tcgen05.

- 192-thread CTA, all threads cooperate on one (batch, head)
- Online softmax with O rescale (correct D1.5 approach)
- D3 SWA mask, D4 causal (TODO), D5c sink (TODO)
- KV loaded in blocks of 128 for SMEM efficiency
- Correctness target: cos ~0.999998 against PyTorch reference
2026-05-28 05:27:36 +00:00
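Note on 00ac46c9d3: the "cos" figures in this log read as cosine similarity between the kernel output and a reference output. A minimal sketch of that metric follows. The function name is hypothetical, and accumulating in double precision is an assumption rather than something the log states.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two flattened outputs (hypothetical helper).
// Returns NaN if either vector is all zeros.
inline double cosine_similarity(const std::vector<float>& a,
                                const std::vector<float>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += static_cast<double>(a[i]) * b[i];
        na  += static_cast<double>(a[i]) * a[i];
        nb  += static_cast<double>(b[i]) * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```

Cosine similarity ignores uniform scaling, so a value near 1.0 does not by itself rule out a wrong normalization constant.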
6f7449ce71 FMHA SM100: Fix tcgen05.mma PTX syntax — correct register constraints
- tcgen05.mma.cta_group::1.kind::f16 [tmem_c], desc_a, desc_b, idescE_hi, scaleC, {mask0..3}, pred
- idescE is upper 32 bits of the E descriptor
- scaleC is a float (1.0 for accumulate)
- mask is 4 uint32 values (0xFFFFFFFF for no masking)
2026-05-28 05:25:59 +00:00
a11a245307 fix: use unsigned short for BF16 storage, inline PTX for conversions 2026-05-28 05:24:32 +00:00
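Note on a11a245307: storing BF16 as a 16-bit unsigned integer works because BF16 is the upper half of an FP32 bit pattern. The commit uses inline PTX for the conversions. The sketch below swaps in plain host bit manipulation with round-to-nearest-even, and the names `bf16_bits`, `float_to_bf16`, and `bf16_to_float` are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

using bf16_bits = std::uint16_t;  // hypothetical alias for raw BF16 storage

// FP32 -> BF16 with round-to-nearest-even (host sketch, not the PTX path).
inline bf16_bits float_to_bf16(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    if ((u & 0x7FFFFFFFu) > 0x7F800000u) {  // NaN: keep it a NaN, skip rounding
        return static_cast<bf16_bits>((u >> 16) | 0x0040u);
    }
    u += 0x7FFFu + ((u >> 16) & 1u);  // round to nearest, ties to even
    return static_cast<bf16_bits>(u >> 16);
}

// BF16 -> FP32 is exact: place the 16 bits in the upper half.
inline float bf16_to_float(bf16_bits h) {
    std::uint32_t u = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```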
2d4e2c57e0 auto: pre-test commit 2026-05-28 05:22:23 +00:00
97df02ea07 fix: -Xcompiler -fPIC for nvcc shared library 2026-05-28 05:22:15 +00:00
4dfb71bc20 test: nvcc direct compilation test (avoid torch JIT __bf16 ICE) 2026-05-28 05:21:41 +00:00
373900fa08 FMHA SM100: Fix launch wrapper to match new kernel API 2026-05-28 05:20:31 +00:00
a30ebfb197 FMHA SM100: Full kernel with TMEM PTX, UMMA descriptors, softmax loop
- TMEM alloc/dealloc/load/store via inline PTX (tcgen05.*)
- UMMA SMEM descriptor construction (make_umma_desc)
- QK GEMM via tcgen05.mma.kind::f16 inline asm
- Online softmax with D3/D4/D5c masks
- O rescale in REGISTERS (D1.5 fix — no TMEM round-trip!)
- FP4 quantize helpers (hs2e2m1, fp8_e4m3_encode)
- Still needs: PV GEMM, proper P staging, TMEM O load/store
2026-05-28 05:19:34 +00:00
09dfd4a41f fix: rename .cpp to .cu for CUDA compilation 2026-05-28 05:16:41 +00:00
4c194b7254 fix: add CUDA include path for host compiler 2026-05-28 05:15:48 +00:00
48baea7728 FMHA SM100: Remove CUTLASS includes, write raw PTX inline asm
CUTLASS headers transitively include cuda_bf16.h which has a CUDA 13.2
in_place_from bug. Writing tcgen05 PTX directly via inline asm instead.
No dependencies on CUTLASS C++ — pure PTX + CUDA runtime.
2026-05-28 05:15:07 +00:00
88d5995ec9 fix: define bf16_t using __bf16 built-in, avoid cuda_bf16.h bug 2026-05-28 05:14:01 +00:00
f0660d0bd7 fix: use C++20 for cuda_bf16.h compat 2026-05-28 05:13:18 +00:00
6bd3356582 fix: include cuda_bf16.h unconditionally, add --expt-relaxed-constexpr 2026-05-28 05:13:01 +00:00
c1266b5275 fix: include cuda_bf16.h only in device code 2026-05-28 05:12:30 +00:00
a64e55665b fix: avoid cuda_bf16.h, use inline PTX for BF16 conversion 2026-05-28 05:12:08 +00:00
1734d13f60 fix: restore cuda_bf16.h include 2026-05-28 05:11:39 +00:00
8783a25deb fix: guard cuda_bf16.h with __CUDA_ARCH__ 2026-05-28 05:11:11 +00:00
5e389b5ed9 fix: remove duplicate desc declaration 2026-05-28 05:10:43 +00:00
7ac2499266 fix: defer UMMA descriptor — use placeholder for now 2026-05-28 05:10:15 +00:00
db17d8db9a fix: cvta.to.shared PTX for SMEM address 2026-05-28 05:09:50 +00:00
e12a81ae36 fix: include cstdint 2026-05-28 05:09:28 +00:00
0c73a024ba fix: guard CUTLASS includes with __CUDA_ARCH__ for host compilation 2026-05-28 05:09:07 +00:00
41e59a2423 FMHA SM100: Add SMEM descriptor construction for tcgen05.mma 2026-05-28 05:08:25 +00:00
3eb432d064 fix: CUTLASS path /root/cutlass 2026-05-28 05:06:48 +00:00
66d9f5c60f fix: --x cu for .cuh compilation 2026-05-28 05:06:13 +00:00
4dcd80ea0d fix: use full nvcc path 2026-05-28 05:05:55 +00:00
fac7275f2b test: nvcc compilation test for FMHA SM100 kernel 2026-05-28 05:05:31 +00:00