biondizzle
a30ebfb197
FMHA SM100: Full kernel with TMEM PTX, UMMA descriptors, softmax loop
- TMEM alloc/dealloc/load/store via inline PTX (tcgen05.*)
- UMMA SMEM descriptor construction (make_umma_desc)
- QK GEMM via tcgen05.mma.kind::f16 inline asm
- Online softmax with D3/D4/D5c masks
- O rescale in REGISTERS (D1.5 fix — no TMEM round-trip!)
- FP4 quantize helpers (hs2e2m1, fp8_e4m3_encode)
- Still needs: PV GEMM, proper P staging, TMEM O load/store
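
For reference, the online softmax with the O rescale kept in local accumulators can be sketched on the host as below. This is a minimal sketch, not the kernel code: `online_softmax_row`, its parameters, and the one-row-at-a-time layout are hypothetical names chosen for illustration, and the local vector `o` only stands in for the per-thread registers. Masked positions are assumed to arrive as `-inf` scores.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: one query row attending over n keys, processed in tiles.
// scores: n logits (already scaled; masked entries are -inf).
// values: n x d, row-major.
std::vector<float> online_softmax_row(const std::vector<float>& scores,
                                      const std::vector<float>& values,
                                      int d, int tile) {
    assert(d > 0 && tile > 0);
    assert(values.size() == scores.size() * static_cast<std::size_t>(d));

    const float neg_inf = -std::numeric_limits<float>::infinity();
    const int n = static_cast<int>(scores.size());

    float m = neg_inf;              // running row max
    float l = 0.0f;                 // running sum of exp(score - m)
    std::vector<float> o(d, 0.0f);  // unnormalized output, stands in for registers

    for (int start = 0; start < n; start += tile) {
        const int end = std::min(start + tile, n);

        // New running max after seeing this tile.
        float m_new = m;
        for (int j = start; j < end; ++j) m_new = std::max(m_new, scores[j]);
        if (m_new == neg_inf) continue;  // everything so far is masked

        // Rescale the previous sum and accumulator in place.
        // On the first unmasked tile m is -inf, so alpha is exp(-inf) = 0.
        const float alpha = std::exp(m - m_new);
        l *= alpha;
        for (int k = 0; k < d; ++k) o[k] *= alpha;

        // Accumulate this tile's contribution.
        for (int j = start; j < end; ++j) {
            const float p = std::exp(scores[j] - m_new);  // 0 for masked entries
            l += p;
            for (int k = 0; k < d; ++k) o[k] += p * values[j * d + k];
        }
        m = m_new;
    }

    // Final normalization; a fully masked row is left as zeros.
    if (l > 0.0f) {
        for (int k = 0; k < d; ++k) o[k] /= l;
    }
    return o;
}
```

Calling `online_softmax_row(scores, values, d, tile)` with any tile size should match a two-pass softmax followed by the weighted sum over `values`, up to floating-point rounding. The rescale by `alpha` touches only `l` and `o`, which is why the accumulator never needs to leave local storage between tiles.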
2026-05-28 05:19:34 +00:00