nvfp4-megamoe-kernel/dsv4/kernels
biondizzle a30ebfb197 FMHA SM100: Full kernel with TMEM PTX, UMMA descriptors, softmax loop
- TMEM alloc/dealloc/load/store via inline PTX (tcgen05.*)
- UMMA SMEM descriptor construction (make_umma_desc)
- QK GEMM via tcgen05.mma.kind::f16 inline asm
- Online softmax with D3/D4/D5c masks
- O rescale in REGISTERS (D1.5 fix — no TMEM round-trip!)
- FP4 quantize helpers (hs2e2m1, fp8_e4m3_encode)
- Still needs: PV GEMM, proper P staging, TMEM O load/store
2026-05-28 05:19:34 +00:00
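For readers unfamiliar with the "online softmax" and "O rescale in registers" items in the commit message, here is a minimal host-side C++ sketch of the general technique. It is not the kernel's code: `OnlineSoftmaxState`, `online_softmax_update`, and `online_softmax_finish` are hypothetical names, the scalar `acc` stands in for a row of the O accumulator, and masking (the D3/D4/D5c masks) is assumed to arrive as scores already set to negative infinity.

```cpp
#include <cmath>
#include <limits>

// Hypothetical illustration only; names and layout are not from the repository.
// Running state for one query row: max score seen so far, sum of exponentials,
// and the unnormalized output accumulator (a scalar here, a vector in a real kernel).
struct OnlineSoftmaxState {
    float row_max = -std::numeric_limits<float>::infinity();
    float row_sum = 0.0f;
    float acc = 0.0f;
};

// Fold one tile of n scores s[i] with values v[i] into the running state.
// Masked positions are assumed to carry s[i] == -infinity.
inline void online_softmax_update(OnlineSoftmaxState& st,
                                  const float* s, const float* v, int n) {
    float new_max = st.row_max;
    for (int i = 0; i < n; ++i) {
        new_max = std::fmax(new_max, s[i]);
    }
    // Everything seen so far is masked: nothing to accumulate, and skipping
    // avoids computing (-inf) - (-inf), which would be NaN.
    if (std::isinf(new_max) && new_max < 0.0f) {
        return;
    }
    // Rescale the previous sum and accumulator to the new max. On the first
    // tile row_max is -inf, so scale is exp(-inf) == 0 and the state stays zero.
    const float scale = std::exp(st.row_max - new_max);
    st.row_sum *= scale;
    st.acc *= scale;  // the "O rescale" step, applied to a local value
    for (int i = 0; i < n; ++i) {
        const float p = std::exp(s[i] - new_max);  // 0 for masked entries
        st.row_sum += p;
        st.acc += p * v[i];
    }
    st.row_max = new_max;
}

// Final normalization; returns 0 for a fully masked row.
inline float online_softmax_finish(const OnlineSoftmaxState& st) {
    return st.row_sum > 0.0f ? st.acc / st.row_sum : 0.0f;
}
```

Calling `online_softmax_update` once per K/V tile and `online_softmax_finish` at the end should give the same result as a two-pass softmax over the whole row, up to floating-point rounding. In this sketch the accumulator is an ordinary local value that is rescaled in place, which is the idea behind the "no TMEM round-trip" note in the commit message.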