nvfp4-megamoe-kernel/dsv4
biondizzle 3cb339129b FMHA SM100: Fix Phase 1 — single-thread reference for correctness
Use thread 0 for all computation (slow but correct).
Stage Q and O in shared memory (SMEM) so they are shared across threads.
Online softmax with O rescaling, the correct D1.5 approach.
Implement the D3 sliding-window attention (SWA) mask.
Target: cosine similarity ~0.999998, then parallelize.
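
For illustration, the online softmax with O rescale, the SWA mask, and the cosine check can be sketched on the host in plain Python. This is not the kernel code. The function names, the causal window convention (a key is visible when it is at or before the query and fewer than `window` positions behind it), and the test sizes are all assumptions made for this sketch.

```python
import math
import random


def swa_visible(q_pos, k_pos, window):
    # Assumed convention: causal sliding window of `window` positions.
    return 0 <= q_pos - k_pos < window


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend_online(q, keys, values, q_pos, window, scale):
    # Single pass over keys, keeping a running max m, running sum l,
    # and an unnormalized output o that is rescaled when m grows.
    m = -math.inf
    l = 0.0
    o = [0.0] * len(values[0])
    for k_pos, (k, v) in enumerate(zip(keys, values)):
        if not swa_visible(q_pos, k_pos, window):
            continue  # masked key contributes nothing
        s = scale * dot(q, k)
        m_new = max(m, s)
        alpha = math.exp(m - m_new)  # rescale factor; 0.0 on the first key
        p = math.exp(s - m_new)
        l = l * alpha + p
        o = [oi * alpha + p * vi for oi, vi in zip(o, v)]
        m = m_new
    # Assumes at least one visible key (true when q_pos < len(keys)).
    return [oi / l for oi in o]


def attend_reference(q, keys, values, q_pos, window, scale):
    # Two-pass softmax over the visible keys, used only as ground truth.
    idx = [j for j in range(len(keys)) if swa_visible(q_pos, j, window)]
    scores = [scale * dot(q, keys[j]) for j in idx]
    mx = max(scores)
    w = [math.exp(s - mx) for s in scores]
    total = sum(w)
    dim = len(values[0])
    return [sum(wi * values[j][d] for wi, j in zip(w, idx)) / total
            for d in range(dim)]


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


# Hypothetical sizes, chosen only for the demo.
rng = random.Random(0)
SEQ, DIM, WINDOW = 16, 8, 4
q = [rng.gauss(0, 1) for _ in range(DIM)]
keys = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(SEQ)]
values = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(SEQ)]
scale = 1.0 / math.sqrt(DIM)

out_online = attend_online(q, keys, values, SEQ - 1, WINDOW, scale)
out_ref = attend_reference(q, keys, values, SEQ - 1, WINDOW, scale)
cos = cosine(out_online, out_ref)
```

In double precision the two outputs should agree to rounding error, so `cos` sits very close to 1. A reduced-precision kernel would be expected to land slightly lower, which is consistent with a target near 0.999998 rather than exactly 1.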
2026-05-28 05:32:47 +00:00