nvfp4-megamoe-kernel/tests
biondizzle df04ba40ee Stage C: online softmax kernel (WIP) - test_fmha_v3_softmax.py
- C1: Real softmax reference (torch.softmax, not identity)
- C2: Per-thread row_max/row_sum registers
- C3: QK scale folded (1/sqrt(d) * log2(e))
- C4: Row max via .reduce(MAX)
- C5: Rescale factor (exp2(old_max - new_max))
- C6: O rescale in TMEM (correction_rescale pattern)
- C7: Real exp2 for P computation
- C8: Row sum via packed f32x2 reduction
- C9: Final normalization (1/row_sum in epilogue)
- Dynamic s_k for V FMHA reconstruction
- fastmath=False for correctness first
2026-05-21 17:10:58 +00:00
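
The steps C3 through C9 in the commit message follow the usual online softmax recurrence. The sketch below is a CPU-side NumPy illustration of that recurrence, not the kernel in this commit: the function names, the `block_k` parameter, and the tensor shapes are assumptions made for the example, and the TMEM layout and packed f32x2 reduction are not modelled.

```python
import numpy as np

LOG2_E = 1.4426950408889634  # log2(e)


def reference_attention(q, k, v):
    """Plain softmax attention, used as the correctness reference (cf. C1)."""
    d = q.shape[-1]
    s = (q @ k.T) / np.sqrt(d)
    s = s - s.max(axis=-1, keepdims=True)
    p = np.exp(s)
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def online_softmax_attention(q, k, v, block_k=16):
    """Hypothetical block-wise online softmax over the key dimension."""
    s_q, d = q.shape
    # C3: fold 1/sqrt(d) and log2(e) into one scale so exp2 can replace exp.
    scale_log2 = LOG2_E / np.sqrt(d)
    # C2: running per-row statistics (per-thread registers in a kernel).
    row_max = np.full(s_q, -np.inf)
    row_sum = np.zeros(s_q)
    o = np.zeros((s_q, v.shape[-1]))

    for start in range(0, k.shape[0], block_k):
        k_blk = k[start:start + block_k]
        v_blk = v[start:start + block_k]
        s = (q @ k_blk.T) * scale_log2

        # C4: row max over the current block, merged with the running max.
        new_max = np.maximum(row_max, s.max(axis=-1))
        # C5: rescale factor; on the first block this is exp2(-inf) == 0.
        rescale = np.exp2(row_max - new_max)
        # C6: rescale the partial output accumulated so far.
        o *= rescale[:, None]
        # C7: unnormalized probabilities via exp2.
        p = np.exp2(s - new_max[:, None])
        # C8: running row sum, rescaled to the new max.
        row_sum = row_sum * rescale + p.sum(axis=-1)

        o += p @ v_blk
        row_max = new_max

    # C9: final normalization by 1/row_sum.
    return o / row_sum[:, None]
```

Comparing the result against `reference_attention` on random inputs, including a key length that is not a multiple of `block_k`, is one way to check the recurrence before turning on faster math paths.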