Files
nvfp4-megamoe-kernel/tests
biondizzle a4ef6c3454 Add B1 mixed FP8 prefill FMHA kernel (T>1 support)
New files:
- fmha_mixed_fp8_prefill.cuh: kernel supporting T=1..128
  - Sub-batch processing (T_BATCH=32) to fit in 232KB SMEM
  - Multi-row QK TMEM read using tcgen05.ld.32x32b.x8
  - Per-row online softmax
  - Per-row PV MMA (correctness first; batched PV is TODO)
  - Attention sink support
- fmha_mixed_fp8_prefill_capi.cu: C API bridge
- fmha_mixed_fp8_prefill_op.py: Python ctypes loader
- test_b1_mixed_fp8_prefill.py: unit test (T=1..32, N=128..4096)

Also: fix production FMHA layer test (BF16 fallback for o_a_proj,
router gate BF16 quantize path, missing DEVICE constant)
2026-06-03 02:50:27 +00:00
..
2026-05-31 20:11:37 +00:00