Files
nvfp4-megamoe-kernel/tests
biondizzle 454dbdad52 P5: Fused mHC pre_block + RMSNorm + NVFP4 quantize kernel
- fused_mhc_rmsnorm_quantize.cu: 2-kernel approach
  Kernel 1: mhc_rmsnorm_amax_gsa — bmm + RMS + amax → gsa
  Kernel 2: mhc_rmsnorm_quantize_nvfp4 — bmm + normalize + quantize
- Python bridge: mhc_rmsnorm_quantize_nvfp4() in ops/quantize.py
- Unit test: test_fused_mhc_rmsnorm_quantize.py (production shapes)
- Eliminates ~610 kernel launches per token (122 sites × 5 launches saved)
2026-06-02 16:39:42 +00:00
..
2026-05-30 21:22:34 +00:00
2026-05-31 20:11:37 +00:00