Files
nvfp4-megamoe-kernel/tests/unit
biondizzle 794ebaf7e5 P4: Fused RMSNorm + NVFP4 quantize kernel (2 launches vs 6+)
- fused_rmsnorm_quantize.cu: two-kernel approach
  Kernel 1: rmsnorm_amax_gsa — compute RMS + amax of normalized output → gsa per row
  Kernel 2: rmsnorm_quantize_nvfp4 — normalize + quantize using GPU-computed gsa
- Python bridge: rmsnorm_quantize_nvfp4() in ops/quantize.py
- Python bridge: dequantize_nvfp4() in ops/quantize.py
- Unit test: test_fused_rmsnorm_quantize.py (production shapes: 7168 hidden)
- Eliminates ~488 kernel launches per token (122 sites × 4 launches saved)
2026-06-02 16:26:24 +00:00
..
2026-05-28 16:28:58 +00:00
2026-05-28 16:28:58 +00:00
2026-05-28 16:28:58 +00:00
2026-05-28 16:28:58 +00:00
2026-05-30 03:46:38 +00:00
2026-05-28 16:28:58 +00:00
2026-05-28 15:59:22 +00:00
2026-05-28 15:59:22 +00:00
2026-05-28 15:59:22 +00:00
2026-05-28 15:59:22 +00:00
2026-05-28 15:46:53 +00:00
2026-05-28 15:59:22 +00:00
2026-05-28 15:55:59 +00:00
2026-05-28 15:59:22 +00:00
2026-05-28 15:59:22 +00:00
2026-05-28 14:38:03 +00:00
2026-06-02 09:43:45 +00:00
2026-06-01 15:04:46 +00:00
2026-06-01 15:04:46 +00:00
2026-05-28 14:40:55 +00:00
2026-05-28 14:33:31 +00:00
2026-05-28 16:36:53 +00:00
2026-05-28 17:00:20 +00:00
2026-05-28 16:39:45 +00:00
2026-05-28 16:42:24 +00:00
2026-05-28 15:51:55 +00:00
2026-05-28 15:49:47 +00:00
2026-05-28 15:48:15 +00:00
2026-05-28 15:54:05 +00:00
2026-05-28 11:39:15 +00:00