biondizzle
794ebaf7e5
P4: Fused RMSNorm + NVFP4 quantize kernel (2 launches vs 6+)
- fused_rmsnorm_quantize.cu: two-kernel approach
  Kernel 1 (rmsnorm_amax_gsa): compute RMS + amax of the normalized output → gsa per row
  Kernel 2 (rmsnorm_quantize_nvfp4): normalize + quantize using the GPU-computed gsa
- Python bridge: rmsnorm_quantize_nvfp4() in ops/quantize.py
- Python bridge: dequantize_nvfp4() in ops/quantize.py
- Unit test: test_fused_rmsnorm_quantize.py (production shapes: 7168 hidden)
- Eliminates ~488 kernel launches per token (122 sites × 4 launches saved)
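
Reference sketch of the two-pass scheme in pure Python, for reading alongside the kernels. This is not the code in fused_rmsnorm_quantize.cu or ops/quantize.py. It rests on assumptions that the commit does not state: gsa is taken to be a per-row global scale equal to amax / (6 * 448), blocks hold 16 elements, block scales stay as Python floats instead of being rounded to FP8 E4M3, and FP4 values are kept as decoded floats rather than packed nibbles. The *_ref names, the eps default, and the weight argument are hypothetical.

```python
import math

# Magnitudes representable in FP4 E2M1.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MAX = 6.0
E4M3_MAX = 448.0   # largest finite FP8 E4M3 value
BLOCK = 16         # assumed NVFP4 block size


def rmsnorm_amax_gsa_ref(row, weight, eps=1e-6):
    """Pass 1: reduce one row to (inv_rms, gsa). Assumed gsa definition."""
    mean_sq = sum(x * x for x in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # amax is taken over the normalized output, not the raw input.
    amax = max(abs(x * inv_rms * w) for x, w in zip(row, weight))
    gsa = amax / (FP4_MAX * E4M3_MAX) if amax > 0.0 else 1.0
    return inv_rms, gsa


def round_to_fp4(v):
    """Clamp to +/-6 and snap to the nearest grid value.
    Ties go to the smaller magnitude here; hardware may round differently."""
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(v)))
    return math.copysign(mag, v)


def rmsnorm_quantize_nvfp4_ref(row, weight, inv_rms, gsa):
    """Pass 2: recompute the normalized row, then quantize block by block."""
    y = [x * inv_rms * w for x, w in zip(row, weight)]
    codes, scales = [], []
    for start in range(0, len(y), BLOCK):
        block = y[start:start + BLOCK]
        block_amax = max(abs(v) for v in block)
        # Block scale relative to gsa, so it stays within the E4M3 range.
        scale = block_amax / (FP4_MAX * gsa) if block_amax > 0.0 else 1.0
        scales.append(scale)
        codes.extend(round_to_fp4(v / (scale * gsa)) for v in block)
    return codes, scales


def dequantize_nvfp4_ref(codes, scales, gsa):
    """Inverse mapping: code * block scale * global scale."""
    return [c * scales[i // BLOCK] * gsa for i, c in enumerate(codes)]


# Usage on a small row (the production shape in the unit test is 7168 wide).
row = [math.sin(i) * (1 + i % 5) for i in range(32)]
weight = [1.0] * len(row)
inv_rms, gsa = rmsnorm_amax_gsa_ref(row, weight)
codes, scales = rmsnorm_quantize_nvfp4_ref(row, weight, inv_rms, gsa)
restored = dequantize_nvfp4_ref(codes, scales, gsa)
```

Pass 2 recomputes the normalized values instead of storing them, so only two scalars per row (inv_rms and gsa) have to cross between the passes. That is the property that lets the GPU version run as two launches.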
2026-06-02 16:26:24 +00:00