biondizzle
0b6ca0df80
P5 integration + B3 q_a_norm fused + gsa scalar fix
P5: Wire up fused mHC pre_block + RMSNorm + NVFP4 quantize kernel
- Replaces: pre_block bmm + rmsnorm (4+ launches) + quantize (2 launches)
- With: 2 kernel launches (mhc_rmsnorm_amax_gsa + mhc_rmsnorm_quantize_nvfp4)
- Both attn and ffn mHC paths now use P5 fused kernel
- Savings: ~5 launches/site × 2 sites × 61 layers = 610 launches/token
B3: Fused rmsnorm+quant for q_a_norm → q_b path
- q_a output → rmsnorm_quantize_nvfp4 → QuantizedActivation → q_b.run_from_quantized
- Eliminates BF16 round-trip between q_a_norm and q_b GEMM
- Saves: ~6 kernel launches per layer (rmsnorm 4+ + quantize 2 vs fused 2)
gsa scalar fix in Nvfp4Linear.run_from_quantized:
- CuTeDSL NVFP4 GEMM expects global_scale_a as per-expert scalar (shape (1,))
- Per-row gsa from fused kernels must be reduced to scalar (max) for M>1
- For M=1 decode: already scalar, no reduction needed
- Fixes potential correctness issue at prefill (M>1) when using fused paths
Cleanup: Remove --ab-compare flag and A/B comparison code (replaced by P5)
2026-06-02 21:20:34 +00:00
..
2026-05-21 17:30:44 +00:00
2026-06-01 22:21:12 +00:00
2026-06-02 21:20:34 +00:00
2026-06-02 10:44:53 +00:00
2026-06-02 07:59:57 +00:00
2026-06-01 14:21:16 +00:00
2026-06-02 08:41:00 +00:00