Commit Graph

3 Commits

Author SHA1 Message Date
6504f091ca NVFP4-1.1 Step 3: post-SwiGLU quantization test suite (all PASS)
- Standalone kernel cos 0.979 (128x512)
- Post-SwiGLU quantization cos 0.976 (vs Python 0.995)
- Larger shape cos 0.979 (512x4096)
- FP8 scale match 100% across all tests
- GPU kernel replaces CPU-GPU sync quantize path
- Ready for integration into MoE pipeline
2026-05-25 09:08:01 +00:00
5e8347836f NVFP4-1.1: working BF16→FP4 quantize kernel (cos 0.979)
- Standalone CuTeDSL kernel using cute.arch.load/store
- 1 CTA per row, 32 threads/CTA
- BF16 load via Uint16 bitcast
- FP8 E4M3 scale output (100% match)
- FP4 packed nibble output (cos 0.979 vs Python ref)
- Uses absf + arithmetic max/min (CuTeDSL ternary limitation)
- Step 2 of SwiGLU FP4 fusion pipeline
2026-05-25 08:58:19 +00:00
52d11d7f92 NVFP4-1.1: standalone BF16→FP4 quantize kernel (WIP) + dequantize verification
2026-05-25 03:23:44 +00:00
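
The commits above describe quantizing each row to packed FP4 nibbles with a per-block scale, then checking the result against a Python reference by cosine similarity. The repository's reference code is not shown here, so the following is only a minimal pure-Python sketch of that kind of check. The block size of 16, the low-nibble-first packing order, and all function names are assumptions made for illustration, and the scale is kept as a plain float instead of being rounded to FP8 E4M3 as the kernel does.

```python
import math

# Magnitudes representable by a 4-bit E2M1 value (1 sign bit + 3 magnitude bits).
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed block size; the commit messages do not state it


def quantize_block(values):
    """Quantize one block to 4-bit codes and return (scale, codes).

    The scale maps the block's absolute maximum onto 6.0, the largest FP4
    magnitude. It stays a Python float here (no FP8 E4M3 rounding).
    """
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        mag = abs(v) / scale
        # Nearest representable magnitude; ties resolve to the lower index.
        idx = min(range(8), key=lambda i: abs(FP4_LEVELS[i] - mag))
        sign = 1 if v < 0 else 0
        codes.append((sign << 3) | idx)
    return scale, codes


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (assumed layout)."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def dequantize_block(scale, packed):
    """Unpack nibbles and rescale them back to floats."""
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            mag = FP4_LEVELS[code & 7] * scale
            out.append(-mag if code & 8 else mag)
    return out


def cosine(a, b):
    """Cosine similarity between two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Round-trip one synthetic block and compare it with the original values.
row = [3.0 * math.sin(0.37 * i) for i in range(BLOCK)]
scale, codes = quantize_block(row)
packed = pack_nibbles(codes)
restored = dequantize_block(scale, packed)
similarity = cosine(row, restored)
```

A real comparison would run the same round trip on the kernel's packed output and scale bytes, so the `cos` figures in the commit messages also include the effect of E4M3 scale rounding, which this sketch leaves out.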