|
|
5e8347836f
|
NVFP4-1.1: working BF16→FP4 quantize kernel (cos 0.979)
- Standalone CuTeDSL kernel using cute.arch.load/store
- 1 CTA per row, 32 threads/CTA
- BF16 load via Uint16 bitcast
- FP8 E4M3 scale output (100% match)
- FP4 packed nibble output (cos 0.979 vs Python ref)
- Uses absf + arithmetic max/min (CuTeDSL ternary limitation)
- Step 2 of SwiGLU FP4 fusion pipeline
|
2026-05-25 08:58:19 +00:00 |
|