nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Commit Graph

Author

SHA1

Message

Date

biondizzle

6504f091ca

NVFP4-1.1 Step 3: post-SwiGLU quantization test suite (all PASS)

- Standalone kernel cos 0.979 (128x512)
- Post-SwiGLU quantization cos 0.976 (vs Python 0.995)
- Larger shape cos 0.979 (512x4096)
- FP8 scale match 100% across all tests
- GPU kernel replaces CPU-GPU sync quantize path
- Ready for integration into MoE pipeline

2026-05-25 09:08:01 +00:00
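
The "cos" figures and the "scale match 100%" figure in commit 6504f091ca read like a cosine similarity between dequantized kernel output and a reference, plus an element-wise equality rate over scale bytes. The commit does not show the test code, so the sketch below is only an illustration of those two metrics; the function names `cosine_similarity` and `match_rate` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for an all-zero input
    return dot / (norm_a * norm_b)


def match_rate(a, b):
    """Fraction of positions where two equal-length sequences agree exactly."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 1.0
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)
```

In a test suite of this kind, `cosine_similarity` would typically be applied to the flattened dequantized tensor and the reference tensor, and `match_rate` to the raw scale bytes.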

biondizzle

5e8347836f

NVFP4-1.1: working BF16→FP4 quantize kernel (cos 0.979)

- Standalone CuTeDSL kernel using cute.arch.load/store
- 1 CTA per row, 32 threads/CTA
- BF16 load via Uint16 bitcast
- FP8 E4M3 scale output (100% match)
- FP4 packed nibble output (cos 0.979 vs Python ref)
- Uses absf + arithmetic max/min (CuTeDSL ternary limitation)
- Step 2 of SwiGLU FP4 fusion pipeline

2026-05-25 08:58:19 +00:00
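
The bullets in commit 5e8347836f can be sketched in plain Python as a CPU reference. This is not the CuTeDSL kernel and makes several assumptions the commit message does not confirm: a block size of 16 elements per scale, the even-indexed element stored in the low nibble, a row length that is a multiple of the block size, and scales kept as ordinary floats instead of being rounded to FP8 E4M3. The `arith_max` helper shows one plausible reading of "absf + arithmetic max/min", namely a branch-free maximum built from an absolute value. All function names are hypothetical.

```python
import struct

# Magnitudes representable by FP4 E2M1 (1 sign bit + 3-bit magnitude code).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0
BLOCK = 16  # assumed block size; not stated in the commit message


def bf16_bits_to_float(bits):
    """Widen a BF16 bit pattern (held as a uint16) to a float."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def float_to_bf16_bits(x):
    """Truncate a float to BF16 bits (no rounding; enough for test data)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def arith_max(a, b):
    """Branch-free max via abs: one reading of the 'arithmetic max' bullet."""
    return 0.5 * (a + b + abs(a - b))


def encode_fp4(x):
    """4-bit code (sign << 3 | magnitude index) nearest to x.

    Ties resolve to the smaller magnitude in this sketch.
    """
    mag = min(abs(x), FP4_MAX)
    idx = min(range(8), key=lambda i: abs(FP4_VALUES[i] - mag))
    return (8 if x < 0 else 0) | idx


def quantize_row(bf16_row):
    """Quantize one row of BF16 bit patterns into scales and packed nibbles."""
    vals = [bf16_bits_to_float(b) for b in bf16_row]
    scales, packed = [], bytearray()
    for start in range(0, len(vals), BLOCK):
        block = vals[start:start + BLOCK]
        amax = 0.0
        for v in block:
            amax = arith_max(amax, abs(v))
        scale = amax / FP4_MAX if amax > 0 else 1.0  # float here, not E4M3
        scales.append(scale)
        codes = [encode_fp4(v / scale) for v in block]
        for lo, hi in zip(codes[0::2], codes[1::2]):
            packed.append(lo | (hi << 4))  # assumed nibble order
    return scales, bytes(packed)


def dequantize_row(scales, packed):
    """Invert quantize_row: unpack nibbles and multiply by the block scale."""
    out = []
    for i, byte in enumerate(packed):
        for code in (byte & 0xF, byte >> 4):
            mag = FP4_VALUES[code & 7]
            out.append((-mag if code & 8 else mag) * scales[(2 * i) // BLOCK])
    return out
```

A round trip through `quantize_row` and `dequantize_row` gives the kind of reference output that a GPU kernel's packed nibbles and scales could be compared against.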

biondizzle

52d11d7f92

NVFP4-1.1: standalone BF16→FP4 quantize kernel (WIP) + dequantize verification

2026-05-25 03:23:44 +00:00

3 Commits