- Add quantize_nvfp4.cu: BF16→FP4 GPU kernel (no CPU sync, warp shuffle amax)
- Add quantize_nvfp4_gpu() bridge in ops/quantize.py
- Fix deinterleave_quantize kernel path (dsv4/ops/kernels → dsv4/kernels/cuda)
- Wire GPU quantize into Nvfp4MoE._run_impl():
  - L1 input: quantize_nvfp4_gpu (replaces quantize_activation_nvfp4)
  - Fused SwiGLU L2: deinterleave_quantize_nvfp4_cuda (single kernel)
  - Non-fused L2: quantize_nvfp4_gpu
- Add test_nvfp4_gpu_quantize.py for both kernels
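
For readers unfamiliar with amax-based FP4 quantization, here is a minimal CPU sketch of the general technique. It is not the code in quantize_nvfp4.cu. It assumes 16-element scaling blocks and the E2M1 value grid, keeps the scale as a plain float (no FP8 encoding or global scale), and uses hypothetical names (quantize_block_fp4, quantize_fp4_reference) that do not exist in the repository.

```python
# Illustrative reference only; names and block size are assumptions.

# Magnitudes representable in FP4 E2M1 (the sign is handled separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumption: one scale per 16 elements


def quantize_block_fp4(block):
    """Return (scale, codes) for one block; codes are signed E2M1 values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        # All-zero block: any scale works, so pick 1.0.
        return 1.0, [0.0] * len(block)
    # Map the block's largest magnitude onto the largest FP4 magnitude.
    scale = amax / E2M1_VALUES[-1]
    codes = []
    for x in block:
        target = abs(x) / scale
        # Nearest grid point; ties resolve to the smaller magnitude here,
        # which may differ from the rounding mode a GPU kernel uses.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        codes.append(-nearest if x < 0 else nearest)
    return scale, codes


def quantize_fp4_reference(values):
    """Quantize a flat list whose length is a multiple of BLOCK_SIZE."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of BLOCK_SIZE")
    scales, codes = [], []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, block_codes = quantize_block_fp4(values[start:start + BLOCK_SIZE])
        scales.append(scale)
        codes.extend(block_codes)
    return scales, codes
```

Multiplying each code by its block's scale gives the dequantized value. A reference of this kind could serve as a tolerance baseline when checking GPU output, provided the rounding and scale encoding are aligned with the kernel first.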