nvfp4-megamoe-kernel

Files

biondizzle 71deeb91a9 Quantize BF16 gate weight to NVFP4 for fused router + add global scales to GEMM

CRITICAL: Checkpoint stores gate weights as BF16, not NVFP4.
Previous code fell back to BF16 cuBLAS because weight_scale was missing.
Now we quantize the BF16 gate weight to NVFP4 at load time using
quantize_to_nvfp4() and pass the result to the fused router kernel.

Also added global scale (gsa, gsb) parameters to the kernel:
- gsa (activation global scale) applied during activation quantization
- gsb (weight global scale) applied in epilogue before sqrt(softplus)
- The MMA output is (A * SFA) @ (B * SFB), missing gsa*gsb
- Epilogue now computes sqrt(softplus(logit * gsa * gsb))
  instead of sqrt(softplus(logit))

2026-06-01 10:14:29 +00:00

attention

FMHA sink: don't double-scale sink bias

2026-05-31 23:12:20 +00:00

cache

fix: correct gather.py kernel_dir path

2026-05-30 21:12:09 +00:00

compressor

fix: import torch.utils.cpp_extension explicitly in production_compress

2026-06-01 05:20:44 +00:00

cuda

fix: move compressor position_bias into CUDA kernel (was Python loop)