nvfp4-megamoe-kernel

Files

biondizzle 2b1fca6dae CRITICAL FIX: runtime activation global scale to prevent E4M3 overflow

The checkpoint's input_scale was designed for training-time FP8 quantization,
not NVFP4 activation quantization. Using it as gsa pushes the per-block scales
derived from x/gsa past the E4M3 maximum (448), so they saturate and every
projection systematically loses magnitude. The loss accumulates over 61 layers,
compressing the logit range and producing garbage tokens.

Fix: compute gsa at runtime from actual activation magnitude:
  gsa = max(|x|) / (6.0 * 448.0)
This ensures |x|/gsa ≤ 2688 = 6.0 * 448.0, so every block scale (block amax / 6.0)
stays within the E4M3 maximum of 448.

Applied to: Nvfp4Linear, Nvfp4GroupedLinear, Nvfp4MoE, Nvfp4SharedExpert, Router gate
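
A minimal NumPy sketch of the runtime scale described above, not the repository's actual code. The function names, the 16-element block size, and the stale scale value 1e-3 are assumptions chosen for illustration. It only checks the scale arithmetic and does not perform real FP4/E4M3 encoding.

```python
import numpy as np

FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 (E2M1)
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3


def runtime_global_scale(x, eps=1e-12):
    """gsa = max(|x|) / (6.0 * 448.0), computed from the live activation."""
    amax = float(np.max(np.abs(x)))
    # eps guards against an all-zero activation (an assumption of this sketch).
    return max(amax, eps) / (FP4_E2M1_MAX * FP8_E4M3_MAX)


def block_scales(x, gsa, block_size=16):
    """Per-block scale before E4M3 rounding: amax(block / gsa) / 6.0."""
    blocks = np.abs(np.asarray(x, dtype=np.float64)).reshape(-1, block_size)
    return (blocks / gsa).max(axis=1) / FP4_E2M1_MAX


rng = np.random.default_rng(0)
x = rng.standard_normal(64) * 50.0  # synthetic activation, 4 blocks of 16

gsa = runtime_global_scale(x)
scales = block_scales(x, gsa)            # largest entry lands at about 448

stale_gsa = 1e-3                         # hypothetical checkpoint-style scale
stale_scales = block_scales(x, stale_gsa)  # far above 448, would saturate

print(scales.max(), stale_scales.max())
```

With the runtime scale, the block holding the largest activation gets a scale of about 448 and all other blocks fall below it. With the fixed hypothetical scale, the block scales land well above the E4M3 range.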

2026-06-01 14:21:16 +00:00

cache

E1: Wire LayerCacheHandle gather methods + CUDA gather kernels

2026-05-30 21:09:21 +00:00

kernels

Switch router to Nvfp4Linear production GEMM (custom CuTeDSL kernel crashes MLIR)

2026-06-01 11:17:54 +00:00

layers

CRITICAL FIX: runtime activation global scale to prevent E4M3 overflow

2026-06-01 14:21:16 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering