nvfp4-megamoe-kernel/tests/layertest.py at 50e9b5da81477e000e4770e94de9d175204f5ff0

biondizzle aa8563c626 Fused SwiGLU epilogue with granularity-8 weight interleave

- Fix interleave_l1_weights: remove //2 bug (g=granularity_bf16 for N-axis)
- Apply L1 weight+SF interleave in runner._ensure_stacked() and moe_pipeline
- De-interleave L1 GEMM output before gate/up split
- Fused SwiGLU kernel: epi_tile=(128,8) for subtile-level pairing
  - Even subtiles = gate: SiLU in FP32 registers, save to register buffer
  - Odd subtiles = up: silu(gate)*up from buffer
  - Both branches produce same BF16 tensor type (CuTeDSL constraint)
- run_nvfp4_moe_fused() pipeline: fused L1 + PyTorch L2
- Runner: fused_swiglu=True option for CuTeDSLMoERunner
- Layertest: both fused and non-fused paths PASS (cosine 0.988)
- README.md updated with current status and lessons learned
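
The interleave and subtile pairing described above can be sketched in plain Python. This is an illustration only, not the repository's kernel or its `interleave_l1_weights`: the helper names (`interleave_rows`, `deinterleave`, `fused_swiglu`, `reference_swiglu`) are hypothetical, and it assumes the un-interleaved layout stores all gate entries first and all up entries second, which the commit message does not state.

```python
import math


def interleave_rows(rows, g=8):
    """Reorder [gate..., up...] into alternating g-sized gate/up blocks.

    Assumption: the first half of `rows` is gate, the second half is up.
    """
    half = len(rows) // 2
    if len(rows) % 2 or half % g:
        raise ValueError("length must be an even multiple of g")
    out = []
    for start in range(0, half, g):
        out.extend(rows[start:start + g])                # gate block
        out.extend(rows[half + start:half + start + g])  # matching up block
    return out


def deinterleave(values, g=8):
    """Inverse of interleave_rows: recover the [gate..., up...] layout."""
    if len(values) % (2 * g):
        raise ValueError("length must be a multiple of 2 * g")
    gate, up = [], []
    for start in range(0, len(values), 2 * g):
        gate.extend(values[start:start + g])
        up.extend(values[start + g:start + 2 * g])
    return gate + up


def silu(x):
    return x / (1.0 + math.exp(-x))


def fused_swiglu(values, g=8):
    """Walk g-wide subtiles of one interleaved output row.

    Even subtiles are gate: apply SiLU and keep the result in a buffer.
    Odd subtiles are up: multiply by the buffered silu(gate).
    """
    out = []
    gate_buf = None
    for sub, start in enumerate(range(0, len(values), g)):
        tile = values[start:start + g]
        if sub % 2 == 0:
            gate_buf = [silu(v) for v in tile]
        else:
            out.extend(a * b for a, b in zip(gate_buf, tile))
    return out


def reference_swiglu(values):
    """Non-fused path: split [gate..., up...] in half, then silu(gate) * up."""
    half = len(values) // 2
    return [silu(a) * b for a, b in zip(values[:half], values[half:])]
```

With this layout, every odd subtile finds its gate partner in the subtile immediately before it, so a buffer of one subtile's width is enough. Running `fused_swiglu` on the interleaved row should match `reference_swiglu` on the original row, and `deinterleave` recovers the original ordering for the non-fused gate/up split.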

2026-05-20 04:13:52 +00:00

12 KiB
