nvfp4-megamoe-kernel

Files

biondizzle 882d4996ff Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj)

The B200 container crashes in DeepGEMM's fp8_einsum (t.dim() == N assertion
in layout.hpp:39) when processing wo_a (o-projection first half) in the
attention layer. The crash is caused by scale tensor dimension mismatch
for the SM100 recipe (1, 1, 128).

Instead of fighting DeepGEMM, replace the entire wo_a path with our own
CuTeDSL NVFP4 kernel:

1. inverse_rope_bf16() — Python implementation of inverse RoPE
   (replaces fused_inv_rope_fp8_quant CUDA kernel)
2. CuTeDSLNvfp4WoA — NVFP4 grouped linear for wo_a using
   ScaledGroupedGemm with n_local_groups=8 groups
3. wo_a weight quantized to NVFP4 instead of FP8 (native NVFP4,
   no conversion to another quantization)

Changes:
- cutedsl/inverse_rope.py: BF16 inverse RoPE (conjugate rotation)
- cutedsl/wo_a_grouped_linear.py: CuTeDSL NVFP4 grouped GEMM for wo_a
- vllm/patches/deepseek_v4_attention.py: Use NVFP4 path when runner
  is initialized, keep DeepGEMM fallback
- vllm/patches/deepseek_v4.py: Init NVFP4 runner instead of FP8 quant
- tests/test_wo_a.py: Unit test for inverse RoPE + wo_a GEMM

2026-05-19 02:36:30 +00:00

cudagraph_test.py

fix: test L2 weight N dim should be hidden_size, not hidden_size//2

2026-05-16 19:07:36 +00:00

debug_output.py

Update CURRENT_BUG.md: current status, outstanding garbage output issue, hypotheses

2026-05-17 16:52:40 +00:00

layertest.py

restore: new bridge/moe_pipeline/layertest

2026-05-16 19:55:19 +00:00

requirements.txt

test: add standalone layer 0 comparison test (no vLLM, no Docker)