nvfp4-megamoe-kernel/CURRENT_BUG.md
biondizzle b3451c74f8 Update README and CURRENT_BUG.md with current state
- README: updated NVFP4 coverage table, status, and plan
- CURRENT_BUG.md: full debugging journey, what works, what's next
- Both reflect decision to build our own CuTeDSL kernels
2026-05-18 20:05:03 +00:00


Current State: Building our own NVFP4 kernels

Status: WIP (shared expert CuTeDSL kernel in progress)
Date: 2026-05-18

What happened today

Spent the day debugging why vLLM produces empty/garbage output. The journey:

  1. NaN from layer 0 — diagnostic prints showed NaN from the very first layer
  2. MoE kernel is fine — standalone test: cosine 0.988, no NaN
  3. Root cause: FlashInferCutlassNvFp4LinearKernel uses a broken input_scale. The checkpoint values cause 3977x amplification during activation quantization, which produces NaN.
  4. BF16 dequant fix — dequantize NVFP4 weights to BF16, replace quant method
  5. process_weights_after_loading timing — our fix runs inside load_weights(), but vLLM's quant method runs AFTER. Fix gets overwritten.
  6. Post-quant hook approach — forward pre-hooks don't fire (torch.compile + model wrappers bypass them)
  7. Patched utils.py — added _post_quant_fix() call at end of process_weights_after_loading. This works — 305 projections dequantized to BF16.
  8. Still garbage — even with 183 attention + 122 shared expert projections in BF16, output is still empty.
  9. Conclusion: the fault is not limited to one kernel. FlashInferCutlassNvFp4LinearKernel is untrustworthy on B200 (the same class of C++ CUTLASS FP4 bugs we hit with MoE), and since dequantizing all 305 projections to BF16 still yields empty output, something else in vLLM's execution path is also broken.
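
Steps 5 to 7 come down to ordering: a fix applied during load_weights() is overwritten by whatever runs afterward, so the fix has to run last. Below is a minimal sketch of that idea using a stand-in class rather than vLLM's real one. `FakeLinear`, `run_after`, and the string weights are hypothetical; only the names process_weights_after_loading and _post_quant_fix come from the notes above.

```python
import functools


class FakeLinear:
    """Stand-in for a quantized layer; not vLLM's real class."""

    def __init__(self):
        self.weight = "nvfp4"

    def process_weights_after_loading(self):
        # Mimics the quant method overwriting whatever was set earlier.
        self.weight = "nvfp4-repacked"


def _post_quant_fix(layer):
    # Hypothetical fix: swap in the BF16-dequantized weight.
    layer.weight = "bf16"


def run_after(method, fix):
    """Wrap `method` so that `fix(self)` always runs after it returns."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        fix(self)
        return result

    return wrapper


# Too early: the fix is applied first and then overwritten.
early = FakeLinear()
_post_quant_fix(early)
early.process_weights_after_loading()

# Patched: the fix is chained onto the end of the method.
FakeLinear.process_weights_after_loading = run_after(
    FakeLinear.process_weights_after_loading, _post_quant_fix
)
late = FakeLinear()
late.process_weights_after_loading()

print(early.weight, late.weight)
```

The early instance ends up with the repacked weight and the late one keeps the BF16 weight, which mirrors why the fix only stuck once it was called at the end of process_weights_after_loading.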

Decision: Build our own NVFP4 kernels for shared experts and attention. Same CuTeDSL approach that works for MoE. Stop fighting vLLM's broken kernels.

Confirmed Working

| Component | Kernel | Status |
| --- | --- | --- |
| MoE experts (384/layer) | CuTeDSL ScaledGroupedGemm | cosine 0.988, cudagraph-safe |
| All NVFP4 weights | Dequant to BF16 | Valid output in standalone test |
| Full attention weight chain | BF16 matmul | No NaN, no zeros |
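
For reference, the "Dequant to BF16" row amounts to decoding each 4-bit E2M1 code and multiplying by its per-block scale and the per-tensor scale. The sketch below is plain Python for illustration, not the code used in the repo. It assumes two codes per byte with the low nibble first, block scales already decoded from FP8 to float, and that weight_scale_2 plays the role of the per-tensor scale. `decode_e2m1` and `dequant_row` are hypothetical names.

```python
# E2M1 magnitudes indexed by the low 3 bits; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 uses one scale per 16 elements


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code to a Python float."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def dequant_row(packed, block_scales, global_scale):
    """Dequantize one row of packed FP4 bytes.

    packed: bytes, two codes per byte (ASSUMED: low nibble first).
    block_scales: one float per 16 elements (already decoded from FP8).
    global_scale: per-tensor float (ASSUMED to be weight_scale_2).
    """
    out = []
    for i, byte in enumerate(packed):
        for k, nibble in enumerate((byte & 0xF, byte >> 4)):
            idx = 2 * i + k
            scale = block_scales[idx // BLOCK_SIZE] * global_scale
            out.append(decode_e2m1(nibble) * scale)
    return out


# One block of 16 elements holding codes 0..15, packed low nibble first.
packed = bytes((2 * j) | ((2 * j + 1) << 4) for j in range(8))
row = dequant_row(packed, block_scales=[2.0], global_scale=0.5)
print(row)
```

With a combined scale of 1.0 the row is just the E2M1 value table, positive codes first and then their negatives, which makes it a convenient sanity check for the nibble order.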

In Progress

| Component | Kernel | Status |
| --- | --- | --- |
| Shared experts | CuTeDSL GEMM (1 group) | 🔧 Runner WIP, scale assembly needs fixing |
| Attention projections | CuTeDSL GEMM | 📋 Next after shared experts |

WIP: Shared Expert CuTeDSL Kernel

Files:

  • cutedsl/shared_expert_pipeline.py — dedicated runner (needs scale assembly fix)
  • tests/test_shared_expert.py — standalone test

Issue: Reusing the MoE runner with num_experts=1 fails because the MoE runner's scatter assumes hidden_size != HC_DIM. It calls output.scatter_add_, which expects expert output of shape [tokens, hidden_size], but the shared expert operates on HC_DIM (28672).

Fix needed: Dedicated runner with correct scale assembly for num_groups=1. The MoE runner's _assemble_scales_cudagraph_safe is the template. For a single group:

  • No expert offsets needed
  • No scatter needed (all tokens go to the same expert)
  • Scale assembly is just: quantize activation → pad to 128-row alignment → Blackwell swizzle
  • Simpler than the MoE case
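
The single-group steps above can be sketched in plain Python. This is a rough illustration only, not the runner's code. It assumes the 128x4 scale-tile layout documented for cuBLAS block-scaled GEMMs, where element (r, c) of a tile lands at offset (r % 32) * 16 + (r // 32) * 4 + c; the layout the CuTeDSL kernel actually expects may differ. FP8 rounding of the scales and the global scale are omitted, and `block_scales` and `pad_and_swizzle` are hypothetical names.

```python
ROW_TILE = 128  # rows per scale tile
COL_TILE = 4    # scale columns per tile


def ceil_div(a, b):
    return (a + b - 1) // b


def block_scales(row, block=16, fp4_max=6.0):
    """One scale per 16-element block: amax / 6 (FP8 rounding omitted)."""
    return [max(abs(v) for v in row[i:i + block]) / fp4_max
            for i in range(0, len(row), block)]


def pad_and_swizzle(scales):
    """Pad a [rows][cols] scale matrix to 128x4 tiles and swizzle each tile.

    ASSUMED layout: tiles are stored row-major, and within a tile element
    (r, c) lands at offset (r % 32) * 16 + (r // 32) * 4 + c.
    """
    rows, cols = len(scales), len(scales[0])
    row_tiles = ceil_div(rows, ROW_TILE)
    col_tiles = ceil_div(cols, COL_TILE)
    tile_elems = ROW_TILE * COL_TILE
    out = [0.0] * (row_tiles * col_tiles * tile_elems)  # padding stays zero
    for r in range(rows):
        for c in range(cols):
            tile = (r // ROW_TILE) * col_tiles + (c // COL_TILE)
            tr, tc = r % ROW_TILE, c % COL_TILE
            offset = (tr % 32) * 16 + (tr // 32) * 4 + tc
            out[tile * tile_elems + offset] = scales[r][c]
    return out


# 3 tokens with hidden size 32 give a [3][2] scale matrix, padded to one tile.
acts = [[float(t + 1)] * 32 for t in range(3)]
scales = [block_scales(row) for row in acts]
flat = pad_and_swizzle(scales)
print(len(flat))
```

Because there is only one group, there is no per-expert offset table to build: the whole token batch is one padded, swizzled buffer.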

Plan

Phase 1: Shared Expert Kernel (WIP)

  1. Fix shared_expert_pipeline.py — implement scale assembly for num_groups=1
  2. Test with test_shared_expert.py — target cosine ≥ 0.98 vs BF16 reference
  3. Add cudagraph test
  4. Wire into vLLM via DeepseekV4MoE.forward()
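
The cosine target in step 2 can be checked with a small helper like the one below. It is a sketch on flat Python lists, not the contents of tests/test_shared_expert.py. `cosine_similarity` and `check_against_reference` are hypothetical names, and treating NaN or all-zero output as an outright failure is an assumption based on the failure modes described earlier.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero output as a failure, not a match
    return dot / (norm_a * norm_b)


def check_against_reference(out, ref, threshold=0.98):
    """Return (passed, cosine); NaN anywhere in `out` fails outright."""
    if any(math.isnan(v) for v in out):
        return False, float("nan")
    cos = cosine_similarity(out, ref)
    return cos >= threshold, cos


ref = [1.0, 2.0, 3.0, 4.0]
noisy = [1.1, 1.9, 3.2, 3.9]
print(check_against_reference(noisy, ref))
print(check_against_reference([0.0] * 4, ref))
```

Checking for NaN and zeros separately matters here because both showed up as "empty/garbage output" during debugging, and a bare cosine would hide which one occurred.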

Phase 2: Attention NVFP4 Kernel

  • Each attention projection is a standard NVFP4 GEMM
  • fused_wqa_wkv has dual weight_scale_2 (same as MoE gate+up)
  • Handle wo_a — currently FP8, could stay FP8 or go native NVFP4
  • Test each projection individually, then integrate
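
One way to picture the dual weight_scale_2 case: if the fused weight concatenates two projections along the output dimension, each half of the GEMM output needs its own global scale. The sketch below shows only that idea in plain Python. The concatenation order, the split point, and the name `apply_dual_global_scale` are assumptions, not details confirmed by the checkpoint.

```python
def apply_dual_global_scale(raw_out, split, scale_a, scale_b):
    """Rescale a fused GEMM output, one global scale per half.

    raw_out: [tokens][n_a + n_b] values before global scaling.
    split: n_a, the column where the second projection starts (ASSUMED).
    scale_a, scale_b: the two per-projection global scales.
    """
    return [
        [v * (scale_a if col < split else scale_b) for col, v in enumerate(row)]
        for row in raw_out
    ]


raw = [[1.0, 1.0, 1.0, 1.0, 1.0]]  # 1 token, 2 + 3 fused output columns
scaled = apply_dual_global_scale(raw, split=2, scale_a=0.5, scale_b=2.0)
print(scaled)
```

An alternative design is to fold the ratio of the two global scales into the block scales of one half so the kernel sees a single global scale; which option fits depends on how the MoE gate+up path already handles it.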

Phase 3: Clean Up

  • Remove all BF16 dequant code
  • Remove vllm/patches/utils.py patch
  • Remove _post_quant_fix()
  • All NVFP4 through CuTeDSL, no vLLM FlashInfer kernels

Memory Layout

| Component | NVFP4 Size | BF16 Size | Notes |
| --- | --- | --- | --- |
| Shared expert (per layer) | 33 MB | 66 MB | Small, 2 GB total |
| Attention (per layer) | ~TBD | ~TBD | 5 projections |
| MoE experts (per layer) | ~TBD | ~TBD | 48 experts, stays NVFP4 |