nvfp4-megamoe-kernel/CURRENT_BUG.md
biondizzle e8b289e30d WIP: CuTeDSL shared expert kernel
Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py).
Tried reusing MoE runner with 1 expert — fails because MoE runner assumes
hidden_size != HC_DIM for scatter. Need dedicated runner with correct
scale assembly. Will continue tomorrow.
2026-05-18 20:02:19 +00:00


Current Bug: vLLM produces empty/garbage output

Status: Debugging, plan revised (building our own kernels)
Date: 2026-05-18

Symptom

  • vLLM server starts, loads model, processes requests (200 OK)
  • Chat completions return content: "" with finish_reason: "length"
  • 20 completion tokens are generated, but all of them come from empty/NaN logits

What we know

Confirmed working

  • MoE expert CuTeDSL kernel — cosine 0.988, cudagraph-safe, production-ready
  • All NVFP4 weights dequantize correctly to BF16 — standalone test proves it
  • Full attention weight chain produces valid output (embed → q_a → norm → q_b → o_a → o_b)
  • Post-quant fix runs at the right time — patched utils.py calls _post_quant_fix() after process_weights_after_loading
  • 183 attention projections dequantized to BF16 (61 layers × 3 projs)
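
For reference, the NVFP4 to BF16 dequantization that the standalone test exercises can be sketched in plain Python. This is a minimal illustration, not the repo's test. It assumes the common NVFP4 layout: E2M1 4-bit codes packed two per byte with the low nibble first, one block scale per 16 elements, and a per-tensor `weight_scale_2` applied multiplicatively. The function names are hypothetical.

```python
# E2M1 magnitudes for the 3 low bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # elements sharing one block scale (assumed NVFP4 block size)


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code to a Python float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def unpack_nibbles(packed):
    """Split each byte into two 4-bit codes, low nibble first (assumption)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize_row(packed, block_scales, weight_scale_2):
    """Dequantize one packed row: value = e2m1 * block_scale * weight_scale_2."""
    codes = unpack_nibbles(packed)
    if len(codes) != BLOCK * len(block_scales):
        raise ValueError("expected one block scale per 16 elements")
    return [
        decode_e2m1(code) * block_scales[i // BLOCK] * weight_scale_2
        for i, code in enumerate(codes)
    ]
```

A check like this is only meaningful if the packing order and scale convention match the checkpoint, so both should be verified against the loader before trusting a mismatch.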

Still broken

  • Even with BF16 attention, model produces empty output
  • Shared experts also use FlashInferCutlassNvFp4LinearKernel with broken input_scale
  • Added shared experts to BF16 dequant fix (122 more projections) — testing in progress

🔥 The real problem: vLLM's NVFP4 kernels are untrustworthy on B200

We spent the entire day fighting vLLM's FlashInferCutlassNvFp4LinearKernel:

  • Broken input_scale → NaN
  • process_weights_after_loading timing issues
  • Forward hooks not firing due to torch.compile/model wrappers
  • Dequant-to-BF16 workaround is a bandaid that loses NVFP4 benefits

We could have built our own kernel in the time we spent debugging theirs.
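
The "broken input_scale → NaN" failure mode can be illustrated with a small check. This is a sketch under assumptions: the scale names in the usage note are hypothetical, and the exact way vLLM's kernel consumes `input_scale` is not shown here. It only demonstrates that a single non-finite scale contaminates every output that touches it, and how such scales could be flagged after weight loading.

```python
import math


def find_bad_scales(scales):
    """Return the names of scales that are NaN, infinite, zero, or negative."""
    return [
        name
        for name, value in scales.items()
        if not math.isfinite(value) or value <= 0.0
    ]


def scaled_dot(codes, activations, scale):
    """Dot product of scaled codes with activations; a NaN scale yields NaN."""
    return sum(code * scale * act for code, act in zip(codes, activations))
```

For example, `find_bad_scales({"layer0.input_scale": float("nan")})` would flag that entry, and `scaled_dot` with that scale returns NaN regardless of the inputs.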

Revised Plan: Our Own NVFP4 Kernels

Goal: Replace ALL vLLM NVFP4 kernel paths with our own CuTeDSL implementations. No more FlashInferCutlassNvFp4LinearKernel. No more BF16 dequant workarounds.

Phase 0: Get the BF16 fix working (current)

  • Post-quant BF16 dequant for attention + shared experts
  • Verify the model produces actual text output
  • This is the "make it work" step

Phase 1: CuTeDSL Shared Expert Kernel

Priority: High — shared experts are the last NVFP4 component using vLLM's broken kernel

Files to create:

  • cutedsl/shared_expert_pipeline.py — L1 GEMM → SiLU → re-quant → L2 GEMM
    • Same pattern as MoE but simpler: no routing, no topk, no scatter
    • gate_up_proj already stacked (same as MoE L1)
    • down_proj same as MoE L2
  • vllm/nvfp4_shared_expert.py — runner class
    • Cudagraph-safe (pre-allocated buffers)
    • Warmup-based global scale (gs) computation (same as MoE)
    • Called from DeepseekV4MoE.forward() for shared expert path
  • tests/test_shared_expert.py — standalone test
    • Load shared expert weights from checkpoint
    • CuTeDSL vs BF16 reference (cosine)
    • Cudagraph test
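
The BF16 reference side of that test can be sketched as follows. This is a plain-Python illustration, not the planned test file. It assumes the stacked `gate_up_proj` holds the gate rows first and the up rows second, and that the activation is `silu(gate) * up`; both are assumptions about the layout rather than facts from the checkpoint. The re-quant step between L1 and L2 is omitted because the reference runs in full precision.

```python
import math


def matvec(weight, x):
    """Multiply a row-major [out, in] weight by a vector of length `in`."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def shared_expert_reference(x, gate_up_weight, down_weight):
    """L1 GEMM -> SiLU(gate) * up -> L2 GEMM, all in full precision."""
    inter = len(gate_up_weight) // 2  # assumed split: gate rows, then up rows
    hidden = matvec(gate_up_weight, x)
    gate, up = hidden[:inter], hidden[inter:]
    activated = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(down_weight, activated)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The kernel output would then be compared against `shared_expert_reference` with `cosine_similarity`, in the same way the MoE kernel's cosine figure was obtained.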

Why it's easy: Shared experts are literally MoE with 1 expert and no routing. The CuTeDSL ScaledGroupedGemmKernel with num_groups=1 is just a regular GEMM.
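
The claim that a grouped GEMM with one group reduces to a regular GEMM can be shown with a reference implementation. This is a plain-Python sketch of grouped-GEMM semantics, assuming rows are split into contiguous groups and each group uses its own weight. It is not the `ScaledGroupedGemmKernel` API, and it leaves out scales entirely.

```python
def gemm(a, b):
    """Multiply an (M, K) matrix by a (K, N) matrix, both as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def grouped_gemm(a, group_sizes, weights):
    """Split the rows of `a` into contiguous groups; group g uses weights[g]."""
    if len(group_sizes) != len(weights):
        raise ValueError("need one weight matrix per group")
    if sum(group_sizes) != len(a):
        raise ValueError("group sizes must cover every row of a")
    out = []
    start = 0
    for size, weight in zip(group_sizes, weights):
        out.extend(gemm(a[start:start + size], weight))
        start += size
    return out
```

With `group_sizes=[len(a)]` and a single weight, `grouped_gemm` returns exactly what `gemm` returns, which is the shared expert case.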

Phase 2: CuTeDSL Attention Kernel

Priority: High — attention is the biggest remaining NVFP4 component

Components to handle:

  • fused_wqa_wkv — MergedColumnParallelLinear (q_a + kv fused)
  • wq_b — ColumnParallelLinear (second Q projection)
  • wo_a — currently FP8 via fp8_einsum
  • wo_b — ColumnParallelLinear (output projection)

Design options:

  1. Separate GEMMs — one CuTeDSL GEMM per projection, simplest
  2. Fused attention GEMM — batch all projections together (more complex, more speed)

Recommended: Start with option 1. Each projection is just a standard NVFP4 GEMM. No need to fuse. We can optimize later.

Files to create:

  • cutedsl/attention_pipeline.py — NVFP4 GEMMs for each attention projection
  • vllm/nvfp4_attention.py — runner class
    • Handles q_a_proj, kv_proj, q_b_proj, o_a_proj, o_b_proj
    • Cudagraph-safe
    • Warmup gs for each projection
  • tests/test_attention_nvfp4.py — standalone test
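
The warmup-based gs idea can be sketched as a small tracker: observe the activation magnitudes during warmup, then freeze a single scale that is reused afterwards. This is an illustration only. It assumes "gs" is the NVFP4 global scale and that it follows the common convention `gs = (448 * 6) / amax` (FP8 E4M3 maximum times FP4 E2M1 maximum). The class name is hypothetical and the repo's MoE runner may compute it differently.

```python
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
FP4_E2M1_MAX = 6.0    # largest FP4 E2M1 magnitude


class WarmupGlobalScale:
    """Track the absolute maximum during warmup, then freeze one global scale."""

    def __init__(self):
        self.amax = 0.0
        self.frozen = None

    def observe(self, activations):
        """Update the running amax from one warmup batch (flat list of floats)."""
        if self.frozen is not None:
            raise RuntimeError("scale is already frozen")
        for value in activations:
            self.amax = max(self.amax, abs(value))

    def freeze(self):
        """Fix the scale; after this, no data-dependent work is needed per call."""
        if self.amax == 0.0:
            raise ValueError("no non-zero activations observed during warmup")
        self.frozen = FP8_E4M3_MAX * FP4_E2M1_MAX / self.amax
        return self.frozen
```

Freezing the scale before graph capture is what would keep the path cudagraph-safe, since replay then uses a fixed value instead of recomputing amax from live data.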

Challenge: fused_wqa_wkv has TWO weight_scale_2 values (one for q_a, one for kv). Need to handle dual global scales (same pattern as MoE gate+up with different gs).
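
One way to handle the two scales is to run the fused GEMM once and apply a different scale to each output segment. The sketch below shows that per-segment scaling on one output row in plain Python. The segment widths and the function name are hypothetical, and where the scale is actually applied inside a CuTeDSL kernel (epilogue or otherwise) is left open.

```python
def apply_segment_scales(raw_row, segment_widths, segment_scales):
    """Scale contiguous output segments of a fused projection independently.

    raw_row: one row of fused GEMM output before the per-tensor scale.
    segment_widths: output widths, e.g. [q_a_width, kv_width] (hypothetical).
    segment_scales: one scale per segment, e.g. the two weight_scale_2 values.
    """
    if len(segment_widths) != len(segment_scales):
        raise ValueError("need one scale per segment")
    if sum(segment_widths) != len(raw_row):
        raise ValueError("segment widths must cover the whole row")
    out = []
    start = 0
    for width, scale in zip(segment_widths, segment_scales):
        out.extend(value * scale for value in raw_row[start:start + width])
        start += width
    return out
```

This gives the same result as two separate GEMMs, each with its own scale, followed by concatenation, so it can also serve as the reference when testing a fused kernel.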

Phase 3: Clean up

  • Remove all BF16 dequant code
  • Remove vllm/patches/utils.py patch
  • Remove _post_quant_fix() method
  • All NVFP4 goes through our CuTeDSL kernels
  • BF16 only where it must be (SiLU activation, final scatter, embeddings)

NVFP4 Kernel Coverage (Target)

| Component | Kernel | Status |
|---|---|---|
| MoE experts (L1+L2) | CuTeDSL ScaledGroupedGemm | Working |
| Shared experts (L1+L2) | CuTeDSL standard GEMM | 🔧 Phase 1 |
| Attention projections | CuTeDSL standard GEMM | 🔧 Phase 2 |
| wo_a | CuTeDSL or keep FP8 | 🔧 Phase 2 |
| Compressor | BF16 (small, not worth it) | Done |
| KV cache | FP8 (vLLM, not our concern) | Works |

Config values

| Parameter | Value |
|---|---|
| head_dim | 512 |
| num_attention_heads | 128 |
| num_key_value_heads | 1 |
| q_lora_rank | 1536 |
| qk_rope_head_dim | 64 |
| o_lora_rank | 1024 |
| hc_mult | 4 |
| n_routed_experts | 384 (48 per EP rank) |
| shared expert gate_proj | [3072, 3584] = 11MB NVFP4 / 22MB BF16 |
| shared expert up_proj | [3072, 3584] = 11MB NVFP4 / 22MB BF16 |
| shared expert down_proj | [7168, 1536] = 11MB NVFP4 / 22MB BF16 |
| shared expert total | 33MB NVFP4 / 66MB BF16 per layer, ~2GB / ~4GB total |
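
The per-projection sizes can be checked with simple arithmetic. The sketch below assumes decimal megabytes and ignores block scales. The listed 11MB / 22MB pair is a 2x ratio, while 4-bit versus 16-bit storage of the same logical tensor is 4x, so how the listed shapes should be read (logical elements or packed uint8 bytes holding two FP4 values each) is not confirmed by the source. The code computes both readings for comparison.

```python
def tensor_megabytes(shape, bits_per_element):
    """Size of a dense tensor in decimal megabytes, ignoring any scale tensors."""
    elements = 1
    for dim in shape:
        elements *= dim
    return elements * bits_per_element / 8 / 1e6


listed_shape = (3072, 3584)  # gate_proj shape as listed above

# Reading 1: the shape counts logical elements.
logical_bf16_mb = tensor_megabytes(listed_shape, 16)
logical_fp4_mb = tensor_megabytes(listed_shape, 4)

# Reading 2 (assumption): the shape is a packed uint8 shape, two FP4 values per byte.
packed_fp4_mb = tensor_megabytes(listed_shape, 8)
unpacked_shape = (listed_shape[0], listed_shape[1] * 2)
unpacked_bf16_mb = tensor_megabytes(unpacked_shape, 16)

print(f"logical reading: {logical_fp4_mb:.1f} MB NVFP4 / {logical_bf16_mb:.1f} MB BF16")
print(f"packed reading:  {packed_fp4_mb:.1f} MB NVFP4 / {unpacked_bf16_mb:.1f} MB BF16")
```

Under either reading the three projections have the same element count, so the per-layer totals scale by the same factor as the per-projection figures.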