CURRENT_BUG.md — DeepSeek-V4 Blackwell NVFP4

Status: NaN IN MOE — ROOT CAUSE UNKNOWN

Current Symptom

  • vLLM container starts, model loads, server accepts requests
  • Output is empty — model generates tokens but they decode to nothing
  • Debug logs show NaN in hidden_states entering the attention from the FIRST forward pass
  • NaN propagates through all 61 layers → all outputs are NaN → garbage tokens
  • Both C128A (cr=128) and C4A (cr=4) layers have NaN in their inputs

NaN Tracing

Layer 0 (C128A): hidden_states input → ??? → NaN in attention input
Layer 1-59 (C4A): NaN in attention input (propagated)
Layer 60 (SWA): NaN in attention input (propagated)

The NaN is already present BEFORE attention runs, so attention itself is not the source. The current suspect is the MoE output that feeds into the next layer; this is not yet confirmed.
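When tracing, it helps to record more than "has NaN". The sketch below is a hypothetical NumPy helper (`nonfinite_report` is not part of vLLM or CuTeDSL) that separates NaN from Inf and lists the affected rows, assuming the tensor has been copied to host memory as a float array with tokens on the first axis.

```python
import numpy as np

def nonfinite_report(name, x):
    """Summarize NaN/Inf content of an array whose first axis indexes tokens."""
    x = np.asarray(x, dtype=np.float32)
    nan_mask = np.isnan(x)
    inf_mask = np.isinf(x)
    # Collapse every non-token axis so each row is flagged once.
    bad_per_row = (nan_mask | inf_mask).reshape(x.shape[0], -1).any(axis=1)
    return {
        "name": name,
        "shape": x.shape,
        "nan": int(nan_mask.sum()),
        "inf": int(inf_mask.sum()),
        "bad_rows": np.flatnonzero(bad_per_row).tolist(),
    }

# Toy input: 4 tokens, hidden size 8, with one Inf and one NaN injected.
hidden = np.zeros((4, 8), dtype=np.float32)
hidden[1, 0] = np.inf
hidden[2, 3] = np.nan
report = nonfinite_report("layer0.attn_input", hidden)
print(report)
```

If only padded rows show up in `bad_rows`, the padding path is a more likely suspect than the real tokens. Inf appearing before NaN would point at overflow rather than an invalid operation.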

Architecture: DeepSeek-V4 MegaMoE

  • 384 experts, top-6 routing — this is a "MegaMoE" architecture
  • DeepGEMM has a specialized mega_moe.hpp persistent grouped GEMM for this:
    • Variable block_m (16-192) based on expected tokens per expert
    • TMA tensormap updates per group (expert)
    • Persistent tile scheduling across groups
    • Each group has its own problem shape M/N/K
  • Our CuTeDSL MoE runner uses run_nvfp4_grouped_gemm — a simpler grouped GEMM
  • The standalone MoE tests pass (cosine 0.988) but may not exercise the same shapes/paths as vLLM
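The dependence on "expected tokens per expert" can be made concrete with a little arithmetic. The sketch below assumes uniform routing (real routers are not uniform) and uses hypothetical helper names. Only the 384-expert and top-6 figures come from this document.

```python
NUM_EXPERTS = 384  # from the architecture notes above
TOP_K = 6          # from the architecture notes above

def expected_tokens_per_expert(num_tokens, top_k=TOP_K, num_experts=NUM_EXPERTS):
    """Average rows per expert if routing were perfectly uniform (an assumption)."""
    return num_tokens * top_k / num_experts

def min_empty_experts(num_tokens, top_k=TOP_K, num_experts=NUM_EXPERTS):
    """Lower bound on experts that receive zero rows, regardless of routing."""
    return max(0, num_experts - num_tokens * top_k)

for tokens in (1, 8, 8192):
    print(tokens, expected_tokens_per_expert(tokens), min_empty_experts(tokens))
```

With a single decode token, at most 6 of the 384 experts receive any rows, so nearly every group in the grouped GEMM is empty. That is the regime the standalone tests would need to cover to match vLLM decode.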

What's Been Verified (B200 venv, all passing)

| Component | Test | Result |
|---|---|---|
| NVFP4 Linear (q_a, kv, q_b, o_b) | cosine per projection | 0.998-1.0 |
| NVFP4 MoE (L1 gate+up, L2 down) | cosine per layer | 0.988 |
| KV cache roundtrip (fp8) | cosine | 0.999 |
| Decode attention (1 query vs N KV) | cosine | 0.9998 |
| Full pipeline (inv RoPE + o_a + o_b) | cosine | 0.996-0.999 |
| All 5 layer types | cosine | ≥0.996 |
| E2E 61-layer (shared experts) | logits | std=3.16 reasonable |
| CSA sparse attention (C4A) | cosine | 0.974 |
| CSA sparse attention (C128A) | cosine | 0.668 (avg-pooled KV) |
| Multi-step decode | cosine | 0.999 |
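For reference, a minimal sketch of the cosine metric used in the results above, written for flat lists of floats. The helper names and the 0.98 threshold are illustrative, not taken from the test suite. The explicit finiteness check is there because any comparison with NaN is False, so a check written as "fail if cosine < threshold" would let a NaN output pass.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; NaN if undefined."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0 or math.isnan(norm):
        return float("nan")
    return dot / norm

def passes(out, ref, threshold=0.98):
    """Reject non-finite outputs first, then apply the cosine threshold."""
    if not all(math.isfinite(v) for v in out):
        return False
    return cosine_similarity(out, ref) >= threshold

ref = [1.0, 2.0, 3.0]
print(passes([1.0, 2.0, 3.1], ref))           # close to the reference
print(passes([1.0, float("nan"), 3.0], ref))  # contains NaN
```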

What's Been Fixed in vLLM Integration

  1. Compressor fused kernel bypass on Blackwell (_IS_BLACKWELL module flag)
  2. Double Q normalization removed (fused_qnorm only does RoPE now)
  3. RoPE sin slice bug fixed (half:2*half not half:)
  4. fp8 dequant fix (use kv_dequantize_fp8 not .to(bf16))
  5. Wrapper attribute access (self.mla_attn.kv_cache etc.)
  6. Paged KV decode using decode_swa_indices from metadata
  7. UnboundLocalError fix for debug prints

What's NOT Working

  • Container produces empty/garbage output
  • NaN in hidden_states from first forward pass
  • The NaN most likely comes from the MoE (routed experts) or from the activation quantization; neither is confirmed yet
  • The CuTeDSL grouped GEMM may produce NaN for certain expert token distributions
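One generic way block-scaled quantization can emit NaN is an all-zero block, for example a zero-padded row: the block amax is 0, the scale is 0, and 0/0 is NaN. The NumPy sketch below only illustrates that failure mode. It is not the implementation of `quantize_activation_nvfp4`, and whether that function guards against it is exactly what needs checking. It uses the NVFP4 block size of 16 and the E2M1 maximum magnitude of 6.0; the function name and `min_scale` parameter are hypothetical.

```python
import numpy as np

BLOCK = 16     # NVFP4 scales one block of 16 elements at a time
FP4_MAX = 6.0  # largest magnitude representable in E2M1

def scale_blocks(x, min_scale=0.0):
    """Divide each 16-element block by its scale (amax / 6), before rounding to FP4."""
    blocks = np.asarray(x, dtype=np.float32).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.maximum(amax / FP4_MAX, min_scale)
    with np.errstate(divide="ignore", invalid="ignore"):
        return blocks / scale

padded_rows = np.zeros((2, 32), dtype=np.float32)     # stand-in for zero padding
unguarded = scale_blocks(padded_rows)                 # 0 / 0 in every block
guarded = scale_blocks(padded_rows, min_scale=1e-12)  # clamped scale

print(np.isnan(unguarded).all(), np.isfinite(guarded).all())
```

Feeding the real quantizer an input that contains at least one all-zero block of 16 is a cheap way to confirm or rule this out.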

Test Plan — Finding the NaN

Phase 1: Reproduce the NaN in the B200 venv (outside container)

  1. Test CuTeDSLMoERunner.run() with the EXACT same inputs vLLM would provide:
    • hidden_states from the embedding + first layer attention
    • topk_ids and topk_weights from the router
    • Variable token counts per expert, including the vLLM padding to 128 rows
  2. Test with 1 token (decode), 8 tokens (small prefill), and padded shapes
  3. Check for NaN after L1 GEMM, after SiLU activation, after L2 GEMM
  4. Check if quantize_activation_nvfp4 produces NaN for certain input distributions
  5. Check if run_nvfp4_grouped_gemm produces NaN for certain expert offsets
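The per-stage check in step 3 can be sketched as a small harness that runs the stages in order and stops at the first non-finite output. Everything here is a stand-in: the "GEMMs" are plain NumPy matmuls, the MLP omits the gate/up split, and `first_nonfinite_stage` is a hypothetical helper. In the real test each lambda would wrap the corresponding call inside `CuTeDSLMoERunner.run()`.

```python
import numpy as np

def silu(x):
    # SiLU(x) = x * sigmoid(x); overflow in exp(-x) still yields a finite result.
    with np.errstate(over="ignore"):
        return x / (1.0 + np.exp(-x))

def first_nonfinite_stage(x, stages):
    """Return (stage_name, output) for the first stage emitting NaN/Inf,
    or (None, final_output) if every stage stays finite."""
    for name, fn in stages:
        x = fn(x)
        if not np.isfinite(x).all():
            return name, x
    return None, x

rng = np.random.default_rng(0)
w1 = rng.standard_normal((8, 16)).astype(np.float32)
w2 = rng.standard_normal((16, 8)).astype(np.float32)
w2_bad = w2.copy()
w2_bad[0, 0] = np.nan  # simulate a corrupted L2 weight or scale

x = rng.standard_normal((4, 8)).astype(np.float32)
good = [("l1_gemm", lambda h: h @ w1), ("silu", silu), ("l2_gemm", lambda h: h @ w2)]
bad = [("l1_gemm", lambda h: h @ w1), ("silu", silu), ("l2_gemm", lambda h: h @ w2_bad)]

print(first_nonfinite_stage(x, good)[0])
print(first_nonfinite_stage(x, bad)[0])
```

Running the same harness for 1 token, 8 tokens, and padded shapes keeps the three cases from step 2 comparable.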

Phase 2: Verify the grouped GEMM with expert-parallel shapes

  1. Test with 48 experts (EP8, 384/8), 1-8 tokens, top-6
  2. Test with padding to 128 rows per expert
  3. Check if the GEMM handles zero-token experts correctly
  4. Check if expert_offsets and padded_expert_offsets are correct for MegaMoE shapes
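A pure-Python reference for the offsets makes step 4 checkable by hand. This sketch assumes that padded offsets round each expert's row count up to a multiple of 128 and that empty experts keep zero rows. The runner's actual layout for `expert_offsets` and `padded_expert_offsets` may differ and should be compared against this, not assumed to match. The function name is hypothetical.

```python
def build_expert_offsets(topk_ids, num_experts, pad_to=128):
    """Count rows per expert, then build exclusive prefix sums with and without padding."""
    counts = [0] * num_experts
    for token_experts in topk_ids:
        for expert in token_experts:
            counts[expert] += 1

    offsets = [0]
    padded_offsets = [0]
    for count in counts:
        offsets.append(offsets[-1] + count)
        padded_count = -(-count // pad_to) * pad_to  # ceil to a multiple of pad_to
        padded_offsets.append(padded_offsets[-1] + padded_count)
    return counts, offsets, padded_offsets

# One decode token routed to 6 of the 48 local experts (EP8 case).
counts, offsets, padded_offsets = build_expert_offsets(
    [[0, 2, 5, 7, 9, 11]], num_experts=48
)
print(counts.count(0), offsets[-1], padded_offsets[-1])
```

In this example 42 of the 48 local experts have zero rows, so consecutive offsets repeat. A kernel that derives a group's M from `offsets[i + 1] - offsets[i]` has to tolerate M = 0 for those groups.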

Phase 3: Test the full layer forward (attention + MoE)

  1. Run layer 0 (C128A) with real weights, check output for NaN
  2. Run layer 2 (C4A) with real weights, check output for NaN
  3. If NaN appears, bisect: which component produces it?

Phase 4: Fix and verify

  1. Fix the NaN source
  2. Run all B200 venv tests
  3. Build container, test with real inference
  4. Verify output is actual text (not empty, not garbage)

Key References

  • Grouped Blockscaled GEMM on B200 — CuTeDSL persistent grouped GEMM with TMA tensormap updates per group
  • DeepGEMM mega_moe.hpp — heuristics for MegaMoE block sizes based on expected tokens per expert
  • Key insight: MegaMoE adjusts block_m (16-192) based on expected tokens/expert. For decode (few tokens), block_m=16-32. For prefill, block_m=192.