CURRENT_BUG.md — DeepSeek-V4 Blackwell NVFP4

Status: NaN in vLLM Container — Source is vLLM Infrastructure, NOT Our Kernels

Symptom

  • vLLM container starts, model loads, server accepts requests
  • Output is empty — model generates tokens but they decode to nothing
  • Debug logs show NaN in the hidden_states entering attention, starting with the first forward pass
  • NaN propagates through all 61 layers → all outputs are NaN → garbage tokens

Root Cause Investigation

Our kernels are NOT the source of NaN. Every component has been tested standalone on the B200 venv with real weights and zero NaN:

| Test | Result |
|---|---|
| Single expert (gate+up+down) × 4 experts | No NaN, all token counts |
| Activation quantization (quantize_activation_nvfp4) | No NaN |
| CuTeDSL MoE runner (grouped GEMM, 16 experts) | No NaN, all token counts |
| Full layer (attention + MoE + shared expert) | No NaN |
| Multi-layer chain (C128A → C4A → SWA, shared experts) | No NaN |

The NaN comes from vLLM's compiled execution infrastructure, specifically one of:

  1. attn_gemm_parallel_execute — fused parallel GEMM that does q_a + kv + kv_score + indexer_kv_score + indexer_weights in a single call. This is MergedColumnParallelLinear, NOT our CuTeDSL kernel. On Blackwell, the out_dtype=torch.float32 or the FP8 quantization in this kernel may produce NaN.

  2. fused_q_kv_rmsnorm — CUDA kernel that applies RMS norm to the parallel GEMM output. May produce NaN if the input has extreme values from the parallel GEMM.

  3. Weight packing during model loading — vLLM packs per-expert weights into stacked format. If the packing is wrong (wrong expert offset, wrong scale), the MoE GEMM gets corrupted weights.

  4. torch.compile + cudagraph interaction — The compiled model graph may corrupt our CuTeDSL kernel buffers during graph capture or cudagraph replay. The _needs_token_refill flag exists because CuTeDSL's cute.compile zeroes GPU memory during JIT.
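To make hypothesis 2 concrete, here is a minimal pure-Python sketch of how an RMS norm can turn one overflowed input into NaN. It is not the fused_q_kv_rmsnorm CUDA kernel: `rms_norm` and its `eps` default are illustrative assumptions, and Python floats stand in for the lower-precision dtypes used on the GPU.

```python
import math

def rms_norm(xs, eps=1e-6):
    """Reference RMS norm over a list of floats (illustrative, not the vLLM kernel)."""
    mean_sq = sum(x * x for x in xs) / len(xs)
    scale = math.sqrt(mean_sq + eps)
    return [x / scale for x in xs]

# Finite inputs normalize cleanly.
clean = rms_norm([1.0, -2.0, 3.0])

# One overflowed element (inf) makes the RMS infinite: inf / inf is NaN,
# and every finite element collapses to 0.0.
poisoned = rms_norm([1.0, float("inf"), 3.0])

print(any(math.isnan(v) for v in clean))     # False
print(any(math.isnan(v) for v in poisoned))  # True
```

If this mechanism is in play, the parallel GEMM output feeding the norm should already contain inf or very large values, so checking for inf as well as NaN at that boundary may be more informative than checking for NaN alone.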

NaN Tracing (from container debug logs)

hidden_states input → NaN (propagated from previous layer)
  ├── Layer 0 (C128A): attention input NaN=False, but output may have NaN after MoE
  ├── Layer 1-59 (C4A): attention input NaN=True (propagated)
  └── Layer 60 (SWA): attention input NaN=True (propagated)

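The first NaN shows up at the input of Layer 1, the first C4A layer, while the Layer 0 attention input is still clean. The NaN is therefore introduced inside Layer 0, after its attention input, which suggests it originates from the MoE routed experts in the compiled model.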
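The reasoning behind that reading of the trace can be sketched as a small helper: find the first layer whose input is NaN, and the layer immediately before it is the one that produced it. The function names and the per-layer flag list are hypothetical; the flags are transcribed from the trace above rather than read from real logs.

```python
def layer_type(idx, num_layers=61):
    """Layer schedule as shown in the trace: C128A first, SWA last, C4A in between."""
    if idx == 0:
        return "C128A"
    if idx == num_layers - 1:
        return "SWA"
    return "C4A"

def locate_first_nan(input_has_nan):
    """Return (first layer with NaN input, layer that must have produced it)."""
    for idx, flag in enumerate(input_has_nan):
        if flag:
            # idx == 0 would mean the NaN was already present in the model input.
            producer = idx - 1 if idx > 0 else None
            return idx, producer
    return None, None

# Flags transcribed from the trace: Layer 0 input clean, Layers 1-60 NaN.
flags = [False] + [True] * 60
first, producer = locate_first_nan(flags)
print(first, layer_type(first))        # 1 C4A
print(producer, layer_type(producer))  # 0 C128A
```

Splitting the Layer 0 check further (after attention, after routed experts, after the shared expert) would narrow the producer from a whole layer down to one sub-module.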

Next Steps

  1. Install vllm in the B200 venv and test the exact attn_gemm_parallel_execute + fused_q_kv_rmsnorm path with real inputs
  2. Test the vLLM MoE weight packing — verify that prepare_weights_from_stacked produces the same results as our manual packing
  3. Test with torch.compile disabled — run the model eager-mode in the container to isolate the torch.compile interaction
  4. Add NaN checks inside the parallel GEMM — wrap attn_gemm_parallel_execute with NaN detection to pinpoint the exact source
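Step 4 could be done with a small wrapper like the one below. This is a sketch under assumptions: `has_nan` and `nan_guard` are hypothetical helpers, and `toy_gemm` and `toy_overflow` are stand-ins for the real call, whose signature is not shown in this note. Tensors are handled by duck typing through `.isnan().any()`, which `torch.Tensor` provides, so the sketch itself needs only the standard library.

```python
import functools
import math

def has_nan(value):
    """True if value contains NaN. Handles floats, nested lists/tuples, and
    tensor-like objects exposing .isnan().any() (torch.Tensor does)."""
    if isinstance(value, float):
        return math.isnan(value)
    if isinstance(value, (list, tuple)):
        return any(has_nan(v) for v in value)
    if hasattr(value, "isnan"):
        return bool(value.isnan().any())
    return False

def nan_guard(name):
    """Decorator that raises as soon as the wrapped call consumes or emits NaN."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if has_nan(args) or has_nan(tuple(kwargs.values())):
                raise FloatingPointError(f"{name}: NaN in inputs")
            out = fn(*args, **kwargs)
            if has_nan(out):
                raise FloatingPointError(f"{name}: NaN in outputs")
            return out
        return wrapper
    return decorate

@nan_guard("toy_gemm")  # stand-in for the real parallel GEMM call
def toy_gemm(xs):
    return [x * 2.0 for x in xs]

@nan_guard("toy_overflow")
def toy_overflow(xs):
    # inf - inf is NaN: finite inputs in, NaN out, as a faulty kernel would do.
    return [x * float("inf") - float("inf") for x in xs]
```

Distinguishing "NaN in inputs" from "NaN in outputs" is the point of the wrapper: only the second case implicates the wrapped call. Converting a GPU tensor check to a Python bool forces a host sync, so this is probably only usable in eager mode, not under cudagraph capture.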

What's Been Verified and Fixed (Attention Pipeline)

All B200 venv tests pass with cosine 0.996-0.999:

  • KV cache write (RoPE → fp8 quant → paged cache)
  • KV cache read (paged cache → fp8 dequant → BF16)
  • Decode attention (1 query vs N cached KVs)
  • Full pipeline (inv RoPE + o_a BMM + o_b)
  • All layer types exercised so far (C128A, C4A, SWA)
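For reference, the cosine figure quoted for these tests is presumably the usual cosine similarity between a flattened reference output and the output under test. A minimal sketch follows; the function name and the sample vectors are illustrative and are not taken from the actual test suite.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length float sequences."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)

reference = [0.50, -1.25, 2.00, 0.75]  # made-up reference output, flattened
candidate = [0.52, -1.20, 1.95, 0.80]  # made-up output under test, flattened
score = cosine_similarity(reference, candidate)
print(f"cosine = {score:.4f}")
```

A single NaN in either vector makes the score NaN, and any threshold comparison against NaN is false, so a passing cosine check also implies the compared outputs were NaN-free.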

vLLM integration fixes applied:

  1. Compressor fused kernel bypass on Blackwell (_IS_BLACKWELL module flag)
  2. Double Q normalization removed (fused_qnorm only does RoPE)
  3. RoPE sin slice bug fixed
  4. fp8 dequant fix (proper kv_dequantize_fp8)
  5. Wrapper attribute access via self.mla_attn
  6. Paged KV decode using decode_swa_indices from metadata

Architecture Notes

  • DeepSeek-V4 is MegaMoE (384 experts, top-6)
  • DeepGEMM has a specialized persistent grouped GEMM for MegaMoE with TMA tensormap updates per expert
  • Our CuTeDSL MoE runner uses run_nvfp4_grouped_gemm (simpler grouped GEMM, but proven correct)
  • The expert intermediate size is 3072 (not 18432 — that's the total for 6 experts × 3072)