CURRENT_BUG.md — DeepSeek-V4 Blackwell NVFP4
Status: NaN in vLLM Container — Source is vLLM Infrastructure, NOT Our Kernels
Symptom
- vLLM container starts, model loads, server accepts requests
- Output is empty — model generates tokens but they decode to nothing
- Debug logs show NaN in hidden_states entering the attention from the first forward pass
- NaN propagates through all 61 layers → all outputs are NaN → garbage tokens
Root Cause Investigation
Our kernels are NOT the source of NaN. Every component has been tested standalone on the B200 venv with real weights and zero NaN:
| Test | Result |
|---|---|
| Single expert (gate+up+down) × 4 experts | ✅ No NaN, all token counts |
| Activation quantization (`quantize_activation_nvfp4`) | ✅ No NaN |
| CuTeDSL MoE runner (grouped GEMM, 16 experts) | ✅ No NaN, all token counts |
| Full layer (attention + MoE + shared expert) | ✅ No NaN |
| Multi-layer chain (C128A → C4A → SWA, shared experts) | ✅ No NaN |
The NaN comes from vLLM's compiled execution infrastructure, specifically one of:
- `attn_gemm_parallel_execute`: fused parallel GEMM that computes q_a + kv + kv_score + indexer_kv_score + indexer_weights in a single call. This is `MergedColumnParallelLinear`, NOT our CuTeDSL kernel. On Blackwell, the `out_dtype=torch.float32` setting or the FP8 quantization in this kernel may produce NaN.
- `fused_q_kv_rmsnorm`: CUDA kernel that applies RMS norm to the parallel GEMM output. May produce NaN if the input has extreme values from the parallel GEMM.
- Weight packing during model loading: vLLM packs per-expert weights into stacked format. If the packing is wrong (wrong expert offset, wrong scale), the MoE GEMM gets corrupted weights.
- `torch.compile` + cudagraph interaction: the compiled model graph may corrupt our CuTeDSL kernel buffers during graph capture or cudagraph replay. The `_needs_token_refill` flag exists because CuTeDSL's `cute.compile` zeroes GPU memory during JIT.
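The second candidate can be reproduced without a GPU. The following is a minimal pure-Python reference RMS norm, not the `fused_q_kv_rmsnorm` CUDA kernel; the function name, the `eps` value, and the sample inputs are assumptions chosen for illustration. It shows how a single overflowed value (`inf`) from an upstream GEMM is enough to turn the normalized output into NaN.

```python
import math

def rms_norm(x, eps=1e-6):
    """Reference RMS norm over a list of floats (eps is an illustrative value)."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = math.sqrt(mean_sq + eps)
    return [v / scale for v in x]

# Finite input: the output stays finite.
ok = rms_norm([1.0, -2.0, 3.0])

# One overflowed element: mean_sq becomes inf, so scale is inf.
# Finite elements collapse to 0.0 and the inf element becomes inf / inf = NaN.
out = rms_norm([1.0, -2.0, float("inf")])
print(out)
```

If the real kernel behaves like this reference, checking the parallel GEMM output for `inf` as well as NaN would catch the problem one step earlier.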
NaN Tracing (from container debug logs)
hidden_states input → NaN (propagated from previous layer)
├── Layer 0 (C128A): attention input NaN=False, but output may have NaN after MoE
├── Layer 1-59 (C4A): attention input NaN=True (propagated)
└── Layer 60 (SWA): attention input NaN=True (propagated)
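The first NaN at an attention input appears at a C4A layer (layer 1), while the layer 0 attention input is clean. This suggests the NaN is introduced inside layer 0, most likely by the MoE routed experts in the compiled model.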
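The reasoning in the trace can be made mechanical. The sketch below assumes the per-layer NaN flags have already been parsed out of the container logs; `LayerTrace` and `first_nan_layer` are hypothetical helpers, and the trace values are copied from the summary above.

```python
from typing import NamedTuple

class LayerTrace(NamedTuple):
    index: int            # layer number
    kind: str             # layer type label, e.g. "C128A", "C4A", "SWA"
    input_has_nan: bool   # NaN flag observed at the attention input

def first_nan_layer(trace):
    """Return the first entry whose attention input is NaN, or None."""
    for entry in sorted(trace, key=lambda e: e.index):
        if entry.input_has_nan:
            return entry
    return None

# Trace reconstructed from the debug-log summary.
trace = (
    [LayerTrace(0, "C128A", False)]
    + [LayerTrace(i, "C4A", True) for i in range(1, 60)]
    + [LayerTrace(60, "SWA", True)]
)

hit = first_nan_layer(trace)
# A NaN input at layer k means the NaN was produced somewhere inside layer k - 1.
suspect = hit.index - 1 if hit is not None and hit.index > 0 else None
print(f"first NaN input: layer {hit.index} ({hit.kind}); suspect layer: {suspect}")
```

With the flags above this points at layer 0, so any finer-grained NaN checks are best placed inside that layer first.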
Next Steps
- Install vllm in the B200 venv and test the exact `attn_gemm_parallel_execute` + `fused_q_kv_rmsnorm` path with real inputs
- Test the vLLM MoE weight packing: verify that `prepare_weights_from_stacked` produces the same results as our manual packing
- Test with `torch.compile` disabled: run the model in eager mode in the container to isolate the `torch.compile` interaction
- Add NaN checks inside the parallel GEMM: wrap `attn_gemm_parallel_execute` with NaN detection to pinpoint the exact source
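The last step could look roughly like the decorator below. This is a sketch under stated assumptions: `nan_guard`, `list_has_nan`, and `toy_gemm` are hypothetical names, and the demo uses plain Python lists so it runs anywhere. For real tensors the predicate would be something like `lambda t: torch.is_tensor(t) and bool(torch.isnan(t).any())`; converting that result to `bool` forces a host sync, so the check is only expected to work in eager mode, not during CUDA graph capture.

```python
import functools
import math

def nan_guard(name, has_nan):
    """Wrap a function and raise if any positional input or output contains NaN."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for i, arg in enumerate(args):
                if has_nan(arg):
                    raise FloatingPointError(f"{name}: NaN in positional input {i}")
            out = fn(*args, **kwargs)
            # Treat a tuple return value as multiple outputs.
            outputs = out if isinstance(out, tuple) else (out,)
            for i, o in enumerate(outputs):
                if has_nan(o):
                    raise FloatingPointError(f"{name}: NaN in output {i}")
            return out
        return wrapper
    return decorator

def list_has_nan(values):
    """Demo predicate for lists of floats."""
    return isinstance(values, list) and any(math.isnan(v) for v in values)

@nan_guard("toy_gemm", list_has_nan)
def toy_gemm(x, w):
    # Stand-in for the real kernel: elementwise product.
    return [a * b for a, b in zip(x, w)]

clean = toy_gemm([1.0, 2.0], [3.0, 4.0])  # passes the guard

message = None
try:
    toy_gemm([1.0, float("inf")], [1.0, 0.0])  # inf * 0.0 produces NaN
except FloatingPointError as exc:
    message = str(exc)
print(message)
```

Because the guard reports whether the NaN was already present in an input or first appeared in an output, wrapping each suspect call separately narrows the source to a single call.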
What's Been Verified and Fixed (Attention Pipeline)
All B200 venv tests pass with cosine 0.996-0.999:
- KV cache write (RoPE → fp8 quant → paged cache)
- KV cache read (paged cache → fp8 dequant → BF16)
- Decode attention (1 query vs N cached KVs)
- Full pipeline (inv RoPE + o_a BMM + o_b)
- All 5 layer types (C128A, C4A, SWA)
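For reference, the cosine metric quoted above can be computed as in the following sketch. The helper name, the sample vectors, and the use of 0.996 (the low end of the reported range) as a pass threshold are assumptions for illustration, not the actual test harness.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

reference = [0.5, -1.25, 2.0, 0.75]    # e.g. BF16 reference output
candidate = [0.5, -1.25, 2.0, 0.8125]  # e.g. output after an fp8 round trip

score = cosine_similarity(reference, candidate)
assert score >= 0.996, f"cosine too low: {score:.4f}"

# Cosine is insensitive to a uniform scale factor, so it cannot catch
# a wrong global scale on its own.
scaled = cosine_similarity(reference, [2.0 * v for v in reference])
```

Since a uniformly mis-scaled output still scores close to 1.0, pairing the cosine check with a max-absolute-error check may be worthwhile when scale factors are under suspicion.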
vLLM integration fixes applied:
- Compressor fused kernel bypass on Blackwell (`_IS_BLACKWELL` module flag)
- Double Q normalization removed (`fused_qnorm` only does RoPE)
- RoPE sin slice bug fixed
- fp8 dequant fix (proper `kv_dequantize_fp8`)
- Wrapper attribute access via `self.mla_attn`
- Paged KV decode using `decode_swa_indices` from metadata
Architecture Notes
- DeepSeek-V4 is MegaMoE (384 experts, top-6)
- DeepGEMM has a specialized persistent grouped GEMM for MegaMoE with TMA tensormap updates per expert
- Our CuTeDSL MoE runner uses `run_nvfp4_grouped_gemm` (simpler grouped GEMM, but proven correct)
- The expert intermediate size is 3072 (not 18432; that is the total for 6 experts × 3072)
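The sizes in these notes are easy to mix up, so a short arithmetic check may help. The constant names are illustrative and not taken from the model config.

```python
NUM_EXPERTS = 384                # routed experts
TOP_K = 6                        # experts selected per token
EXPERT_INTERMEDIATE_SIZE = 3072  # per-expert intermediate width

# Combined intermediate width of the experts active for one token.
active_intermediate = TOP_K * EXPERT_INTERMEDIATE_SIZE
# Fraction of routed experts active for one token.
active_fraction = TOP_K / NUM_EXPERTS

print(active_intermediate)  # 18432
print(active_fraction)      # 0.015625
```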