CURRENT_BUG.md — DeepSeek-V4 Blackwell NVFP4

Status: NaN in vLLM Container — Source is vLLM Infrastructure, NOT Our Kernels

Symptom

vLLM container starts, model loads, server accepts requests
Output is empty — model generates tokens but they decode to nothing
Debug logs show NaN in the hidden_states entering attention, starting with the first forward pass
NaN propagates through all 61 layers → all outputs are NaN → garbage tokens

Root Cause Investigation

Our kernels are NOT the source of NaN. Every component has been tested standalone in the B200 venv with real weights, and none produced NaN:

| Test | Result |
| --- | --- |
| Single expert (gate+up+down) × 4 experts | ✅ No NaN, all token counts |
| Activation quantization (`quantize_activation_nvfp4`) | ✅ No NaN |
| CuTeDSL MoE runner (grouped GEMM, 16 experts) | ✅ No NaN, all token counts |
| Full layer (attention + MoE + shared expert) | ✅ No NaN |
| Multi-layer chain (C128A → C4A → SWA, shared experts) | ✅ No NaN |

The NaN comes from vLLM's compiled execution infrastructure, specifically one of:

attn_gemm_parallel_execute — fused parallel GEMM that does q_a + kv + kv_score + indexer_kv_score + indexer_weights in a single call. This is MergedColumnParallelLinear, NOT our CuTeDSL kernel. On Blackwell, the out_dtype=torch.float32 or the FP8 quantization in this kernel may produce NaN.
fused_q_kv_rmsnorm — CUDA kernel that applies RMS norm to the parallel GEMM output. May produce NaN if the input has extreme values from the parallel GEMM.
Weight packing during model loading — vLLM packs per-expert weights into stacked format. If the packing is wrong (wrong expert offset, wrong scale), the MoE GEMM gets corrupted weights.
torch.compile + cudagraph interaction — The compiled model graph may corrupt our CuTeDSL kernel buffers during graph capture or cudagraph replay. The _needs_token_refill flag exists because CuTeDSL's cute.compile zeroes GPU memory during JIT.
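
For the weight-packing suspect in the list above, one cheap sanity check is to compare each expert's slice of the stacked buffer against the original per-expert weights. The following is a minimal sketch in plain Python, with nested lists standing in for GPU tensors. `stack_experts` and `find_packing_mismatches` are hypothetical helpers written for illustration, not vLLM functions, and the row-wise layout is an assumption about how the stacked format is organized.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def stack_experts(per_expert: Sequence[Matrix]) -> Matrix:
    """Reference packing: concatenate expert matrices row-wise, expert 0 first."""
    stacked: Matrix = []
    for weight in per_expert:
        stacked.extend([list(row) for row in weight])
    return stacked


def find_packing_mismatches(stacked: Matrix, per_expert: Sequence[Matrix]) -> List[int]:
    """Return indices of experts whose slice of `stacked` differs from the source.

    Assumes every expert has the same number of rows, so expert e
    occupies rows [e * rows, (e + 1) * rows) of the stacked matrix.
    """
    rows = len(per_expert[0])
    bad = []
    for e, weight in enumerate(per_expert):
        if stacked[e * rows:(e + 1) * rows] != [list(r) for r in weight]:
            bad.append(e)
    return bad


# Four toy experts, two rows each; values encode the expert index.
experts = [[[float(e), float(e) + 0.5]] * 2 for e in range(4)]
good = stack_experts(experts)
# Simulate a wrong expert offset: experts 1 and 2 swapped in the buffer.
swapped = good[0:2] + good[4:6] + good[2:4] + good[6:8]

print(find_packing_mismatches(good, experts))     # no mismatches expected
print(find_packing_mismatches(swapped, experts))  # experts 1 and 2 flagged
```

The same slice-by-slice comparison would apply to the per-expert scales, since a wrong scale corrupts the GEMM just as a wrong offset does.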

NaN Tracing (from container debug logs)

hidden_states input → NaN (propagated from previous layer)
  ├── Layer 0 (C128A): attention input NaN=False, but output may have NaN after MoE
  ├── Layer 1-59 (C4A): attention input NaN=True (propagated)
  └── Layer 60 (SWA): attention input NaN=True (propagated)

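The FIRST NaN is observed at the attention input of a C4A layer (layer 1), while the layer 0 attention input is still clean. The NaN is therefore introduced inside layer 0, after its attention input, which suggests it originates from the MoE routed experts in the compiled model, although the layer 0 attention path has not been ruled out.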
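
The reasoning over the trace can be mechanized: find the first layer whose attention input already contains NaN, then treat the layer before it as the suspect. A minimal sketch follows; `LayerTrace` and `first_nan_layer` are hypothetical names, and the flags are copied from the trace above rather than parsed from real logs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayerTrace:
    index: int
    kind: str            # "C128A", "C4A", or "SWA"
    input_has_nan: bool  # NaN flag logged at the attention input


def first_nan_layer(trace: List[LayerTrace]) -> Optional[LayerTrace]:
    """Return the first layer whose attention input already contains NaN."""
    for layer in sorted(trace, key=lambda t: t.index):
        if layer.input_has_nan:
            return layer
    return None


# Flags copied from the trace above: layer 0 clean, layers 1-60 NaN.
trace = [LayerTrace(0, "C128A", False)]
trace += [LayerTrace(i, "C4A", True) for i in range(1, 60)]
trace.append(LayerTrace(60, "SWA", True))

hit = first_nan_layer(trace)
if hit is not None:
    # A NaN at the input of layer k was produced somewhere in layer k - 1
    # (or before the first layer when k == 0).
    producer = hit.index - 1
    print(f"first NaN input: layer {hit.index} ({hit.kind}); suspect layer {producer}")
```

Logging an extra flag after attention and after the MoE block of the suspect layer would narrow the source to one of the two.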

Next Steps

Install vllm in the B200 venv and test the exact attn_gemm_parallel_execute + fused_q_kv_rmsnorm path with real inputs
Test the vLLM MoE weight packing — verify that prepare_weights_from_stacked produces the same results as our manual packing
Test with torch.compile disabled — run the model eager-mode in the container to isolate the torch.compile interaction
Add NaN checks inside the parallel GEMM — wrap attn_gemm_parallel_execute with NaN detection to pinpoint the exact source
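
The NaN-detection step can be sketched as a decorator that checks a function's inputs and outputs and raises only when clean inputs produce a NaN output. This is a framework-agnostic sketch under stated assumptions: `nan_guard` and `contains_nan` are hypothetical names, the default predicate only understands plain Python floats and containers, and `fake_gemm` merely stands in for the real call. In the container, a tensor-aware predicate (for example one built on `torch.isnan`) would be passed as `has_nan` instead.

```python
import functools
import math
from typing import Any, Callable


def contains_nan(value: Any) -> bool:
    """True if a float, or any float nested in lists/tuples/dicts, is NaN."""
    if isinstance(value, float):
        return math.isnan(value)
    if isinstance(value, dict):
        return any(contains_nan(v) for v in value.values())
    if isinstance(value, (list, tuple)):
        return any(contains_nan(v) for v in value)
    return False


def nan_guard(name: str, has_nan: Callable[[Any], bool] = contains_nan):
    """Decorator: raise if the wrapped function turns NaN-free inputs into NaN."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Apply the predicate per argument so it can be tensor-specific.
            dirty_in = any(has_nan(a) for a in args) or any(
                has_nan(v) for v in kwargs.values()
            )
            out = fn(*args, **kwargs)
            outs = out if isinstance(out, (tuple, list)) else (out,)
            if any(has_nan(o) for o in outs) and not dirty_in:
                raise FloatingPointError(f"{name}: clean inputs, NaN output")
            return out
        return wrapper
    return decorate


@nan_guard("fake_gemm")
def fake_gemm(x, scale):
    # Stand-in for the real kernel: 0.0 * inf yields NaN.
    return [v * scale for v in x]


print(fake_gemm([1.0, 2.0], 0.5))
try:
    fake_gemm([0.0, 1.0], float("inf"))
except FloatingPointError as err:
    print(err)
```

Because the guard stays silent when the inputs are already NaN, only the first offending call in a chain raises. A host-side check like this forces a device synchronization on GPU tensors, so it is best combined with the eager-mode run rather than with cudagraph capture.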

What's Been Verified and Fixed (Attention Pipeline)

All B200 venv tests pass with cosine similarity of 0.996-0.999:

KV cache write (RoPE → fp8 quant → paged cache)
KV cache read (paged cache → fp8 dequant → BF16)
Decode attention (1 query vs N cached KVs)
Full pipeline (inv RoPE + o_a BMM + o_b)
All 5 layer types (C128A, C4A, SWA)
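
For reference, the cosine figure quoted above is presumably the standard cosine similarity between a flattened kernel output and a flattened reference output. A minimal pure-Python sketch follows; the function name and the sample vectors are illustrative and not taken from the test suite.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = a.b / (|a| * |b|); 1.0 means identical direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


reference = [1.0, -2.0, 3.0, 0.5]
quantized = [1.1, -1.9, 2.8, 0.6]  # made-up values with a small quantization error
score = cosine_similarity(reference, quantized)
print(f"cosine = {score:.4f}")
```

A NaN anywhere in either vector makes the score NaN, and any comparison against NaN is false, so a threshold check such as `score >= 0.99` fails rather than passing silently.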

vLLM integration fixes applied:

Compressor fused kernel bypass on Blackwell (_IS_BLACKWELL module flag)
Double Q normalization removed (fused_qnorm only does RoPE)
RoPE sin slice bug fixed
fp8 dequant fix (proper kv_dequantize_fp8)
Wrapper attribute access via self.mla_attn
Paged KV decode using decode_swa_indices from metadata

Architecture Notes

DeepSeek-V4 is MegaMoE (384 experts, top-6)
DeepGEMM has a specialized persistent grouped GEMM for MegaMoE with TMA tensormap updates per expert
Our CuTeDSL MoE runner uses run_nvfp4_grouped_gemm (simpler grouped GEMM, but proven correct)
The expert intermediate size is 3072 (not 18432 — that's the total for 6 experts × 3072)
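
The sizes in these notes can be checked with a few lines of arithmetic. The constant names below are illustrative; the values are the ones stated above.

```python
NUM_EXPERTS = 384           # routed experts in the model
TOP_K = 6                   # experts selected per token
EXPERT_INTERMEDIATE = 3072  # intermediate size of a single expert

# Combined intermediate width across the experts active for one token.
active_width = TOP_K * EXPERT_INTERMEDIATE
# Fraction of routed experts that a single token activates.
active_fraction = TOP_K / NUM_EXPERTS

print(active_width)
print(f"{active_fraction:.4%}")
```

A per-expert GEMM shaped with 18432 instead of 3072 would therefore be sized for all six active experts at once.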
