CURRENT_BUG.md — DeepSeek-V4 Blackwell NVFP4
Status: NaN IN MOE — ROOT CAUSE UNKNOWN
Current Symptom
- vLLM container starts, model loads, server accepts requests
- Output is empty — model generates tokens but they decode to nothing
- Debug logs show NaN in hidden_states entering attention on the FIRST forward pass
- NaN propagates through all 61 layers → all outputs are NaN → garbage tokens
- Both C128A (cr=128) and C4A (cr=4) layers have NaN in their inputs
NaN Tracing
Layer 0 (C128A): hidden_states input → ??? → NaN in attention input
Layer 1-59 (C4A): NaN in attention input (propagated)
Layer 60 (SWA): NaN in attention input (propagated)
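The NaN is already present BEFORE attention runs, so attention is not the source. For layers 1 onward it arrives in the MoE output of the previous layer; where it first appears inside layer 0 is still unknown (the ??? above).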
Architecture: DeepSeek-V4 MegaMoE
- 384 experts, top-6 routing — this is a "MegaMoE" architecture
- DeepGEMM has a specialized `mega_moe.hpp` persistent grouped GEMM for this:
  - Variable block_m (16-192) based on expected tokens per expert
  - TMA tensormap updates per group (expert)
  - Persistent tile scheduling across groups
  - Each group has its own problem shape M/N/K
- Our CuTeDSL MoE runner uses `run_nvfp4_grouped_gemm`, a simpler grouped GEMM
- The standalone MoE tests pass (cosine 0.988) but may not exercise the same shapes/paths as vLLM
What's Been Verified (B200 venv, all passing)
| Component | Test | Result |
|---|---|---|
| NVFP4 Linear (q_a, kv, q_b, o_b) | cosine per projection | 0.998-1.0 |
| NVFP4 MoE (L1 gate+up, L2 down) | cosine per layer | 0.988 |
| KV cache roundtrip (fp8) | cosine | 0.999 |
| Decode attention (1 query vs N KV) | cosine | 0.9998 |
| Full pipeline (inv RoPE + o_a + o_b) | cosine | 0.996-0.999 |
| All 5 layer types | cosine | ≥0.996 |
| E2E 61-layer (shared experts) | logits std=3.16 | reasonable |
| CSA sparse attention (C4A) | cosine | 0.974 |
| CSA sparse attention (C128A) | cosine | 0.668 (avg-pooled KV) |
| Multi-step decode | cosine | 0.999 |
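
The cosine figures above can hide a NaN if the comparison is written carelessly, because every ordered comparison against NaN is False. The following is a minimal sketch of a NaN-aware comparison, assuming the standard definition of cosine similarity over flattened tensors. The helper names `cosine` and `compare` are illustrative and are not taken from the test suite.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length float sequences."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return float("nan")  # undefined for a zero vector
    return dot / (na * nb)

def compare(ref, out):
    """Return (cosine, finite) so a NaN output cannot pass silently."""
    finite = all(math.isfinite(v) for v in out)
    return cosine(ref, out), finite

good = compare([1.0, 2.0], [1.0, 2.0])
bad = compare([1.0, 2.0], [1.0, float("nan")])
```

A check written as `cos >= threshold` fails on NaN, but one written as `not (cos < threshold)` passes, so returning the `finite` flag separately is the safer design.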
What's Been Fixed in vLLM Integration
- Compressor fused kernel bypass on Blackwell (`_IS_BLACKWELL` module flag)
- Double Q normalization removed (fused_qnorm only does RoPE now)
- RoPE sin slice bug fixed (`half:2*half` not `half:`)
- fp8 dequant fix (use `kv_dequantize_fp8` not `.to(bf16)`)
- Wrapper attribute access (`self.mla_attn.kv_cache` etc.)
- Paged KV decode using `decode_swa_indices` from metadata
- `UnboundLocalError` fix for debug prints
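
The RoPE slice fix is easy to regress, so a small illustration may help. The two slices agree only when a row has exactly `2*half` entries; with anything trailing, the open-ended slice picks up extra values. The row contents and the three trailing entries below are made up for the demonstration and do not reflect the real cache layout.

```python
def sin_slice_buggy(row, half):
    return row[half:]            # open-ended: takes everything from `half` on

def sin_slice_fixed(row, half):
    return row[half:2 * half]    # bounded: exactly `half` entries

half = 4
exact = list(range(2 * half))        # row with exactly 2*half entries
wide = list(range(2 * half + 3))     # row with 3 trailing entries (assumed)

same_on_exact = sin_slice_buggy(exact, half) == sin_slice_fixed(exact, half)
same_on_wide = sin_slice_buggy(wide, half) == sin_slice_fixed(wide, half)
```

This is why a test that only uses rows of width `2*half` would not have caught the original bug.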
What's NOT Working
- Container produces empty/garbage output
- NaN in hidden_states from first forward pass
- The NaN comes from the MoE (routed experts) or from the activation quantization
- The CuTeDSL grouped GEMM may produce NaN for certain expert token distributions
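
One candidate failure mode worth ruling out for the activation quantization is an all-zero block: NVFP4 scales activations in blocks of 16 elements, and FP4 E2M1 tops out at 6.0, so a block with amax 0 gives a scale of 0 and an unguarded `x / scale` becomes 0/0. Rows added by padding to 128 would be all zeros. Whether `quantize_activation_nvfp4` guards this case is not known here; the sketch below only classifies blocks of an input row, and `block_status` is a hypothetical helper.

```python
import math

BLOCK = 16       # NVFP4 block size for activation scales
FP4_MAX = 6.0    # largest magnitude representable in FP4 E2M1

def block_status(row, block=BLOCK):
    """Classify each block as 'ok', 'zero' (amax == 0) or 'nonfinite'."""
    status = []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        if not all(math.isfinite(v) for v in chunk):
            status.append("nonfinite")
        elif max(abs(v) for v in chunk) / FP4_MAX == 0.0:
            status.append("zero")   # scale is 0, so x / scale would be 0/0
        else:
            status.append("ok")
    return status

row = [0.0] * 16 + [1.0] * 16 + [float("inf")] + [0.0] * 15
statuses = block_status(row)
```

Running this over the real `hidden_states` captured from vLLM would show whether zero or non-finite blocks reach the quantizer at all.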
Test Plan — Finding the NaN
Phase 1: Reproduce the NaN in the B200 venv (outside container)
- Test `CuTeDSLMoERunner.run()` with the EXACT same inputs vLLM would provide:
  - `hidden_states` from the embedding + first layer attention
  - `topk_ids` and `topk_weights` from the router
  - Variable token counts per expert (the vLLM padding to 128)
- Test with 1 token (decode), 8 tokens (small prefill), and padded shapes
- Check for NaN after L1 GEMM, after SiLU activation, after L2 GEMM
- Check if `quantize_activation_nvfp4` produces NaN for certain input distributions
- Check if `run_nvfp4_grouped_gemm` produces NaN for certain expert offsets
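
When checking after each stage, it may be worth counting Inf as well as NaN. An Inf produced by the L1 GEMM is not a NaN yet, but SiLU gating turns it into one: `silu(-inf)` is `-inf * 0`, which is NaN. The toy below uses plain Python floats rather than the real kernels, and `check_expert` is a hypothetical helper, not part of the runner.

```python
import math

def silu(x):
    """x * sigmoid(x), with a sigmoid that does not overflow for finite x."""
    if x >= 0:
        sig = 1.0 / (1.0 + math.exp(-x))
    else:
        e = math.exp(x)
        sig = e / (1.0 + e)
    return x * sig

def count_bad(values):
    """Return (nan_count, inf_count) for a flat list of floats."""
    nans = sum(1 for v in values if math.isnan(v))
    infs = sum(1 for v in values if math.isinf(v))
    return nans, infs

def check_expert(gate, up):
    """Report NaN/Inf counts after L1 and after the SiLU gating step."""
    act = [silu(g) * u for g, u in zip(gate, up)]
    return {"after_l1": count_bad(gate + up), "after_silu": count_bad(act)}

report = check_expert(gate=[1.0, float("-inf"), 2.0],
                      up=[0.5, 1.0, float("inf")])
```

In this example the L1 output holds two Inf values and no NaN, while the activation holds one NaN and one Inf. A NaN-only check after L1 would therefore blame SiLU for a problem that started in the GEMM. The same `count_bad` call applies to the L2 output.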
Phase 2: Verify the grouped GEMM with expert-parallel shapes
- Test with 48 experts (EP8, 384/8), 1-8 tokens, top-6
- Test with padding to 128 rows per expert
- Check if the GEMM handles zero-token experts correctly
- Check if `expert_offsets` and `padded_expert_offsets` are correct for MegaMoE shapes
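
A reference layout built in plain Python gives something to diff the runner's offsets against. The sketch below assumes that `expert_offsets` is a prefix sum of per-expert token counts and that `padded_expert_offsets` rounds each expert up to a multiple of 128 rows. Both semantics are guesses from the names, as is the choice that a zero-token expert occupies zero padded rows.

```python
def round_up(n, multiple):
    return (n + multiple - 1) // multiple * multiple

def expert_layout(topk_ids, num_experts, pad_to=128):
    """Build counts and prefix-sum offsets from router output.

    topk_ids: one list of expert ids per token.
    Each returned offsets list has num_experts + 1 entries.
    """
    counts = [0] * num_experts
    for ids in topk_ids:
        for e in ids:
            if not 0 <= e < num_experts:
                raise ValueError(f"expert id {e} out of range")
            counts[e] += 1
    offsets, padded = [0], [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
        # Assumption: a zero-token expert takes zero padded rows.
        padded.append(padded[-1] + round_up(c, pad_to))
    return counts, offsets, padded

def check_layout(counts, offsets, padded, pad_to=128):
    """Invariants a grouped GEMM would rely on under the assumptions above."""
    assert offsets[-1] == sum(counts)
    assert all(b >= a for a, b in zip(offsets, offsets[1:]))
    assert all(p % pad_to == 0 for p in padded)
    assert all(pb - pa >= c for pa, pb, c in zip(padded, padded[1:], counts))

# Two tokens, top-6, 48 local experts (the EP8 case).
topk_ids = [[0, 1, 2, 3, 4, 5], [0, 7, 8, 9, 10, 11]]
counts, offsets, padded = expert_layout(topk_ids, num_experts=48)
check_layout(counts, offsets, padded)
```

Here 37 of the 48 experts receive no tokens, which is exactly the zero-token case the grouped GEMM has to handle.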
Phase 3: Test the full layer forward (attention + MoE)
- Run layer 0 (C128A) with real weights, check output for NaN
- Run layer 2 (C4A) with real weights, check output for NaN
- If NaN appears, bisect: which component produces it?
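
Because NaN propagates forward, the bisect can be a real binary search over sequential checkpoints rather than a linear scan. The sketch below assumes a caller-supplied `is_bad(i)` probe that reports whether the output of checkpoint `i` contains NaN or Inf. Both the probe and the function name are hypothetical.

```python
def first_bad_index(num_stages, is_bad):
    """Binary search for the first checkpoint with a non-finite output.

    Assumes monotonicity: once a checkpoint is bad, all later ones are bad.
    Returns the first bad index, or None if every checkpoint is clean.
    """
    lo, hi = 0, num_stages
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo if lo < num_stages else None

# Simulated run over 61 layers where layer 17 is the first bad one.
probes = []
def simulated_probe(i):
    probes.append(i)
    return i >= 17

found = first_bad_index(61, simulated_probe)
```

The monotonicity assumption only holds for checkpoints that run in sequence. Parallel branches such as routed and shared experts need to be probed separately.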
Phase 4: Fix and verify
- Fix the NaN source
- Run all B200 venv tests
- Build container, test with real inference
- Verify output is actual text (not empty, not garbage)
Key References
- Grouped Blockscaled GEMM on B200 — CuTeDSL persistent grouped GEMM with TMA tensormap updates per group
- DeepGEMM mega_moe.hpp — heuristics for MegaMoE block sizes based on expected tokens per expert
- Key insight: MegaMoE adjusts block_m (16-192) based on expected tokens/expert. For decode (few tokens), block_m=16-32. For prefill, block_m=192.
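
The arithmetic behind the key insight, as a short sketch: with 384 experts and top-6 routing, the mean token count per expert is `num_tokens * 6 / 384` under uniform routing. The uniform-routing assumption and the function names are illustrative; the actual block_m heuristic lives in DeepGEMM and is not reproduced here.

```python
def expected_tokens_per_expert(num_tokens, top_k=6, num_experts=384):
    """Mean tokens routed to each expert, assuming uniform routing."""
    return num_tokens * top_k / num_experts

def max_active_experts(num_tokens, top_k=6, num_experts=384):
    """Upper bound on experts that receive at least one token."""
    return min(num_experts, num_tokens * top_k)

decode = expected_tokens_per_expert(1)          # 6 / 384
small_prefill = expected_tokens_per_expert(8)   # 48 / 384
idle_at_decode = 384 - max_active_experts(1)
```

At decode, at least 378 of the 384 experts see no tokens, so the test shapes for the grouped GEMM should be dominated by empty groups rather than full ones.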