CURRENT_BUG.md — DeepSeek-V4 Blackwell NVFP4

Status: NaN IN MOE — ROOT CAUSE UNKNOWN

Current Symptom

vLLM container starts, model loads, server accepts requests
Output is empty — model generates tokens but they decode to nothing
Debug logs show NaN in the hidden_states entering attention, starting with the first forward pass
NaN propagates through all 61 layers → all outputs are NaN → garbage tokens
Both C128A (cr=128) and C4A (cr=4) layers have NaN in their inputs

NaN Tracing

Layer 0 (C128A): hidden_states input → ??? → NaN in attention input
Layer 1-59 (C4A): NaN in attention input (propagated)
Layer 60 (SWA): NaN in attention input (propagated)

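The NaN is already present in the attention input, so attention itself is not the source. For layers 1-60 it arrives in the previous layer's MoE output; how it gets into layer 0's attention input is still unexplained (the ??? above).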

Architecture: DeepSeek-V4 MegaMoE

384 experts, top-6 routing — this is a "MegaMoE" architecture
DeepGEMM has a specialized mega_moe.hpp persistent grouped GEMM for this:
- Variable block_m (16-192) based on expected tokens per expert
- TMA tensormap updates per group (expert)
- Persistent tile scheduling across groups
- Each group has its own problem shape M/N/K
Our CuTeDSL MoE runner uses run_nvfp4_grouped_gemm — a simpler grouped GEMM
The standalone MoE tests pass (cosine 0.988) but may not exercise the same shapes/paths as vLLM

What's Been Verified (B200 venv, all passing)

| Component | Test | Result |
|---|---|---|
| NVFP4 Linear (q_a, kv, q_b, o_b) | cosine per projection | 0.998-1.0 |
| NVFP4 MoE (L1 gate+up, L2 down) | cosine per layer | 0.988 |
| KV cache roundtrip (fp8) | cosine | 0.999 |
| Decode attention (1 query vs N KV) | cosine | 0.9998 |
| Full pipeline (inv RoPE + o_a + o_b) | cosine | 0.996-0.999 |
| All 5 layer types | cosine | ≥0.996 |
| E2E 61-layer (shared experts) | logits std=3.16 | reasonable |
| CSA sparse attention (C4A) | cosine | 0.974 |
| CSA sparse attention (C128A) | cosine | 0.668 (avg-pooled KV) |
| Multi-step decode | cosine | 0.999 |
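
For reference, a minimal sketch of how a cosine check like the ones in this table could be written so that it cannot silently pass on NaN. The helper names (`cosine_similarity`, `check_cosine`) are hypothetical and are not taken from the existing test suite; the inputs are flat Python float sequences, used here as stand-ins for flattened tensors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two flat float sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return float("nan")  # undefined for an all-zero vector
    return dot / (norm_a * norm_b)

def check_cosine(actual, reference, threshold):
    """Return (ok, cosine). Any NaN/Inf in `actual` is a failure."""
    if not all(math.isfinite(x) for x in actual):
        return False, float("nan")
    cos = cosine_similarity(actual, reference)
    # `nan >= threshold` is False, so an undefined cosine also fails.
    return cos >= threshold, cos
```

Two caveats apply to any check of this kind. Cosine is scale-invariant, so it does not catch outputs that are uniformly too large or too small. Comparisons with NaN always evaluate to False, so a test written as "fail if cos < threshold" would pass on NaN output; an explicit finiteness check avoids that.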

What's Been Fixed in vLLM Integration

Compressor fused kernel bypass on Blackwell (_IS_BLACKWELL module flag)
Double Q normalization removed (fused_qnorm only does RoPE now)
RoPE sin slice bug fixed (half:2*half not half:)
fp8 dequant fix (use kv_dequantize_fp8 not .to(bf16))
Wrapper attribute access (self.mla_attn.kv_cache etc.)
Paged KV decode using decode_swa_indices from metadata
UnboundLocalError fix for debug prints

What's NOT Working

Container produces empty/garbage output
NaN in hidden_states from first forward pass
The NaN comes from the MoE (routed experts) or from the activation quantization
The CuTeDSL grouped GEMM may produce NaN for certain expert token distributions

Test Plan — Finding the NaN

Phase 1: Reproduce the NaN in the B200 venv (outside container)

Test CuTeDSLMoERunner.run() with the EXACT same inputs vLLM would provide:
- hidden_states from the embedding + first layer attention
- topk_ids and topk_weights from the router
- Variable token counts per expert, including vLLM's padding to 128 rows
Test with 1 token (decode), 8 tokens (small prefill), and padded shapes
Check for NaN after L1 GEMM, after SiLU activation, after L2 GEMM
Check if quantize_activation_nvfp4 produces NaN for certain input distributions
Check if run_nvfp4_grouped_gemm produces NaN for certain expert offsets
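
One candidate worth ruling out for the activation-quantization check: NVFP4 shares one scale across each 16-element block, typically derived from the block's absolute maximum. An all-zero block (for example a zero-padded row) then has a zero amax, and a quantizer that divides by the resulting scale without a guard computes 0/0. Whether `quantize_activation_nvfp4` guards this case is not established here. The sketch below only locates such blocks in an input; `risky_blocks` is a hypothetical helper, and rows are plain Python lists standing in for tensor rows.

```python
import math

NVFP4_BLOCK = 16  # NVFP4 shares one scale across each 16-element block

def risky_blocks(rows, block=NVFP4_BLOCK):
    """Yield (row, block_index, reason) for blocks that could break a
    naive `x / scale` quantizer: non-finite inputs or an all-zero block."""
    for r, row in enumerate(rows):
        for start in range(0, len(row), block):
            chunk = row[start:start + block]
            if not all(math.isfinite(v) for v in chunk):
                yield r, start // block, "non-finite input"
            elif max(abs(v) for v in chunk) == 0.0:
                yield r, start // block, "zero amax"

# The second row mimics a zero-padded row appended to reach 128 rows.
rows = [[0.5] * 32, [0.0] * 32]
print(list(risky_blocks(rows)))
```

If the real padded input contains zero-amax blocks and the quantizer output for those rows is NaN, that would point at the quantizer rather than the grouped GEMM.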

Phase 2: Verify the grouped GEMM with expert-parallel shapes

Test with 48 experts (EP8, 384/8), 1-8 tokens, top-6
Test with padding to 128 rows per expert
Check if the GEMM handles zero-token experts correctly
Check if expert_offsets and padded_expert_offsets are correct for MegaMoE shapes
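
A reference computation for the offset checks above could look like the sketch below. It is an assumption-laden stand-in, not the runner's actual code: `build_expert_offsets` is a hypothetical name, expert ids are assumed to be already remapped to the local range (0-47 under EP8), and padding is assumed to round each expert's row count up to a multiple of 128 while leaving zero-token experts at zero rows.

```python
def build_expert_offsets(topk_ids, num_experts, pad_to=128):
    """Compute per-expert token counts and prefix-sum offsets.

    topk_ids: one list of local expert ids per token (top-k entries each).
    Returns (counts, expert_offsets, padded_expert_offsets); each offsets
    list has num_experts + 1 entries and starts at 0.
    """
    counts = [0] * num_experts
    for token_experts in topk_ids:
        for e in token_experts:
            if not 0 <= e < num_experts:
                raise ValueError(f"expert id {e} out of range")
            counts[e] += 1

    expert_offsets = [0]
    padded_expert_offsets = [0]
    for c in counts:
        # ASSUMPTION: round up to a multiple of pad_to; an expert with zero
        # tokens gets zero rows. The real runner may reserve a full tile.
        padded = -(-c // pad_to) * pad_to
        expert_offsets.append(expert_offsets[-1] + c)
        padded_expert_offsets.append(padded_expert_offsets[-1] + padded)
    return counts, expert_offsets, padded_expert_offsets

# Two decode tokens, top-6, 48 local experts: 37 experts receive nothing.
topk_ids = [[0, 1, 2, 3, 4, 5], [0, 7, 8, 9, 10, 47]]
counts, offsets, padded = build_expert_offsets(topk_ids, num_experts=48)
print(counts.count(0), offsets[-1], padded[-1])
```

Comparing these lists element by element against the `expert_offsets` and `padded_expert_offsets` the runner builds would show whether zero-token experts produce empty or out-of-range row ranges.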

Phase 3: Test the full layer forward (attention + MoE)

Run layer 0 (C128A) with real weights, check output for NaN
Run layer 2 (C4A) with real weights, check output for NaN
If NaN appears, bisect: which component produces it?
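
The bisection step can be sketched as a linear scan that stops at the first stage whose output is no longer finite. The code below is framework-agnostic and uses toy stand-in stages on Python lists; the stage names and lambdas are illustrative assumptions, not the real layer components. It reports Inf as well as NaN, because `inf - inf` and `0 * inf` both yield NaN, so checking only for NaN tends to blame the stage after the one that actually overflowed.

```python
import math

def non_finite_kind(values):
    """Return "nan", "inf", or None for a flat sequence of floats."""
    if any(math.isnan(v) for v in values):
        return "nan"
    if any(math.isinf(v) for v in values):
        return "inf"
    return None

def first_bad_stage(stages, x):
    """Run (name, fn) stages in order; return (name, kind) for the first
    non-finite result, ("input", kind) if x is already bad, else None."""
    kind = non_finite_kind(x)
    if kind:
        return "input", kind
    for name, fn in stages:
        x = fn(x)
        kind = non_finite_kind(x)
        if kind:
            return name, kind
    return None

# Toy stand-ins: the first stage can overflow to inf, the second turns
# inf into NaN via inf - inf.
stages = [
    ("l1_gemm", lambda x: [v * 1e308 for v in x]),
    ("activation", lambda x: [v - v for v in x]),
]
print(first_bad_stage(stages, [10.0]))  # the overflow is caught at l1_gemm
print(first_bad_stage(stages, [1.0]))   # all outputs finite
```

In the real layer, the stages would be the attention block, the router, activation quantization, the L1 GEMM, the activation, and the L2 GEMM, each wrapped so that its output can be inspected.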

Phase 4: Fix and verify

Fix the NaN source
Run all B200 venv tests
Build container, test with real inference
Verify output is actual text (not empty, not garbage)

Key References

Grouped Blockscaled GEMM on B200 — CuTeDSL persistent grouped GEMM with TMA tensormap updates per group
DeepGEMM mega_moe.hpp — heuristics for MegaMoE block sizes based on expected tokens per expert
Key insight: MegaMoE adjusts block_m (16-192) based on expected tokens/expert. For decode (few tokens), block_m=16-32. For prefill, block_m=192.
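
To make the "expected tokens per expert" quantity concrete, the arithmetic below uses the 384-expert, top-6 configuration from this document and assumes uniform routing. It does not reproduce DeepGEMM's block_m selection, which is not described here; both function names are hypothetical.

```python
def expected_tokens_per_expert(num_tokens, top_k=6, num_experts=384):
    """Mean rows per expert if routing were uniform."""
    return num_tokens * top_k / num_experts

def min_idle_experts(num_tokens, top_k=6, num_experts=384):
    """Lower bound on experts that receive no tokens at all."""
    return max(0, num_experts - num_tokens * top_k)

for n in (1, 8, 8192):
    print(n, expected_tokens_per_expert(n), min_idle_experts(n))
```

For a single decode token the expectation is 6/384 ≈ 0.016 rows per expert, and at least 378 of the 384 experts (or at least 42 of 48 local experts under EP8) receive nothing. It takes 8192 tokens to reach an average of 128 rows per expert. Decode therefore consists almost entirely of zero-token groups, which is the regime the standalone tests most need to cover.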
