Current Bug: vLLM produces NaN from layer 0
Status: Active debugging, BF16 dequant fix in progress
Date: 2026-05-18
Symptom
- vLLM server starts, loads model, but every inference produces NaN logits
- Diagnostic prints show NaN from layer 0 onward — no layer ever produces valid output
- Empty content in chat completions, NaN in logprobs
Root Cause (in progress)
The attention NVFP4 linear layers produce NaN immediately.
The attention projections go through vLLM's FlashInferCutlassNvFp4LinearKernel, which uses the checkpoint input_scale as the activation global scale for scaled_fp4_quant(). The checkpoint input_scale values appear too small for the activations seen at runtime (1/input_scale is in the thousands, see Key evidence), causing overflow → NaN.
What we've tried
- ✅ MoE kernel is NOT the problem: test_runner_vllm_style.py with warmup gs gives cosine 0.988, no NaN
- ❌ Dequant ALL attn projections to BF16: crashed, wo_a.weight_scale_inv missing (fp8_einsum needs it)
- ❌ Dequant all except wo_a (keep wo_a as FP8): still NaN from layer 0. wq_a and wkv don't exist as separate attrs; they're fused as fused_wqa_wkv
- ❌ Changed to dequant fused_wqa_wkv: still NaN from layer 0. Debug prints added to check if the attrs are actually found.
Current theory
The BF16 dequant code may not be finding fused_wqa_wkv on the attention module, so it silently skips the most important projection. Debug logging added in latest commit to verify.
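One way to rule the silent skip in or out is to make the lookup report what it could not find. The following is a minimal sketch, not the actual dequant code: the helper name find_projections, the logger name, and the stand-in module are hypothetical, and the projection names are taken from the architecture notes in this document.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("dequant_debug")  # hypothetical logger name

# Projection names from the architecture notes; wo_a is left out because it stays FP8.
EXPECTED_PROJECTIONS = ("fused_wqa_wkv", "wq_b", "wo_b")


def find_projections(attn, names=EXPECTED_PROJECTIONS):
    """Return (found, missing) instead of silently skipping absent attributes."""
    found, missing = {}, []
    for name in names:
        module = getattr(attn, name, None)
        if module is None:
            missing.append(name)
        else:
            found[name] = module
    if missing:
        # vars() is enough for this stand-in object; on a real torch.nn.Module,
        # list attn.named_children() instead, since submodules are not in __dict__.
        available = sorted(k for k in vars(attn) if not k.startswith("_"))
        logger.warning("missing projections %s; available attrs: %s", missing, available)
    return found, missing


# Stand-in for an attention module that lacks the fused projection.
fake_attn = SimpleNamespace(wq_b=object(), wo_b=object())
found, missing = find_projections(fake_attn)
print("found:", sorted(found), "missing:", missing)
```

If the missing list is non-empty for layer 0, the attribute path is the problem; if it is empty, the projection was found and the NaN source has to be somewhere else.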
Attention architecture (DeepSeek V4 MLA)
- fused_wqa_wkv: MergedColumnParallelLinear (q_a + kv fused)
- wq_b: ColumnParallelLinear (second Q projection after RoPE)
- wo_a: ColumnParallelLinear (FP8 via fp8_einsum, weight-only, NO input_scale)
- wo_b: ColumnParallelLinear (final output projection)
- compressor: already handled (reconstructed to BF16 from checkpoint)
Why wo_a is safe as FP8
wo_a uses fp8_einsum which does output = fp8_act * fp8_weight * scale. It's a weight-only FP8 GEMM — no input_scale involved. The NaN comes from scaled_fp4_quant(x, input_global_scale_inv) in the other projections.
Key evidence
- q_a_proj.input_scale = 0.00025141 → 1/input_scale = 3977.6 → quantizing activations with amax ~2-8 by 3977.6x = massive overflow
- q_b_proj.input_scale = 0.00006140 → 1/input_scale = 16287.1 → even worse
- Embedding values: amax=1.27, std=0.09, very small values that get multiplied by thousands during quantization
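The arithmetic behind these numbers can be checked with a short sketch. It assumes the common NVFP4 recipe, which may not match what scaled_fp4_quant() does internally: activations are multiplied by the global scale (1/input_scale), each block gets a scale of block_amax * global_scale / 6, and that block scale is stored in FP8 E4M3, whose largest finite value is 448. The function names are illustrative only.

```python
FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 E2M1
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (assumed block-scale dtype)


def block_scale(block_amax: float, input_scale: float) -> float:
    """Per-block scale before the cast to FP8, under the assumed recipe."""
    global_scale = 1.0 / input_scale
    return block_amax * global_scale / FP4_E2M1_MAX


def max_safe_amax(input_scale: float) -> float:
    """Largest block amax whose scale still fits in FP8 E4M3."""
    return FP8_E4M3_MAX * FP4_E2M1_MAX * input_scale


# input_scale values copied from the checkpoint evidence above.
report = {}
for name, input_scale in {"q_a_proj": 0.00025141, "q_b_proj": 0.00006140}.items():
    report[name] = {
        "max_safe_amax": max_safe_amax(input_scale),
        "overflows_at_amax_2": block_scale(2.0, input_scale) > FP8_E4M3_MAX,
        "overflows_at_amax_8": block_scale(8.0, input_scale) > FP8_E4M3_MAX,
    }
    print(name, report[name])
```

Under these assumptions the block scale stays representable only for amax up to roughly 0.68 (q_a_proj) and 0.17 (q_b_proj), so activations with amax in the 2-8 range would exceed the FP8 range for both projections.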
Next steps
- Check debug logs to see which projections were actually dequantized
- If fused_wqa_wkv wasn't found, fix the attribute path
- If it was found and dequantized, the NaN source is elsewhere (wo_b? wq_b? something else?)
- Consider: maybe the NaN is from the KV cache FP8 quantization or the RoPE implementation
Docker/Build Notes
- Build: screen -dmS build bash -c './build_and_run.sh 2>&1 | tee build.log'
- Currently using --enforce-eager + CLAWMINE_DEBUG=1 for diagnostics
- Don't hit the API with enforce-eager: JIT spikes crash the container
- For real testing: use compilation-config {"cudagraph_mode": "NONE", "custom_ops": ["all"]} instead of enforce-eager
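Passing the JSON through a shell is easy to get wrong because of the nested quotes, so a small sketch that builds the launch command programmatically may help. It assumes the installed vLLM accepts a --compilation-config flag with a JSON value; MODEL_PATH is a placeholder, since the model path is not given in these notes.

```python
import json
import shlex

# Compilation settings from the note above, used in place of --enforce-eager.
compilation_config = {"cudagraph_mode": "NONE", "custom_ops": ["all"]}

# MODEL_PATH is a placeholder, not a real path from this project.
argv = [
    "vllm", "serve", "MODEL_PATH",
    "--compilation-config", json.dumps(compilation_config),
]

# shlex.join quotes the JSON argument so it survives a copy-paste into a shell.
command = shlex.join(argv)
print(command)
```

The argv list can also be handed directly to subprocess.run, which avoids shell quoting altogether.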