Current Bug: vLLM produces NaN from layer 0

Status: Active debugging (BF16 dequant fix in progress)
Date: 2026-05-18

Symptom

vLLM server starts, loads model, but every inference produces NaN logits
Diagnostic prints show NaN from layer 0 onward — no layer ever produces valid output
Empty content in chat completions, NaN in logprobs

Root Cause (in progress)

The attention NVFP4 linear layers produce NaN immediately.

The attention projections go through vLLM's FlashInferCutlassNvFp4LinearKernel, which uses the checkpoint input_scale as the activation global scale for scaled_fp4_quant(). The checkpoint input_scale values appear far too small for the activations seen at runtime: 1/input_scale is in the thousands, so the scaled activations overflow and produce NaN.

What we've tried

✅ MoE kernel is NOT the problem: test_runner_vllm_style.py with a warmup global scale (gs) gives cosine similarity 0.988 and no NaN
❌ Dequant ALL attn projections to BF16: crashed because wo_a.weight_scale_inv was missing (fp8_einsum needs it)
❌ Dequant all except wo_a (keep wo_a as FP8): still NaN from layer 0. wq_a and wkv don't exist as separate attributes; they are fused as fused_wqa_wkv
❌ Changed to dequant fused_wqa_wkv: still NaN from layer 0. Debug prints added to check whether the attributes are actually found.

Current theory

The BF16 dequant code may not be finding fused_wqa_wkv on the attention module, so it silently skips the most important projection. Debug logging added in latest commit to verify.
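One way to rule out a silent skip is to make the attribute lookup report what it could not find. The following is a minimal sketch, not the project's actual dequant code: resolve_projections, child_names, the logger name, and the SimpleNamespace stand-in are all hypothetical, and the only real API assumed is torch.nn.Module.named_children() when the object happens to be a PyTorch module.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("bf16_dequant_debug")  # hypothetical logger name


def child_names(module):
    """List candidate attribute names to print when a lookup fails."""
    named_children = getattr(module, "named_children", None)
    if callable(named_children):  # torch.nn.Module exposes named_children()
        return sorted(name for name, _ in named_children())
    # Fallback for plain objects such as the stand-in used below.
    return sorted(n for n in vars(module) if not n.startswith("_"))


def resolve_projections(attn_module, wanted):
    """Split the wanted names into found projections and missing names."""
    found, missing = {}, []
    for name in wanted:
        proj = getattr(attn_module, name, None)
        if proj is None:
            missing.append(name)
        else:
            found[name] = proj
    if missing:
        # Warn loudly instead of silently skipping a projection.
        logger.warning("missing %s; available: %s", missing, child_names(attn_module))
    return found, missing


# Stand-in for an attention module; the real object is a vLLM layer.
fake_attn = SimpleNamespace(
    fused_wqa_wkv="proj0", wq_b="proj1", wo_a="proj2", wo_b="proj3"
)
found, missing = resolve_projections(fake_attn, ["wq_a", "wkv", "wq_b", "wo_b"])
print(sorted(found), missing)
```

With the stand-in above, wq_a and wkv come back as missing while fused_wqa_wkv shows up in the list of available names, which is the mismatch described in "What we've tried".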

Attention architecture (DeepSeek V4 MLA)

fused_wqa_wkv — MergedColumnParallelLinear (q_a + kv fused)
wq_b — ColumnParallelLinear (second Q projection after RoPE)
wo_a — ColumnParallelLinear (FP8 via fp8_einsum, weight-only, NO input_scale)
wo_b — ColumnParallelLinear (final output projection)
compressor — already handled (reconstructed to BF16 from checkpoint)

Why `wo_a` is safe as FP8

wo_a uses fp8_einsum, which computes output = fp8_act * fp8_weight * scale. It's a weight-only FP8 GEMM in the sense that only the weight scale is read from the checkpoint; no input_scale is involved. The NaN is suspected to come from scaled_fp4_quant(x, input_global_scale_inv) in the other projections.

Key evidence

q_a_proj.input_scale = 0.00025141 → 1/input_scale = 3977.6 → quantizing activations with amax ~2-8 by 3977.6x = massive overflow
q_b_proj.input_scale = 0.00006140 → 1/input_scale = 16287.1 → even worse
Embedding values: amax=1.27, std=0.09 — very small values that get multiplied by thousands during quantization
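The overflow claim can be checked with a few lines of arithmetic. The sketch below assumes the standard NVFP4 layout (FP4 E2M1 elements with maximum magnitude 6.0, per-block scales stored as FP8 E4M3 with maximum 448) and assumes the per-block scale is computed as block_amax * global_scale / 6 with global_scale = 1/input_scale. It does not call vLLM, the helper names are illustrative, and whether the real kernel saturates or emits NaN on an out-of-range block scale is not confirmed here.

```python
# NVFP4 format limits; how the kernel combines them is an assumption.
FP4_E2M1_MAX = 6.0    # largest magnitude an FP4 E2M1 element can hold
FP8_E4M3_MAX = 448.0  # largest magnitude an FP8 E4M3 block scale can hold


def block_scale(amax, global_scale):
    """Unrounded per-block scale that would have to fit in FP8 E4M3."""
    return amax * global_scale / FP4_E2M1_MAX


def implied_calibration_amax(input_scale):
    """Activation amax for which this input_scale uses exactly the full range."""
    return FP8_E4M3_MAX * FP4_E2M1_MAX * input_scale


# input_scale values from the checkpoint, as listed under "Key evidence".
checkpoint = {"q_a_proj": 0.00025141, "q_b_proj": 0.00006140}
# Embedding amax plus the two ends of the observed ~2-8 activation range.
observed_amax = [1.27, 2.0, 8.0]

for name, input_scale in checkpoint.items():
    gs = 1.0 / input_scale
    print(f"{name}: 1/input_scale={gs:.1f}, "
          f"implied amax={implied_calibration_amax(input_scale):.3f}")
    for amax in observed_amax:
        scale = block_scale(amax, gs)
        verdict = "overflow" if scale > FP8_E4M3_MAX else "ok"
        print(f"  amax={amax}: block scale={scale:.1f} ({verdict})")
```

Under these assumptions, q_a_proj.input_scale corresponds to a calibration amax of roughly 0.68, so even the embedding amax of 1.27 would need a block scale above the E4M3 limit.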

Next steps

Check debug logs to see which projections were actually dequantized
If fused_wqa_wkv wasn't found, fix the attribute path
If it was found and dequantized, the NaN source is elsewhere (wo_b? wq_b? something else?)
Consider: maybe the NaN is from the KV cache FP8 quantization or the RoPE implementation
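To narrow down "elsewhere", it helps to separate the module that first produces NaN from the modules that merely receive it. The sketch below is framework-agnostic and uses made-up values: first_nan_origin and the trace layout are hypothetical, and in the real server the (name, input, output) records would have to be collected some other way, for example with PyTorch forward hooks.

```python
import math


def has_nan(values):
    """True if any value in a flat sequence of floats is NaN."""
    return any(math.isnan(v) for v in values)


def first_nan_origin(trace):
    """Return (name, reason) for the first NaN in an ordered trace.

    trace is a list of (name, input_values, output_values) in execution order.
    """
    for name, inputs, outputs in trace:
        if has_nan(inputs):
            # NaN arrived from upstream, so this module is not the origin.
            return name, "received NaN input"
        if has_nan(outputs):
            return name, "produced NaN from clean input"
    return None


nan = float("nan")
# Made-up values for illustration; only the projection names come from the notes.
trace = [
    ("fused_wqa_wkv", [0.1, -0.2], [0.3, 0.4]),
    ("wq_b", [0.3, 0.4], [nan, 1.0]),
    ("wo_a", [nan, 1.0], [nan, nan]),
]
print(first_nan_origin(trace))
```

Checking inputs before outputs matters here: a module whose input is already NaN tells us the origin lies upstream of whatever was traced, such as the KV cache or RoPE path.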

Docker/Build Notes

Build: screen -dmS build bash -c './build_and_run.sh 2>&1 | tee build.log'
Currently using --enforce-eager + CLAWMINE_DEBUG=1 for diagnostics
Don't hit the API with --enforce-eager: JIT spikes crash the container
For real testing: use --compilation-config {"cudagraph_mode": "NONE", "custom_ops": ["all"]} instead of --enforce-eager
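The JSON value is easy to mangle with shell quoting, so one option is to generate the argument instead of typing it. This is a small sketch using only the Python standard library; it assumes the value is passed through vLLM's --compilation-config flag and leaves the rest of the launch command (model, ports, build_and_run.sh wiring) out because it is not specified in these notes.

```python
import json
import shlex

# Settings copied from the note above.
compilation_config = {"cudagraph_mode": "NONE", "custom_ops": ["all"]}

# Argument pair to append to the server launch command.
flag = ["--compilation-config", json.dumps(compilation_config)]

# Shell-safe rendering, suitable for pasting into a script.
print(shlex.join(flag))
```

Passing flag as list elements to a subprocess call avoids shell quoting entirely; the shlex.join form is only needed when the command is written into a shell script.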
