Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel uses checkpoint input_scale for activation quantization, which produces NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a, wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method. This trades memory for correctness. Future optimization: add warmup for attention input_global_scale_inv (same as MoE warmup).
Current Bug: vLLM produces NaN from layer 0
Status: ROOT CAUSE IDENTIFIED
Date: 2026-05-18
Symptom
- vLLM server starts, loads model, but every inference produces NaN logits
- Diagnostic prints show NaN from layer 0 onward — no layer ever produces valid output
Root Cause
The attention NVFP4 linear layers produce NaN immediately.
The attention projections (q_a_proj, q_b_proj, kv_proj, o_a_proj, o_b_proj) go through vLLM's FlashInferCutlassNvFp4LinearKernel which calls:
x_fp4, x_blockscale = scaled_fp4_quant(x, layer.input_global_scale_inv, ...)
input_global_scale_inv comes from the checkpoint input_scale field. For MoE, we override this with a warmup. For attention, there's no warmup — it uses the raw checkpoint value.
The CompressedTensorsW4A4Fp4.process_weights_after_loading sets:
input_global_scale_inv = layer.input_scale.max().to(torch.float32) # = 0.00025141
layer.alpha = input_global_scale * layer.weight_global_scale
For q_a_proj: input_scale = 0.00025141, so 1/input_scale = 3977.6. The activation quantization divides by 0.00025141, i.e. multiplies by 3977.6. For typical activations with amax ~2-8, the scaled magnitudes land at roughly 8,000 to 32,000, far beyond the FP4 range (max 6.0), causing NaN via overflow.
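The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the numbers in this note, not of the kernel itself. The per-block scale check is an added assumption: NVFP4 stores one FP8 E4M3 scale per block, and E4M3 has a maximum finite value of 448, so a block scale above that cannot be represented either.

```python
# Values taken from the note above. FP8_E4M3_MAX is an added assumption about
# the NVFP4 block-scale dtype; it is not stated in the note.
INPUT_SCALE = 0.00025141   # checkpoint input_scale for q_a_proj
FP4_MAX = 6.0              # largest magnitude representable in FP4 (E2M1)
FP8_E4M3_MAX = 448.0       # largest finite FP8 E4M3 value

multiplier = 1.0 / INPUT_SCALE  # factor applied to activations before FP4 rounding


def scaled_amax(amax: float) -> float:
    """Largest activation magnitude after applying the global multiplier."""
    return amax * multiplier


def block_scale_needed(amax: float) -> float:
    """Per-block scale that would bring the scaled amax back down to FP4_MAX."""
    return scaled_amax(amax) / FP4_MAX


for amax in (2.0, 4.0, 8.0):
    s = scaled_amax(amax)
    b = block_scale_needed(amax)
    print(
        f"amax={amax}: scaled={s:.1f} ({s / FP4_MAX:.0f}x FP4 max), "
        f"block scale={b:.1f}, fits in E4M3: {b <= FP8_E4M3_MAX}"
    )
```

For every amax in the 2-8 range the required block scale exceeds 448, which is consistent with the overflow described above.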
Evidence
- MoE kernel is fine: test_runner_vllm_style.py with warmup gs gives cosine 0.988
- NaN from layer 0: diagnostic prints show ALL layers from 0 produce NaN
- Attention weights dequantize fine: test_attn_weights.py shows no NaN from dequantized BF16 matmul
- The problem is in the NVFP4 activation quantization, not the weights
Fix
The attention input_scale needs the same warmup-based override we did for MoE, OR the input_scale values need to be validated/corrected.
Options:
- Add warmup for attention
input_global_scale_inv— same pattern as MoE: run a dummy forward, capture actual activation amax, compute correct gs - Dequantize attention weights to BF16 (like compressor weights) — avoids NVFP4 activation quantization entirely, at the cost of more memory
- Fix the checkpoint input_scale — if the values are wrong, re-calibrate
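As a rough illustration of the warmup option, the sketch below tracks the activation amax over a few dummy batches and derives a global scale from it. `AmaxTracker` is a hypothetical helper, not code from this repo. The formula `(448 * 6) / amax` is the commonly used NVFP4 convention and is assumed here. Whether the kernel expects this value or its reciprocal in input_global_scale_inv would need to be checked against the actual call site.

```python
FP4_MAX = 6.0         # largest FP4 (E2M1) magnitude
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (assumed block-scale dtype)


class AmaxTracker:
    """Hypothetical helper: tracks the running max |x| seen during warmup."""

    def __init__(self) -> None:
        self.amax = 0.0

    def observe(self, activations) -> None:
        """Fold one batch of activation values into the running amax."""
        for v in activations:
            self.amax = max(self.amax, abs(v))

    def global_scale(self) -> float:
        """Scale that maps the observed amax onto FP8_E4M3_MAX * FP4_MAX."""
        if self.amax == 0.0:
            raise ValueError("no non-zero activations observed during warmup")
        return (FP8_E4M3_MAX * FP4_MAX) / self.amax


# Stand-in for a dummy forward pass: two small batches of activations.
tracker = AmaxTracker()
tracker.observe([0.5, -3.0, 1.25])
tracker.observe([-8.0, 2.0])
gs = tracker.global_scale()  # 2688 / 8
```

Under the same assumed convention, a multiplier of 3977.6 would correspond to an amax of about 0.68, well below the 2-8 range seen in attention activations.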
Option 2 is the quickest path — dequantize attention NVFP4 weights to BF16 at load time (the _dequant_nvfp4_to_bf16 method already exists). This trades memory for correctness.
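For reference, NVFP4-to-float dequantization can be sketched in plain Python as below. This is not the existing _dequant_nvfp4_to_bf16 method, whose implementation is not shown in this note. `dequant_nvfp4_row` is a hypothetical reference function. It assumes two FP4 values per byte with the low nibble first, block scales already decoded to floats, and a multiplicative per-tensor scale. The real checkpoint layout may differ.

```python
# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive elements


def decode_nibble(nibble: int) -> float:
    """Decode one 4-bit E2M1 code into a Python float."""
    magnitude = E2M1_VALUES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def dequant_nvfp4_row(
    packed: bytes, block_scales: list[float], tensor_scale: float
) -> list[float]:
    """Hypothetical reference dequantizer for one row of packed FP4 weights."""
    values = []
    for byte in packed:
        values.append(decode_nibble(byte & 0x0F))  # low nibble first (assumed)
        values.append(decode_nibble(byte >> 4))
    if len(values) != len(block_scales) * BLOCK_SIZE:
        raise ValueError("expected one block scale per 16 elements")
    return [
        v * block_scales[i // BLOCK_SIZE] * tensor_scale
        for i, v in enumerate(values)
    ]


# One block of 16 values: byte 0x72 holds code 0x2 (1.0) then code 0x7 (6.0).
row = dequant_nvfp4_row(
    bytes([0x72] + [0x00] * 7), block_scales=[0.5], tensor_scale=2.0
)
```

A real implementation would do the same lookup and scaling on tensors and cast the result to BF16. The memory cost comes from storing 16 bits per weight instead of 4 bits plus scales.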
Progress
- Removed NaN check (Dynamo incompatible)
- vLLM container starts and loads model
- Confirmed NaN logits from completions API
- MoE kernel: cosine 0.988 with warmup gs — NOT the problem
- NaN starts at layer 0 — attention is the source
- Root cause: attention NVFP4 input_scale from checkpoint produces NaN during activation quantization
- Next: Fix attention NVFP4 path, either dequant to BF16 or add warmup