nvfp4-megamoe-kernel/CURRENT_BUG.md
biondizzle 334e95047e Fix: dequantize ALL attention NVFP4 projections to BF16
Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel
uses checkpoint input_scale for activation quantization, which produces
NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a,
wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken
input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method.

This trades memory for correctness. Future optimization: add warmup
for attention input_global_scale_inv (same as MoE warmup).
2026-05-18 13:09:36 +00:00


Current Bug: vLLM produces NaN from layer 0

Status: ROOT CAUSE IDENTIFIED
Date: 2026-05-18

Symptom

  • vLLM server starts, loads model, but every inference produces NaN logits
  • Diagnostic prints show NaN from layer 0 onward — no layer ever produces valid output

Root Cause

The attention NVFP4 linear layers produce NaN immediately.

The attention projections (q_a_proj, q_b_proj, kv_proj, o_a_proj, o_b_proj) go through vLLM's FlashInferCutlassNvFp4LinearKernel which calls:

x_fp4, x_blockscale = scaled_fp4_quant(x, layer.input_global_scale_inv, ...)

input_global_scale_inv comes from the checkpoint input_scale field. For MoE, we override this with a warmup. For attention, there's no warmup — it uses the raw checkpoint value.

The CompressedTensorsW4A4Fp4.process_weights_after_loading sets:

input_global_scale_inv = layer.input_scale.max().to(torch.float32)  # = 0.00025141
layer.alpha = input_global_scale * layer.weight_global_scale

For q_a_proj, input_scale = 0.00025141, so 1/input_scale ≈ 3977.6. Activation quantization divides by 0.00025141, i.e. multiplies by about 3977.6. For typical activations with amax ~2-8, the scaled values reach roughly 8,000 to 32,000, far beyond the FP4 range (max 6.0), which causes NaN via overflow.
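The arithmetic can be checked with a short sketch. It assumes the standard NVFP4 layout, in which each block of elements carries an FP8 E4M3 scale (largest finite value 448) on top of the global scale. Whether the FlashInfer kernel computes the block scale exactly this way is an assumption, and the helper names are hypothetical.

```python
# Illustrative arithmetic only; constants describe the NVFP4 format,
# helper names are hypothetical and not taken from vLLM or FlashInfer.
FP4_E2M1_MAX = 6.0    # largest magnitude an FP4 (E2M1) element can hold
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (assumed block-scale dtype)


def required_block_scale(block_amax: float, global_scale: float) -> float:
    """Block scale needed to map the block's scaled amax onto FP4 max."""
    return block_amax * global_scale / FP4_E2M1_MAX


def block_scale_overflows(block_amax: float, global_scale: float) -> bool:
    """True when the required block scale no longer fits in FP8 E4M3."""
    return required_block_scale(block_amax, global_scale) > FP8_E4M3_MAX


checkpoint_input_scale = 0.00025141
global_scale = 1.0 / checkpoint_input_scale  # about 3977.6

# Largest activation amax this global scale can represent without overflow.
max_safe_amax = FP8_E4M3_MAX * FP4_E2M1_MAX / global_scale

for amax in (2.0, 8.0):
    print(amax, required_block_scale(amax, global_scale),
          block_scale_overflows(amax, global_scale))
print("max safe amax:", max_safe_amax)
```

Under these assumptions, any block whose amax exceeds roughly 0.68 needs a block scale that FP8 E4M3 cannot hold, which would be consistent with NaN appearing as early as layer 0.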

Evidence

  1. MoE kernel is fine: test_runner_vllm_style.py with warmup gs gives cosine 0.988
  2. NaN from layer 0 — diagnostic prints show ALL layers from 0 produce NaN
  3. Attention weights dequantize fine: test_attn_weights.py shows no NaN from dequantized BF16 matmul
  4. The problem is in the NVFP4 activation quantization, not the weights
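The two checks behind this evidence (no NaN, and cosine similarity against a reference) can be sketched as below. This is a hypothetical stand-in using plain Python lists, not the contents of test_runner_vllm_style.py or test_attn_weights.py, which are not shown here.

```python
import math


def has_nan(values):
    """Return True if any element of the flat sequence is NaN."""
    return any(math.isnan(v) for v in values)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length flat sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical reference (BF16) and candidate (NVFP4) outputs, flattened.
reference = [0.5, -1.25, 2.0, 0.75]
candidate = [0.5, -1.0, 2.0, 1.0]

if has_nan(candidate):
    print("candidate contains NaN")
else:
    print("cosine:", cosine_similarity(reference, candidate))
```

Checking for NaN first matters because a single NaN makes the cosine itself NaN, which hides where the failure started.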

Fix

The attention input_scale needs the same warmup-based override we did for MoE, OR the input_scale values need to be validated/corrected.

Options:

  1. Add warmup for attention input_global_scale_inv — same pattern as MoE: run a dummy forward, capture actual activation amax, compute correct gs
  2. Dequantize attention weights to BF16 (like compressor weights) — avoids NVFP4 activation quantization entirely, at the cost of more memory
  3. Fix the checkpoint input_scale — if the values are wrong, re-calibrate
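Option 1 can be sketched as follows. The formula gs = (448 × 6) / amax is the common NVFP4 calibration convention, which maps the observed amax onto the product of the FP8 E4M3 and FP4 maxima. That the MoE warmup uses exactly this formula is an assumption, and the function name is hypothetical.

```python
import math

FP4_E2M1_MAX = 6.0    # largest FP4 (E2M1) magnitude
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value


def warmup_global_scale(activation_batches):
    """Derive a global scale from activations captured in a dummy forward.

    activation_batches: iterable of flat sequences of floats, one per
    warmup batch (a hypothetical capture format for this sketch).
    """
    amax = 0.0
    for batch in activation_batches:
        for v in batch:
            if not math.isfinite(v):
                raise ValueError("non-finite activation seen during warmup")
            amax = max(amax, abs(v))
    if amax == 0.0:
        raise ValueError("all warmup activations are zero")
    return FP8_E4M3_MAX * FP4_E2M1_MAX / amax


# Activations with amax = 8 give a global scale of 2688 / 8 = 336.
gs = warmup_global_scale([[1.0, -8.0, 0.25], [2.0, 3.5]])
print(gs)
```

For amax in the 2-8 range this yields a global scale between 336 and 1344, well below the roughly 3977.6 implied by the checkpoint value.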

Option 2 is the quickest path — dequantize attention NVFP4 weights to BF16 at load time (the _dequant_nvfp4_to_bf16 method already exists). This trades memory for correctness.
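For reference, NVFP4 dequantization can be sketched in plain Python as below. This is not the repository's _dequant_nvfp4_to_bf16. The nibble order (low nibble first), the block size of 16, and dividing by the weight global scale are assumptions about the checkpoint layout, and the output is Python floats rather than BF16.

```python
# E2M1 magnitudes for the 3 low bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit FP4 (E2M1) code to a float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequant_nvfp4_row_sketch(packed, block_scales, weight_global_scale, block=16):
    """Dequantize one packed row: two FP4 codes per byte, one scale per block.

    Hypothetical helper; assumes the low nibble holds the first element.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))  # low nibble (assumed first)
        values.append(decode_e2m1(byte >> 4))    # high nibble
    if len(values) != len(block_scales) * block:
        raise ValueError("packed length does not match number of block scales")
    return [
        v * block_scales[i // block] / weight_global_scale
        for i, v in enumerate(values)
    ]


# One block of 16 elements: codes 0x1 (0.5) and 0x2 (1.0) alternating.
row = dequant_nvfp4_row_sketch(bytes([0x21] * 8), [4.0], 2.0)
print(row[:4])
```

Because the weights are expanded once at load time, the activation path never calls scaled_fp4_quant, so input_scale is no longer used.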

Progress

  • Removed NaN check (Dynamo incompatible)
  • vLLM container starts and loads model
  • Confirmed NaN logits from completions API
  • MoE kernel: cosine 0.988 with warmup gs — NOT the problem
  • NaN starts at layer 0 — attention is the source
  • Root cause: attention NVFP4 input_scale from checkpoint produces NaN during activation quantization
  • Next: Fix attention NVFP4 path — dequant to BF16 or add warmup