nvfp4-megamoe-kernel/CURRENT_BUG.md

Current Bug: vLLM produces NaN from layer 0

Status: Active debugging, BF16 dequant fix in progress

Date: 2026-05-18

Symptom

  • vLLM server starts, loads model, but every inference produces NaN logits
  • Diagnostic prints show NaN from layer 0 onward — no layer ever produces valid output
  • Empty content in chat completions, NaN in logprobs

Root Cause (in progress)

The attention NVFP4 linear layers produce NaN immediately.

The attention projections go through vLLM's FlashInferCutlassNvFp4LinearKernel which uses checkpoint input_scale as the activation global scale for scaled_fp4_quant(). The checkpoint input_scale values are wrong for this use case, causing overflow → NaN.

What we've tried

  1. MoE kernel is NOT the problem: test_runner_vllm_style.py with warmup gs gives cosine 0.988, no NaN
  2. Dequant ALL attn projections to BF16 — crashed: wo_a.weight_scale_inv missing (fp8_einsum needs it)
  3. Dequant all except wo_a (keep wo_a as FP8) — still NaN from layer 0. wq_a and wkv don't exist as separate attrs — they're fused as fused_wqa_wkv
  4. Changed to dequant fused_wqa_wkv — still NaN from layer 0. Debug prints added to check if the attrs are actually found.

Current theory

The BF16 dequant code may not be finding fused_wqa_wkv on the attention module, so it silently skips the most important projection. Debug logging added in latest commit to verify.
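
One way to make a skipped projection visible instead of silent is to resolve every expected attribute by name and report whatever is missing. The sketch below is a stand-in, not the actual patch: `resolve_projections` and the `SimpleNamespace` fake module are hypothetical, and only the attribute names (`fused_wqa_wkv`, `wq_b`, `wo_a`, `wo_b`) come from these notes.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("attn_dequant_debug")

# Attribute names taken from these notes; wo_a is deliberately left as FP8.
DEQUANT_TARGETS = ("fused_wqa_wkv", "wq_b", "wo_b")


def resolve_projections(attn_module, names=DEQUANT_TARGETS):
    """Split expected projection names into (found, missing) for one module."""
    found, missing = {}, []
    for name in names:
        layer = getattr(attn_module, name, None)
        if layer is None:
            missing.append(name)
        else:
            found[name] = layer
    if missing:
        # Log what *is* present so a wrong attribute path is obvious.
        # On a real torch.nn.Module, named_children() would list the child
        # layers; vars() is enough for this plain stand-in object.
        present = sorted(k for k in vars(attn_module) if not k.startswith("_"))
        logger.warning("missing %s; module has %s", missing, present)
    return found, missing


# Hypothetical module where the fused projection lives under another name.
fake_attn = SimpleNamespace(wqkv_a="fused", wq_b="q_b", wo_a="o_a", wo_b="o_b")
found, missing = resolve_projections(fake_attn)
print(sorted(found), missing)
```

If `missing` is non-empty for layer 0, the dequant pass never touched that projection, which would match the theory above.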

Attention architecture (DeepSeek V4 MLA)

  • fused_wqa_wkv — MergedColumnParallelLinear (q_a + kv fused)
  • wq_b — ColumnParallelLinear (second Q projection after RoPE)
  • wo_a — ColumnParallelLinear (FP8 via fp8_einsum, weight-only, NO input_scale)
  • wo_b — ColumnParallelLinear (final output projection)
  • compressor — already handled (reconstructed to BF16 from checkpoint)

Why wo_a is safe as FP8

wo_a uses fp8_einsum which does output = fp8_act * fp8_weight * scale. It's a weight-only FP8 GEMM — no input_scale involved. The NaN comes from scaled_fp4_quant(x, input_global_scale_inv) in the other projections.

Key evidence

  • q_a_proj.input_scale = 0.00025141, so 1/input_scale = 3977.6 → quantizing activations with amax ~2-8 by 3977.6x = massive overflow
  • q_b_proj.input_scale = 0.00006140, so 1/input_scale = 16287.1 → even worse
  • Embedding values: amax=1.27, std=0.09 — very small values that get multiplied by thousands during quantization
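
The overflow argument can be checked with a few lines of arithmetic. This is a rough sketch under assumptions that are not verified against vLLM's kernel: NVFP4 elements are FP4 E2M1 (max magnitude 6.0), per-block scales are stored as FP8 E4M3 (max finite value 448), and the per-block scale before the FP8 cast is `block_amax * global_scale / 6`. The helper names are made up for illustration.

```python
FP4_E2M1_MAX = 6.0     # largest magnitude representable in FP4 E2M1
FP8_E4M3_MAX = 448.0   # largest finite FP8 E4M3 value (assumed block-scale dtype)

# Checkpoint values quoted in the evidence list above.
CHECKPOINT_INPUT_SCALE = {"q_a_proj": 0.00025141, "q_b_proj": 0.00006140}


def block_scale_before_cast(block_amax, global_scale):
    """Per-block scale that would be stored as FP8 E4M3 (assumed convention)."""
    return block_amax * global_scale / FP4_E2M1_MAX


def max_safe_amax(global_scale):
    """Largest block amax whose scale still fits in E4M3."""
    return FP8_E4M3_MAX * FP4_E2M1_MAX / global_scale


for name, input_scale in CHECKPOINT_INPUT_SCALE.items():
    global_scale = 1.0 / input_scale
    print(f"{name}: global_scale={global_scale:.1f}, "
          f"safe amax <= {max_safe_amax(global_scale):.3f}")
    for amax in (2.0, 8.0):  # activation range reported in these notes
        scale = block_scale_before_cast(amax, global_scale)
        verdict = "OVERFLOW" if scale > FP8_E4M3_MAX else "ok"
        print(f"  amax={amax}: block scale {scale:.1f} -> {verdict}")
```

Under these assumptions the checkpoint scales only leave headroom for activations with amax well below 1, so the observed amax of roughly 2-8 overflows the block-scale range for both projections.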

Next steps

  1. Check debug logs to see which projections were actually dequantized
  2. If fused_wqa_wkv wasn't found, fix the attribute path
  3. If it was found and dequantized, the NaN source is elsewhere (wo_b? wq_b? something else?)
  4. Consider: maybe the NaN is from the KV cache FP8 quantization or the RoPE implementation
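
Steps 3 and 4 amount to finding the first sub-block whose output contains NaN. A minimal, framework-free sketch of that search follows; the stage functions are toy stand-ins, the stage order is illustrative, and `first_nan_stage` is a hypothetical helper. In the real model the same check would more likely sit in a forward hook on each module.

```python
import math


def has_nan(values):
    """True if any float in a flat iterable is NaN."""
    return any(math.isnan(v) for v in values)


def first_nan_stage(stages, x):
    """Run named stages in order; return the first name whose output has NaN."""
    for name, fn in stages:
        x = fn(x)
        if has_nan(x):
            return name
    return None


# Toy stand-ins for the real sub-blocks; only the names come from these notes.
def identity(x):
    return list(x)


def overflowing_quant(x):
    # inf - inf is NaN, mimicking an overflowed scale poisoning the output.
    return [v * float("inf") - float("inf") for v in x]


stages = [
    ("fused_wqa_wkv", identity),
    ("wq_b", overflowing_quant),
    ("rope", identity),
    ("kv_cache_fp8", identity),
    ("wo_a", identity),
    ("wo_b", identity),
]
print(first_nan_stage(stages, [0.5, -1.27]))
```

Because NaN propagates through every later stage, only the first reported name is informative; everything after it will also look broken.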

Docker/Build Notes

  • Build: screen -dmS build bash -c './build_and_run.sh 2>&1 | tee build.log'
  • Currently using --enforce-eager + CLAWMINE_DEBUG=1 for diagnostics
  • Don't hit the API with enforce-eager — JIT spikes crash the container
  • For real testing: use compilation-config {"cudagraph_mode": "NONE", "custom_ops": ["all"]} instead of enforce-eager
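
The JSON value for the compilation config is easy to mangle with shell quoting. The snippet below builds the argument programmatically; the flag spelling and keys are copied from the note above, while the variable names are illustrative.

```python
import json
import shlex

# Keys and values as given in the build notes.
compilation_config = {"cudagraph_mode": "NONE", "custom_ops": ["all"]}

# Compact JSON, then shell-quote it so the braces and quotes survive the shell.
config_json = json.dumps(compilation_config, separators=(",", ":"))
cli_fragment = f"--compilation-config {shlex.quote(config_json)}"
print(cli_fragment)
```

The printed fragment can be appended to the server launch command inside build_and_run.sh or wherever the vLLM arguments are assembled.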