Fix: dequantize ALL attention NVFP4 projections to BF16

Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel
uses checkpoint input_scale for activation quantization, which produces
NaN immediately. Fix: dequantize all attention NVFP4 weights (wq_a,
wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken
input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method.

This trades memory for correctness. Future optimization: add warmup
for attention input_global_scale_inv (same as MoE warmup).
2026-05-18 13:09:36 +00:00
parent a83c332059
commit 334e95047e
2 changed files with 46 additions and 71 deletions


@@ -1,79 +1,55 @@
# Current Bug: vLLM produces empty/garbage output with NaN logits
# Current Bug: vLLM produces NaN from layer 0
**Status:** Active debugging
**Status:** ROOT CAUSE IDENTIFIED
**Date:** 2026-05-18
## Symptom
- vLLM server starts successfully, loads model, captures cudagraph
- Chat completions return `content: ""` with `finish_reason: "length"`
- Raw completions API returns: `Out of range float values are not JSON compliant: nan`
- 50 completion tokens generated but all produce NaN logits
- Model: DeepSeek-V4-Pro-NVFP4 on 8x B200 (TP=8)
- vLLM server starts, loads model, but every inference produces NaN logits
- Diagnostic prints show **NaN from layer 0 onward** — no layer ever produces valid output
## Known Good
- `layertest.py` passes (cosine 0.988 with BF16 reference) — MoE kernel math is correct
- `cudagraph_test.py` passes — no CPU-GPU syncs, capture + replay works
- Model weights load successfully (281K tensors)
- Kernel compiles and runs without CUDA errors
## Root Cause
## Hypotheses
**The attention NVFP4 linear layers produce NaN immediately.**
### H1: Activation global scale (gs) is wrong
- `compute_activation_global_scales` is called during init with **random data** (torch.randn)
- Random data may produce gs values that don't represent real token distributions
- If gs is too small: activation quantization clips, info loss
- If gs is too large: quantization noise dominates
- [ ] **Test:** Run layertest with the exact gs the vLLM init computes, compare vs dynamic gs
- [ ] **Test:** Run runner on real token data outside vLLM, check for NaN/garbage
The attention projections (`q_a_proj`, `q_b_proj`, `kv_proj`, `o_a_proj`, `o_b_proj`) go through vLLM's `FlashInferCutlassNvFp4LinearKernel` which calls:
```python
x_fp4, x_blockscale = scaled_fp4_quant(x, layer.input_global_scale_inv, ...)
```
### H2: Attention layer produces bad hidden states before MoE
- If attention output is NaN/garbage, MoE amplifies it
- The MoE kernel may be fine but receives bad input
- [ ] **Test:** Hook into layer 0 forward, inspect hidden_states before MoE
`input_global_scale_inv` comes from the checkpoint `input_scale` field. For MoE, we override this with a warmup. For attention, there's **no warmup** — it uses the raw checkpoint value.
### H3: Weight loading mismatch between vLLM and test runner
- vLLM loads weights via its own pipeline (DeepseekV4ForCausalLM weight_loader)
- Test scripts load directly from safetensors
- The weight loading patches (model. prefix strip, CKPT_KEY_SUBST) may have bugs
- [ ] **Test:** Compare weights loaded by vLLM vs direct safetensors load
The `CompressedTensorsW4A4Fp4.process_weights_after_loading` sets:
```python
input_global_scale_inv = layer.input_scale.max().to(torch.float32) # = 0.00025141
layer.alpha = input_global_scale * layer.weight_global_scale
```
### H4: Expert routing / topk_ids mismatch
- vLLM uses global expert IDs, runner expects local expert IDs
- If routing is wrong, wrong experts process tokens
- [ ] **Test:** Log topk_ids in vLLM vs test, verify they match expected patterns
For q_a_proj: `input_scale = 0.00025141`, meaning `1/input_scale = 3977.6`. The activation quantization divides by 0.00025141 (multiplies by 3977.6). For typical activations with amax of roughly 2 to 8, the scaled values reach roughly 8,000 to 32,000, far beyond the FP4 range (max 6.0), causing NaN via overflow.
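The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch, not vLLM's quantization code: the variable names are illustrative, and only the `input_scale` value, the amax range, and the FP4 maximum of 6.0 come from the text.

```python
# Worked arithmetic only; this does not call vLLM or FlashInfer.
INPUT_SCALE = 0.00025141   # q_a_proj checkpoint input_scale quoted above
FP4_MAX = 6.0              # largest magnitude representable in E2M1

multiplier = 1.0 / INPUT_SCALE  # roughly 3977.6

# amax values spanning the "typical" 2-8 range from the text
scaled = {amax: amax * multiplier for amax in (2.0, 4.0, 8.0)}

for amax, value in scaled.items():
    print(f"amax={amax:>3}: scaled={value:10.1f}  ({value / FP4_MAX:7.1f}x FP4 max)")
```

Every scaled value overshoots the FP4 maximum by three orders of magnitude, which is consistent with the overflow described above. How the real kernel turns that overshoot into NaN (for example through the block-scale computation) is not modeled here.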
### H5: Residual connection scale issue
- vLLM adds MoE output to residual: `hidden = residual + MoE(hidden)`
- If MoE output scale is wrong, residual connection can amplify error across layers
- [ ] **Test:** Run test_multilayer.py to check error accumulation
## Evidence
### H6: Input_scale from checkpoint is being used somewhere
- MEMORY.md says checkpoint input_scale is wrong
- The code comment says gs default is 1/2688, overridden by warmup
- But maybe finalize_weights sets it to checkpoint input_scale somewhere?
- [ ] **Test:** Verify which gs value is actually used at runtime
1. **MoE kernel is fine**: `test_runner_vllm_style.py` with warmup gs gives cosine 0.988
2. **NaN from layer 0** — diagnostic prints show ALL layers from 0 produce NaN
3. **Attention weights dequantize fine**: `test_attn_weights.py` shows no NaN from dequantized BF16 matmul
4. **The problem is in the NVFP4 activation quantization**, not the weights
### H7: DeepSeek V4 attention / RoPE bug
- The cos_sin_cache fix and float32 patch are applied
- But maybe attention still produces garbage for real token positions
- [ ] **Test:** Single-layer test with real token positions (not random)
## Fix
## Test Plan (ordered by ease and likelihood)
The attention `input_scale` needs the same warmup-based override we did for MoE, OR the `input_scale` values need to be validated/corrected.
1. **Quick: Run layertest.py on B200** — baseline, confirm kernel still works
2. **Standalone runner test with real-ish data** — use runner outside vLLM, check output
3. **Inspect gs values** — print the gs computed by warmup, compare with dynamic gs
4. **Multi-layer accumulation test** — test_multilayer.py
5. **Weight loading comparison** — dump vLLM loaded weights vs direct load
6. **Full pipeline test** — test_pipeline_real_weights.py with 48 experts
7. **Attention output inspection** — check hidden_states before MoE in vLLM
Options:
1. **Add warmup for attention `input_global_scale_inv`** — same pattern as MoE: run a dummy forward, capture actual activation amax, compute correct gs
2. **Dequantize attention weights to BF16** (like compressor weights) — avoids NVFP4 activation quantization entirely, at the cost of more memory
3. **Fix the checkpoint input_scale** — if the values are wrong, re-calibrate
Option 2 is the quickest path — dequantize attention NVFP4 weights to BF16 at load time (the `_dequant_nvfp4_to_bf16` method already exists). This trades memory for correctness.
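To make Option 2 concrete, here is a hedged, pure-Python sketch of what NVFP4 dequantization does to one block of packed weights. It is not the repository's `_dequant_nvfp4_to_bf16`: the function names, the low-nibble-first packing order, and the convention that both scales are multipliers are assumptions for illustration. Only the E2M1 lookup table matches the one in this commit.

```python
# Hypothetical reference dequantizer, NOT the repository's _dequant_nvfp4_to_bf16.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes for codes 0..7


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the LUT."""
    magnitude = E2M1_LUT[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def dequant_nvfp4_block(packed: bytes, block_scale: float, global_scale: float) -> list:
    """Unpack two FP4 values per byte and apply the block and global scales.

    Assumptions: the low nibble holds the earlier element, and both scales
    multiply the decoded value. Real checkpoints may use the reciprocal
    convention for the global scale.
    """
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            out.append(decode_e2m1(nibble) * block_scale * global_scale)
    return out


values = dequant_nvfp4_block(bytes([0x7A, 0x01]), block_scale=0.5, global_scale=1.0)
print(values)
```

A real implementation would work on whole tensors, read one FP8 block scale per 16 elements, and cast the result to BF16. The memory cost follows from the formats: each weight grows from 4 bits plus shared scales to 16 bits.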
## Progress
- [x] Removed NaN check (Dynamo incompatible)
- [x] vLLM container starts and loads model
- [x] Confirmed NaN logits from completions API
- [x] ~~H1: gs is wrong~~ — Warmup gs produces cosine 0.988 with BF16 ref. **gs is NOT the problem** when warmup is used.
- Default gs (1/2688) gives cosine 0.621, but vLLM calls warmup during init
- **BUT:** Does vLLM actually call warmup before every forward, or just once? If gs is computed from random data once and never updated, it may not generalize.
- [ ] **New lead:** MoE kernel is fine, problem is upstream (attention, embeddings, or weight loading in vLLM path)
- [x] MoE kernel: cosine 0.988 with warmup gs, NOT the problem
- [x] NaN starts at layer 0 — attention is the source
- [x] Root cause: attention NVFP4 `input_scale` from checkpoint produces NaN during activation quantization
- [ ] **Next: Fix attention NVFP4 path — dequant to BF16 or add warmup**


@@ -1702,28 +1702,27 @@ class DeepseekV4Model(nn.Module):
def _convert_nvfp4_post_load(self):
"""Post-load conversion of NVFP4 weights for vLLM compatibility.
Only wo_a needs FP8 conversion (attention forward uses fp8_einsum
which requires FP8 inputs). All other NVFP4 weights stay native —
vLLM's FlashInferCutlassNvFp4LinearKernel handles them directly.
All attention NVFP4 projections are dequantized to BF16 because
the checkpoint input_scale values cause NaN during activation
quantization in FlashInferCutlassNvFp4LinearKernel. BF16 bypasses
the broken input_scale entirely.
Compressor weights are reconstructed from checkpoint sub-weights
because the stacking weight_loader corrupts NVFP4 uint8 data.
"""
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
# Only wo_a needs conversion — fp8_einsum requires FP8 weight + scale
fp8_proj_names = {"wo_a"}
fp8_converted = 0
# All attention projections to dequantize to BF16
bf16_proj_names = {"wq_a", "wq_b", "wkv", "wo_a", "wo_b"}
bf16_converted = 0
compressor_converted = 0
_shard_index = self._build_shard_index("/model") if os.path.isdir("/model") else None
from tqdm import tqdm
for layer_idx, layer in tqdm(enumerate(self.layers), total=len(self.layers), desc=" (upcast)NVFP4→FP8 wo_a only", unit="layer"):
for layer_idx, layer in tqdm(enumerate(self.layers), total=len(self.layers), desc=" (upcast)NVFP4→BF16 attn projs", unit="layer"):
attn = layer.attn
# FP8 conversion: only wo_a
for proj_name in fp8_proj_names:
# BF16 dequantization: all attention projections
for proj_name in bf16_proj_names:
if not hasattr(attn, proj_name):
continue
mod = getattr(attn, proj_name)
@@ -1731,8 +1730,8 @@ class DeepseekV4Model(nn.Module):
continue
if mod.weight.dtype in (torch.uint8, torch.int8):
E2M1_LUT = torch.tensor([0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16)
self._convert_nvfp4_to_fp8(mod, E2M1_LUT, FP8_MAX)
fp8_converted += 1
self._dequant_nvfp4_to_bf16(mod, E2M1_LUT)
bf16_converted += 1
# Compressor: still needs BF16 reconstruction
mla_attn = getattr(attn, "mla_attn", None)