Fix: dequantize ALL attention NVFP4 projections to BF16
Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel uses the checkpoint input_scale for activation quantization, which produces NaN immediately.

Fix: dequantize all attention NVFP4 weights (wq_a, wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken input_scale entirely. Uses the existing _dequant_nvfp4_to_bf16 method. This trades memory for correctness.

Future optimization: add warmup for attention input_global_scale_inv (same as the MoE warmup).
@@ -1,79 +1,55 @@
|
||||
# Current Bug: vLLM produces empty/garbage output with NaN logits
|
||||
# Current Bug: vLLM produces NaN from layer 0
|
||||
|
||||
**Status:** Active debugging
|
||||
**Status:** ROOT CAUSE IDENTIFIED
|
||||
**Date:** 2026-05-18
|
||||
|
||||
## Symptom
|
||||
- vLLM server starts successfully, loads model, captures cudagraph
|
||||
- Chat completions return `content: ""` with `finish_reason: "length"`
|
||||
- Raw completions API returns: `Out of range float values are not JSON compliant: nan`
|
||||
- 50 completion tokens generated but all produce NaN logits
|
||||
- Model: DeepSeek-V4-Pro-NVFP4 on 8x B200 (TP=8)
|
||||
- vLLM server starts, loads model, but every inference produces NaN logits
|
||||
- Diagnostic prints show **NaN from layer 0 onward** — no layer ever produces valid output
|
||||
|
||||
## Known Good
|
||||
- `layertest.py` passes (cosine 0.988 with BF16 reference) — MoE kernel math is correct
|
||||
- `cudagraph_test.py` passes — no CPU-GPU syncs, capture + replay works
|
||||
- Model weights load successfully (281K tensors)
|
||||
- Kernel compiles and runs without CUDA errors
|
||||
## Root Cause
|
||||
|
||||
## Hypotheses
|
||||
**The attention NVFP4 linear layers produce NaN immediately.**
|
||||
|
||||
### H1: Activation global scale (gs) is wrong
|
||||
- `compute_activation_global_scales` is called during init with **random data** (torch.randn)
|
||||
- Random data may produce gs values that don't represent real token distributions
|
||||
- If gs is too small: activation quantization clips, info loss
|
||||
- If gs is too large: quantization noise dominates
|
||||
- [ ] **Test:** Run layertest with the exact gs the vLLM init computes, compare vs dynamic gs
|
||||
- [ ] **Test:** Run runner on real token data outside vLLM, check for NaN/garbage
|
||||
The attention projections (`q_a_proj`, `q_b_proj`, `kv_proj`, `o_a_proj`, `o_b_proj`) go through vLLM's `FlashInferCutlassNvFp4LinearKernel` which calls:
|
||||
```python
x_fp4, x_blockscale = scaled_fp4_quant(x, layer.input_global_scale_inv, ...)
```
|
||||
|
||||
### H2: Attention layer produces bad hidden states before MoE
|
||||
- If attention output is NaN/garbage, MoE amplifies it
|
||||
- The MoE kernel may be fine but receives bad input
|
||||
- [ ] **Test:** Hook into layer 0 forward, inspect hidden_states before MoE
|
||||
`input_global_scale_inv` comes from the checkpoint `input_scale` field. For MoE, we override this with a warmup. For attention, there's **no warmup** — it uses the raw checkpoint value.
|
||||
|
||||
### H3: Weight loading mismatch between vLLM and test runner
|
||||
- vLLM loads weights via its own pipeline (DeepseekV4ForCausalLM weight_loader)
|
||||
- Test scripts load directly from safetensors
|
||||
- The weight loading patches (model. prefix strip, CKPT_KEY_SUBST) may have bugs
|
||||
- [ ] **Test:** Compare weights loaded by vLLM vs direct safetensors load
|
||||
The `CompressedTensorsW4A4Fp4.process_weights_after_loading` sets:
|
||||
```python
input_global_scale_inv = layer.input_scale.max().to(torch.float32)  # = 0.00025141
layer.alpha = input_global_scale * layer.weight_global_scale
```
|
||||
|
||||
### H4: Expert routing / topk_ids mismatch
|
||||
- vLLM uses global expert IDs, runner expects local expert IDs
|
||||
- If routing is wrong, wrong experts process tokens
|
||||
- [ ] **Test:** Log topk_ids in vLLM vs test, verify they match expected patterns
|
||||
For `q_a_proj`, `input_scale = 0.00025141`, so `1/input_scale ≈ 3977.6`. Activation quantization divides by 0.00025141, i.e. multiplies by about 3977.6. For typical activations with amax of roughly 2 to 8, the scaled values land at roughly 8,000 to 32,000, far beyond the FP4 range (max 6.0), and the overflow produces NaN.
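The arithmetic above can be checked with a few lines of Python. This is only a sketch: the amax values are the illustrative 2 and 8 from the text, and the composite limit `6 * 448 = 2688` assumes the usual NVFP4 layout of FP4 E2M1 values with an FP8 E4M3 block scale. The constant and function names are hypothetical and are not taken from vLLM.

```python
FP4_E2M1_MAX = 6.0     # largest magnitude an E2M1 (FP4) value can hold
FP8_E4M3_MAX = 448.0   # largest finite FP8 E4M3 value (assumed block-scale dtype)
NVFP4_RANGE = FP4_E2M1_MAX * FP8_E4M3_MAX  # 2688.0

CHECKPOINT_INPUT_SCALE = 0.00025141  # q_a_proj value quoted in the text


def scaled_amax(amax: float, input_scale: float) -> float:
    """Largest activation magnitude after dividing by input_scale."""
    return amax / input_scale


def overflows(amax: float, input_scale: float) -> bool:
    """True if the scaled amax exceeds what FP4 plus a block scale can cover."""
    return scaled_amax(amax, input_scale) > NVFP4_RANGE


if __name__ == "__main__":
    for amax in (2.0, 8.0):
        scaled = scaled_amax(amax, CHECKPOINT_INPUT_SCALE)
        flag = overflows(amax, CHECKPOINT_INPUT_SCALE)
        print(f"amax={amax}: scaled={scaled:.1f}, overflows={flag}")
```

Under these assumptions, both ends of the typical amax range exceed not only the bare FP4 maximum of 6.0 but also the 2688 that a maximal block scale could absorb.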
|
||||
|
||||
### H5: Residual connection scale issue
|
||||
- vLLM adds MoE output to residual: `hidden = residual + MoE(hidden)`
|
||||
- If MoE output scale is wrong, residual connection can amplify error across layers
|
||||
- [ ] **Test:** Run test_multilayer.py to check error accumulation
|
||||
## Evidence
|
||||
|
||||
### H6: Input_scale from checkpoint is being used somewhere
|
||||
- MEMORY.md says checkpoint input_scale is wrong
|
||||
- The code comment says gs default is 1/2688, overridden by warmup
|
||||
- But maybe finalize_weights sets it to checkpoint input_scale somewhere?
|
||||
- [ ] **Test:** Verify which gs value is actually used at runtime
|
||||
1. **MoE kernel is fine** — `test_runner_vllm_style.py` with warmup gs gives cosine 0.988
|
||||
2. **NaN from layer 0** — diagnostic prints show ALL layers from 0 produce NaN
|
||||
3. **Attention weights dequantize fine** — `test_attn_weights.py` shows no NaN from dequantized BF16 matmul
|
||||
4. **The problem is in the NVFP4 activation quantization**, not the weights
|
||||
|
||||
### H7: DeepSeek V4 attention / RoPE bug
|
||||
- The cos_sin_cache fix and float32 patch are applied
|
||||
- But maybe attention still produces garbage for real token positions
|
||||
- [ ] **Test:** Single-layer test with real token positions (not random)
|
||||
## Fix
|
||||
|
||||
## Test Plan (ordered by ease and likelihood)
|
||||
The attention `input_scale` needs the same warmup-based override we applied to MoE, or the checkpoint `input_scale` values need to be validated and corrected.
|
||||
|
||||
1. **Quick: Run layertest.py on B200** — baseline, confirm kernel still works
|
||||
2. **Standalone runner test with real-ish data** — use runner outside vLLM, check output
|
||||
3. **Inspect gs values** — print the gs computed by warmup, compare with dynamic gs
|
||||
4. **Multi-layer accumulation test** — test_multilayer.py
|
||||
5. **Weight loading comparison** — dump vLLM loaded weights vs direct load
|
||||
6. **Full pipeline test** — test_pipeline_real_weights.py with 48 experts
|
||||
7. **Attention output inspection** — check hidden_states before MoE in vLLM
|
||||
Options:
|
||||
1. **Add warmup for attention `input_global_scale_inv`** — same pattern as MoE: run a dummy forward, capture actual activation amax, compute correct gs
|
||||
2. **Dequantize attention weights to BF16** (like compressor weights) — avoids NVFP4 activation quantization entirely, at the cost of more memory
|
||||
3. **Fix the checkpoint input_scale** — if the values are wrong, re-calibrate
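Option 1 could look roughly like the sketch below: record the absolute maximum seen across warmup batches, then derive the scale so that this amax maps onto the top of the representable range. The helper names, the plain-list inputs, and the `2688 / amax` convention are assumptions for illustration. The actual MoE warmup code is not shown here and may differ.

```python
NVFP4_RANGE = 6.0 * 448.0  # FP4 E2M1 max times FP8 E4M3 max (assumed layout)


def observed_amax(batches):
    """Largest absolute activation value across all warmup batches."""
    return max(abs(value) for batch in batches for value in batch)


def warmup_scales(batches):
    """Return (quant_multiplier, dequant_scale) derived from warmup data.

    quant_multiplier maps the observed amax to NVFP4_RANGE;
    dequant_scale is its reciprocal, used to undo the scaling.
    """
    amax = observed_amax(batches)
    if amax == 0.0:
        raise ValueError("warmup activations are all zero; cannot derive a scale")
    quant_multiplier = NVFP4_RANGE / amax
    return quant_multiplier, 1.0 / quant_multiplier


if __name__ == "__main__":
    # Stand-in for activations captured during a dummy forward pass.
    batches = [[0.5, -2.0], [1.0, 4.0]]
    multiplier, inverse = warmup_scales(batches)
    print(multiplier, inverse)
```

With an observed amax of 4.0 this sketch yields a multiplier of 672, compared with the roughly 3977.6 implied by the checkpoint value. As noted under H1, a scale computed from random warmup data may not match real token distributions.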
|
||||
|
||||
Option 2 is the quickest path — dequantize attention NVFP4 weights to BF16 at load time (the `_dequant_nvfp4_to_bf16` method already exists). This trades memory for correctness.
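For orientation, NVFP4 weight dequantization can be sketched in pure Python as below. This is not the body of `_dequant_nvfp4_to_bf16`, which is not shown in this commit. The lookup table matches the `E2M1_LUT` in the diff, while the nibble order (low nibble first), the sign in bit 3, and the way the block and global scales are combined are assumptions. All function names are hypothetical.

```python
# Magnitudes of the eight E2M1 codes, same values as E2M1_LUT in the diff.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 index the table."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_LUT[nibble & 0x7]


def dequant_block(packed: bytes, block_scale: float, global_scale: float):
    """Unpack two FP4 values per byte and apply both scales.

    Assumes the low nibble holds the first element of each pair.
    """
    scale = block_scale * global_scale
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0x0F) * scale)
        out.append(decode_e2m1((byte >> 4) & 0x0F) * scale)
    return out


if __name__ == "__main__":
    print(dequant_block(bytes([0x21, 0xF7]), block_scale=2.0, global_scale=0.5))
```

Because the result is stored as BF16, the matmul consumes activations without quantizing them, which is why `input_scale` drops out of the attention path.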
|
||||
|
||||
## Progress
|
||||
|
||||
- [x] Removed NaN check (Dynamo incompatible)
|
||||
- [x] vLLM container starts and loads model
|
||||
- [x] Confirmed NaN logits from completions API
|
||||
- [x] ~~H1: gs is wrong~~ — Warmup gs produces cosine 0.988 with BF16 ref. **gs is NOT the problem** when warmup is used.
|
||||
- Default gs (1/2688) gives cosine 0.621, but vLLM calls warmup during init
|
||||
- **BUT:** Does vLLM actually call warmup before every forward, or just once? If gs is computed from random data once and never updated, it may not generalize.
|
||||
- [ ] **New lead:** MoE kernel is fine, problem is upstream (attention, embeddings, or weight loading in vLLM path)
|
||||
- [x] MoE kernel: cosine 0.988 with warmup gs — NOT the problem
|
||||
- [x] NaN starts at layer 0 — attention is the source
|
||||
- [x] Root cause: attention NVFP4 `input_scale` from checkpoint produces NaN during activation quantization
|
||||
- [ ] **Next: Fix attention NVFP4 path — dequant to BF16 or add warmup**
|
||||
|
||||
@@ -1702,28 +1702,27 @@ class DeepseekV4Model(nn.Module):
|
||||
def _convert_nvfp4_post_load(self):
|
||||
"""Post-load conversion of NVFP4 weights for vLLM compatibility.
|
||||
|
||||
Only wo_a needs FP8 conversion (attention forward uses fp8_einsum
|
||||
which requires FP8 inputs). All other NVFP4 weights stay native —
|
||||
vLLM's FlashInferCutlassNvFp4LinearKernel handles them directly.
|
||||
All attention NVFP4 projections are dequantized to BF16 because
|
||||
the checkpoint input_scale values cause NaN during activation
|
||||
quantization in FlashInferCutlassNvFp4LinearKernel. BF16 bypasses
|
||||
the broken input_scale entirely.
|
||||
|
||||
Compressor weights are reconstructed from checkpoint sub-weights
|
||||
because the stacking weight_loader corrupts NVFP4 uint8 data.
|
||||
"""
|
||||
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
|
||||
|
||||
# Only wo_a needs conversion — fp8_einsum requires FP8 weight + scale
|
||||
fp8_proj_names = {"wo_a"}
|
||||
fp8_converted = 0
|
||||
# All attention projections to dequantize to BF16
|
||||
bf16_proj_names = {"wq_a", "wq_b", "wkv", "wo_a", "wo_b"}
|
||||
bf16_converted = 0
|
||||
compressor_converted = 0
|
||||
|
||||
_shard_index = self._build_shard_index("/model") if os.path.isdir("/model") else None
|
||||
|
||||
from tqdm import tqdm
|
||||
for layer_idx, layer in tqdm(enumerate(self.layers), total=len(self.layers), desc=" (upcast)NVFP4→FP8 wo_a only", unit="layer"):
|
||||
for layer_idx, layer in tqdm(enumerate(self.layers), total=len(self.layers), desc=" (upcast)NVFP4→BF16 attn projs", unit="layer"):
|
||||
attn = layer.attn
|
||||
|
||||
# FP8 conversion: only wo_a
|
||||
for proj_name in fp8_proj_names:
|
||||
# BF16 dequantization: all attention projections
|
||||
for proj_name in bf16_proj_names:
|
||||
if not hasattr(attn, proj_name):
|
||||
continue
|
||||
mod = getattr(attn, proj_name)
|
||||
@@ -1731,8 +1730,8 @@ class DeepseekV4Model(nn.Module):
|
||||
continue
|
||||
if mod.weight.dtype in (torch.uint8, torch.int8):
|
||||
E2M1_LUT = torch.tensor([0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16)
|
||||
self._convert_nvfp4_to_fp8(mod, E2M1_LUT, FP8_MAX)
|
||||
fp8_converted += 1
|
||||
self._dequant_nvfp4_to_bf16(mod, E2M1_LUT)
|
||||
bf16_converted += 1
|
||||
|
||||
# Compressor: still needs BF16 reconstruction
|
||||
mla_attn = getattr(attn, "mla_attn", None)
|
||||
|
||||