Fix: dequantize ALL attention NVFP4 projections to BF16

Root cause of NaN from layer 0: FlashInferCutlassNvFp4LinearKernel uses checkpoint input_scale for activation quantization, which produces NaN immediately.

Fix: dequantize all attention NVFP4 weights (wq_a, wq_b, wkv, wo_a, wo_b) to BF16 at load time, bypassing the broken input_scale entirely. Uses existing _dequant_nvfp4_to_bf16 method. This trades memory for correctness.

Future optimization: add warmup for attention input_global_scale_inv (same as MoE warmup).
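
The overflow behind the root cause can be checked with a few lines of arithmetic. This is a minimal sketch, not code from vLLM or FlashInfer: it assumes the usual NVFP4 layout (E2M1 elements with an FP8 E4M3 block scale) and takes the q_a_proj `input_scale` value quoted in CURRENT_BUG.md. The helper `required_block_scale` and the constant names are hypothetical.

```python
# Illustrative arithmetic only; not taken from vLLM or FlashInfer.
E2M1_MAX = 6.0     # largest magnitude an FP4 (E2M1) element can hold
E4M3_MAX = 448.0   # largest finite FP8 E4M3 value (assumed block-scale dtype)

input_scale = 0.00025141          # checkpoint value reported for q_a_proj
global_mult = 1.0 / input_scale   # factor applied to activations, about 3977.6


def required_block_scale(amax: float, mult: float) -> float:
    """Block scale needed so the largest scaled element lands on E2M1_MAX."""
    return amax * mult / E2M1_MAX


# Largest activation amax whose block scale still fits in E4M3.
max_safe_amax = E2M1_MAX * E4M3_MAX / global_mult

for amax in (2.0, 8.0):
    scale = required_block_scale(amax, global_mult)
    print(f"amax={amax}: block scale {scale:.1f}, fits in E4M3: {scale <= E4M3_MAX}")
print(f"largest amax that fits: {max_safe_amax:.3f}")
```

Under these assumptions, activations with amax in the 2 to 8 range need a block scale well above the E4M3 limit. Whether the kernel then saturates or emits NaN was not verified here; the bug note reports NaN.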
2026-05-18 13:09:36 +00:00
parent a83c332059
commit 334e95047e
2 changed files with 46 additions and 71 deletions
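
For readers unfamiliar with the dequantization step the commit message relies on, here is a hedged sketch of what decoding NVFP4 bytes with an E2M1 lookup table looks like. It is not the body of `_dequant_nvfp4_to_bf16`, which this diff does not show. The nibble order (low nibble first), the single combined `scale` argument, and the function names are assumptions made for illustration; only the lookup table values come from the patch.

```python
# Sketch of E2M1 lookup-table decoding; names and nibble order are assumptions.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # same values as the patch


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 index the table."""
    magnitude = E2M1_LUT[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def dequant_block(packed: bytes, scale: float) -> list[float]:
    """Unpack two FP4 codes per byte and apply one combined scale.

    `scale` stands in for the product of the block scale and any
    per-tensor factor; how those combine depends on the checkpoint format.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF))  # assumed: low nibble first
        values.append(decode_e2m1(byte >> 4))
    return [v * scale for v in values]


print(dequant_block(bytes([0x21, 0xF7]), 0.5))
```

Once the weights are decoded this way and stored as BF16, the matmul no longer needs an activation scale at all, which is why the fix sidesteps `input_scale`.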
--- a/CURRENT_BUG.md
+++ b/CURRENT_BUG.md
@@ -1,79 +1,55 @@
-# Current Bug: vLLM produces empty/garbage output with NaN logits
+# Current Bug: vLLM produces NaN from layer 0

-**Status:** Active debugging
+**Status:** ROOT CAUSE IDENTIFIED
 **Date:** 2026-05-18

 ## Symptom
- vLLM server starts successfully, loads model, captures cudagraph
- Chat completions return `content: ""` with `finish_reason: "length"`
- Raw completions API returns: `Out of range float values are not JSON compliant: nan`
- 50 completion tokens generated but all produce NaN logits
- Model: DeepSeek-V4-Pro-NVFP4 on 8x B200 (TP=8)
+- vLLM server starts, loads model, but every inference produces NaN logits
+- Diagnostic prints show **NaN from layer 0 onward** — no layer ever produces valid output

-## Known Good
- `layertest.py` passes (cosine 0.988 with BF16 reference) — MoE kernel math is correct
- `cudagraph_test.py` passes — no CPU-GPU syncs, capture + replay works
- Model weights load successfully (281K tensors)
- Kernel compiles and runs without CUDA errors
+## Root Cause

-## Hypotheses
+**The attention NVFP4 linear layers produce NaN immediately.**

-### H1: Activation global scale (gs) is wrong
- `compute_activation_global_scales` is called during init with **random data** (torch.randn)
- Random data may produce gs values that don't represent real token distributions
- If gs is too small: activation quantization clips, info loss
- If gs is too large: quantization noise dominates
- [ ] **Test:** Run layertest with the exact gs the vLLM init computes, compare vs dynamic gs
- [ ] **Test:** Run runner on real token data outside vLLM, check for NaN/garbage
+The attention projections (`q_a_proj`, `q_b_proj`, `kv_proj`, `o_a_proj`, `o_b_proj`) go through vLLM's `FlashInferCutlassNvFp4LinearKernel` which calls:
+```python
+x_fp4, x_blockscale = scaled_fp4_quant(x, layer.input_global_scale_inv, ...)
+```

-### H2: Attention layer produces bad hidden states before MoE
- If attention output is NaN/garbage, MoE amplifies it
- The MoE kernel may be fine but receives bad input
- [ ] **Test:** Hook into layer 0 forward, inspect hidden_states before MoE
+`input_global_scale_inv` comes from the checkpoint `input_scale` field. For MoE, we override this with a warmup. For attention, there's **no warmup** — it uses the raw checkpoint value.

-### H3: Weight loading mismatch between vLLM and test runner
- vLLM loads weights via its own pipeline (DeepseekV4ForCausalLM weight_loader)
- Test scripts load directly from safetensors
- The weight loading patches (model. prefix strip, CKPT_KEY_SUBST) may have bugs
- [ ] **Test:** Compare weights loaded by vLLM vs direct safetensors load
+The `CompressedTensorsW4A4Fp4.process_weights_after_loading` sets:
+```python
+input_global_scale_inv = layer.input_scale.max().to(torch.float32)  # = 0.00025141
+layer.alpha = input_global_scale * layer.weight_global_scale
+```

-### H4: Expert routing / topk_ids mismatch
- vLLM uses global expert IDs, runner expects local expert IDs
- If routing is wrong, wrong experts process tokens
- [ ] **Test:** Log topk_ids in vLLM vs test, verify they match expected patterns
+For q_a_proj: `input_scale = 0.00025141`, meaning `1/input_scale = 3977.6`. The activation quantization divides by 0.00025141 (multiplies by 3977.6). For typical activations with amax ~2-8, this produces values far beyond FP4 range (max 6.0), causing NaN via overflow.

-### H5: Residual connection scale issue
- vLLM adds MoE output to residual: `hidden = residual + MoE(hidden)`
- If MoE output scale is wrong, residual connection can amplify error across layers
- [ ] **Test:** Run test_multilayer.py to check error accumulation
+## Evidence

-### H6: Input_scale from checkpoint is being used somewhere
- MEMORY.md says checkpoint input_scale is wrong
- The code comment says gs default is 1/2688, overridden by warmup
- But maybe finalize_weights sets it to checkpoint input_scale somewhere?
- [ ] **Test:** Verify which gs value is actually used at runtime
+1. **MoE kernel is fine** — `test_runner_vllm_style.py` with warmup gs gives cosine 0.988
+2. **NaN from layer 0** — diagnostic prints show ALL layers from 0 produce NaN
+3. **Attention weights dequantize fine** — `test_attn_weights.py` shows no NaN from dequantized BF16 matmul
+4. **The problem is in the NVFP4 activation quantization**, not the weights

-### H7: DeepSeek V4 attention / RoPE bug
- The cos_sin_cache fix and float32 patch are applied
- But maybe attention still produces garbage for real token positions
- [ ] **Test:** Single-layer test with real token positions (not random)
+## Fix

-## Test Plan (ordered by ease and likelihood)
+The attention `input_scale` needs the same warmup-based override we did for MoE, OR the `input_scale` values need to be validated/corrected.

-1. **Quick: Run layertest.py on B200** — baseline, confirm kernel still works
-2. **Standalone runner test with real-ish data** — use runner outside vLLM, check output
-3. **Inspect gs values** — print the gs computed by warmup, compare with dynamic gs
-4. **Multi-layer accumulation test** — test_multilayer.py
-5. **Weight loading comparison** — dump vLLM loaded weights vs direct load
-6. **Full pipeline test** — test_pipeline_real_weights.py with 48 experts
-7. **Attention output inspection** — check hidden_states before MoE in vLLM
+Options:
+1. **Add warmup for attention `input_global_scale_inv`** — same pattern as MoE: run a dummy forward, capture actual activation amax, compute correct gs
+2. **Dequantize attention weights to BF16** (like compressor weights) — avoids NVFP4 activation quantization entirely, at the cost of more memory
+3. **Fix the checkpoint input_scale** — if the values are wrong, re-calibrate
+
+Option 2 is the quickest path — dequantize attention NVFP4 weights to BF16 at load time (the `_dequant_nvfp4_to_bf16` method already exists). This trades memory for correctness.

 ## Progress

 - [x] Removed NaN check (Dynamo incompatible)
 - [x] vLLM container starts and loads model
 - [x] Confirmed NaN logits from completions API
- [x] ~~H1: gs is wrong~~ — Warmup gs produces cosine 0.988 with BF16 ref. **gs is NOT the problem** when warmup is used.
-  - Default gs (1/2688) gives cosine 0.621, but vLLM calls warmup during init
-  - **BUT:** Does vLLM actually call warmup before every forward, or just once? If gs is computed from random data once and never updated, it may not generalize.
- [ ] **New lead:** MoE kernel is fine, problem is upstream (attention, embeddings, or weight loading in vLLM path)
+- [x] MoE kernel: cosine 0.988 with warmup gs — NOT the problem
+- [x] NaN starts at layer 0 — attention is the source
+- [x] Root cause: attention NVFP4 `input_scale` from checkpoint produces NaN during activation quantization
+- [ ] **Next: Fix attention NVFP4 path — dequant to BF16 or add warmup**
--- a/vllm/patches/deepseek_v4.py
+++ b/vllm/patches/deepseek_v4.py
@@ -1702,28 +1702,27 @@ class DeepseekV4Model(nn.Module):
    def _convert_nvfp4_post_load(self):
        """Post-load conversion of NVFP4 weights for vLLM compatibility.
        
-        Only wo_a needs FP8 conversion (attention forward uses fp8_einsum
-        which requires FP8 inputs). All other NVFP4 weights stay native —
-        vLLM's FlashInferCutlassNvFp4LinearKernel handles them directly.
+        All attention NVFP4 projections are dequantized to BF16 because
+        the checkpoint input_scale values cause NaN during activation
+        quantization in FlashInferCutlassNvFp4LinearKernel. BF16 bypasses
+        the broken input_scale entirely.
        
        Compressor weights are reconstructed from checkpoint sub-weights
        because the stacking weight_loader corrupts NVFP4 uint8 data.
        """
-        FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
-        
-        # Only wo_a needs conversion — fp8_einsum requires FP8 weight + scale
-        fp8_proj_names = {"wo_a"}
-        fp8_converted = 0
+        # All attention projections to dequantize to BF16
+        bf16_proj_names = {"wq_a", "wq_b", "wkv", "wo_a", "wo_b"}
+        bf16_converted = 0
        compressor_converted = 0

        _shard_index = self._build_shard_index("/model") if os.path.isdir("/model") else None

        from tqdm import tqdm
-        for layer_idx, layer in tqdm(enumerate(self.layers), total=len(self.layers), desc="  (upcast)NVFP4→FP8 wo_a only", unit="layer"):
+        for layer_idx, layer in tqdm(enumerate(self.layers), total=len(self.layers), desc="  (upcast)NVFP4→BF16 attn projs", unit="layer"):
            attn = layer.attn
            
-            # FP8 conversion: only wo_a
-            for proj_name in fp8_proj_names:
+            # BF16 dequantization: all attention projections
+            for proj_name in bf16_proj_names:
                if not hasattr(attn, proj_name):
                    continue
                mod = getattr(attn, proj_name)
@@ -1731,8 +1730,8 @@ class DeepseekV4Model(nn.Module):
                    continue
                if mod.weight.dtype in (torch.uint8, torch.int8):
                    E2M1_LUT = torch.tensor([0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16)
-                    self._convert_nvfp4_to_fp8(mod, E2M1_LUT, FP8_MAX)
-                    fp8_converted += 1
+                    self._dequant_nvfp4_to_bf16(mod, E2M1_LUT)
+                    bf16_converted += 1
            
            # Compressor: still needs BF16 reconstruction
            mla_attn = getattr(attn, "mla_attn", None)