diff --git a/CURRENT_BUG.md b/CURRENT_BUG.md
index 815e687d..eb1bd914 100644
--- a/CURRENT_BUG.md
+++ b/CURRENT_BUG.md
@@ -1,79 +1,55 @@
-# Current Bug: vLLM produces empty/garbage output with NaN logits
+# Current Bug: vLLM produces NaN from layer 0

-**Status:** Active debugging
+**Status:** ROOT CAUSE IDENTIFIED
 **Date:** 2026-05-18

 ## Symptom
-- vLLM server starts successfully, loads model, captures cudagraph
-- Chat completions return `content: ""` with `finish_reason: "length"`
-- Raw completions API returns: `Out of range float values are not JSON compliant: nan`
-- 50 completion tokens generated but all produce NaN logits
-- Model: DeepSeek-V4-Pro-NVFP4 on 8x B200 (TP=8)
+- vLLM server starts, loads model, but every inference produces NaN logits
+- Diagnostic prints show **NaN from layer 0 onward** — no layer ever produces valid output

-## Known Good
-- `layertest.py` passes (cosine 0.988 with BF16 reference) — MoE kernel math is correct
-- `cudagraph_test.py` passes — no CPU-GPU syncs, capture + replay works
-- Model weights load successfully (281K tensors)
-- Kernel compiles and runs without CUDA errors
+## Root Cause

-## Hypotheses
+**The attention NVFP4 linear layers produce NaN immediately.**

-### H1: Activation global scale (gs) is wrong
-- `compute_activation_global_scales` is called during init with **random data** (torch.randn)
-- Random data may produce gs values that don't represent real token distributions
-- If gs is too small: activation quantization clips, info loss
-- If gs is too large: quantization noise dominates
-- [ ] **Test:** Run layertest with the exact gs the vLLM init computes, compare vs dynamic gs
-- [ ] **Test:** Run runner on real token data outside vLLM, check for NaN/garbage
+The attention projections (`q_a_proj`, `q_b_proj`, `kv_proj`, `o_a_proj`, `o_b_proj`) go through vLLM's `FlashInferCutlassNvFp4LinearKernel` which calls:
+```python
+x_fp4, x_blockscale = scaled_fp4_quant(x, layer.input_global_scale_inv, ...)
+```

-### H2: Attention layer produces bad hidden states before MoE
-- If attention output is NaN/garbage, MoE amplifies it
-- The MoE kernel may be fine but receives bad input
-- [ ] **Test:** Hook into layer 0 forward, inspect hidden_states before MoE
+`input_global_scale_inv` comes from the checkpoint `input_scale` field. For MoE, we override this with a warmup. For attention, there's **no warmup** — it uses the raw checkpoint value.

-### H3: Weight loading mismatch between vLLM and test runner
-- vLLM loads weights via its own pipeline (DeepseekV4ForCausalLM weight_loader)
-- Test scripts load directly from safetensors
-- The weight loading patches (model. prefix strip, CKPT_KEY_SUBST) may have bugs
-- [ ] **Test:** Compare weights loaded by vLLM vs direct safetensors load
+The `CompressedTensorsW4A4Fp4.process_weights_after_loading` sets:
+```python
+input_global_scale_inv = layer.input_scale.max().to(torch.float32)  # = 0.00025141
+layer.alpha = input_global_scale * layer.weight_global_scale
+```

-### H4: Expert routing / topk_ids mismatch
-- vLLM uses global expert IDs, runner expects local expert IDs
-- If routing is wrong, wrong experts process tokens
-- [ ] **Test:** Log topk_ids in vLLM vs test, verify they match expected patterns
+For q_a_proj: `input_scale = 0.00025141`, meaning `1/input_scale = 3977.6`. The activation quantization divides by 0.00025141 (multiplies by 3977.6). For typical activations with amax ~2-8, this produces values far beyond FP4 range (max 6.0), causing NaN via overflow.

-### H5: Residual connection scale issue
-- vLLM adds MoE output to residual: `hidden = residual + MoE(hidden)`
-- If MoE output scale is wrong, residual connection can amplify error across layers
-- [ ] **Test:** Run test_multilayer.py to check error accumulation
+## Evidence

-### H6: Input_scale from checkpoint is being used somewhere
-- MEMORY.md says checkpoint input_scale is wrong
-- The code comment says gs default is 1/2688, overridden by warmup
-- But maybe finalize_weights sets it to checkpoint input_scale somewhere?
-- [ ] **Test:** Verify which gs value is actually used at runtime
+1. **MoE kernel is fine** — `test_runner_vllm_style.py` with warmup gs gives cosine 0.988
+2. **NaN from layer 0** — diagnostic prints show ALL layers from 0 produce NaN
+3. **Attention weights dequantize fine** — `test_attn_weights.py` shows no NaN from dequantized BF16 matmul
+4. **The problem is in the NVFP4 activation quantization**, not the weights

-### H7: DeepSeek V4 attention / RoPE bug
-- The cos_sin_cache fix and float32 patch are applied
-- But maybe attention still produces garbage for real token positions
-- [ ] **Test:** Single-layer test with real token positions (not random)
+## Fix

-## Test Plan (ordered by ease and likelihood)
+The attention `input_scale` needs the same warmup-based override we did for MoE, OR the `input_scale` values need to be validated/corrected.

-1. **Quick: Run layertest.py on B200** — baseline, confirm kernel still works
-2. **Standalone runner test with real-ish data** — use runner outside vLLM, check output
-3. **Inspect gs values** — print the gs computed by warmup, compare with dynamic gs
-4. **Multi-layer accumulation test** — test_multilayer.py
-5. **Weight loading comparison** — dump vLLM loaded weights vs direct load
-6. **Full pipeline test** — test_pipeline_real_weights.py with 48 experts
-7. **Attention output inspection** — check hidden_states before MoE in vLLM
+Options:
+1. **Add warmup for attention `input_global_scale_inv`** — same pattern as MoE: run a dummy forward, capture actual activation amax, compute correct gs
+2. **Dequantize attention weights to BF16** (like compressor weights) — avoids NVFP4 activation quantization entirely, at the cost of more memory
+3. **Fix the checkpoint input_scale** — if the values are wrong, re-calibrate
+
+Option 2 is the quickest path — dequantize attention NVFP4 weights to BF16 at load time (the `_dequant_nvfp4_to_bf16` method already exists). This trades memory for correctness.

 ## Progress
 - [x] Removed NaN check (Dynamo incompatible)
 - [x] vLLM container starts and loads model
 - [x] Confirmed NaN logits from completions API
-- [x] ~~H1: gs is wrong~~ — Warmup gs produces cosine 0.988 with BF16 ref. **gs is NOT the problem** when warmup is used.
-  - Default gs (1/2688) gives cosine 0.621, but vLLM calls warmup during init
-  - **BUT:** Does vLLM actually call warmup before every forward, or just once? If gs is computed from random data once and never updated, it may not generalize.
-- [ ] **New lead:** MoE kernel is fine, problem is upstream (attention, embeddings, or weight loading in vLLM path)
+- [x] MoE kernel: cosine 0.988 with warmup gs — NOT the problem
+- [x] NaN starts at layer 0 — attention is the source
+- [x] Root cause: attention NVFP4 `input_scale` from checkpoint produces NaN during activation quantization
+- [ ] **Next: Fix attention NVFP4 path — dequant to BF16 or add warmup**
diff --git a/vllm/patches/deepseek_v4.py b/vllm/patches/deepseek_v4.py
index 516fd6e6..2e88e35a 100644
--- a/vllm/patches/deepseek_v4.py
+++ b/vllm/patches/deepseek_v4.py
@@ -1702,28 +1702,27 @@ class DeepseekV4Model(nn.Module):
     def _convert_nvfp4_post_load(self):
         """Post-load conversion of NVFP4 weights for vLLM compatibility.

-        Only wo_a needs FP8 conversion (attention forward uses fp8_einsum
-        which requires FP8 inputs). All other NVFP4 weights stay native —
-        vLLM's FlashInferCutlassNvFp4LinearKernel handles them directly.
+        All attention NVFP4 projections are dequantized to BF16 because
+        the checkpoint input_scale values cause NaN during activation
+        quantization in FlashInferCutlassNvFp4LinearKernel. BF16 bypasses
+        the broken input_scale entirely.

         Compressor weights are reconstructed from checkpoint sub-weights
         because the stacking weight_loader corrupts NVFP4 uint8 data.
         """
-        FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
-
-        # Only wo_a needs conversion — fp8_einsum requires FP8 weight + scale
-        fp8_proj_names = {"wo_a"}
-        fp8_converted = 0
+        # All attention projections to dequantize to BF16
+        bf16_proj_names = {"wq_a", "wq_b", "wkv", "wo_a", "wo_b"}
+        bf16_converted = 0
         compressor_converted = 0

         _shard_index = self._build_shard_index("/model") if os.path.isdir("/model") else None

         from tqdm import tqdm
-        for layer_idx, layer in tqdm(enumerate(self.layers), total=len(self.layers), desc=" (upcast)NVFP4→FP8 wo_a only", unit="layer"):
+        for layer_idx, layer in tqdm(enumerate(self.layers), total=len(self.layers), desc=" (upcast)NVFP4→BF16 attn projs", unit="layer"):
             attn = layer.attn

-            # FP8 conversion: only wo_a
-            for proj_name in fp8_proj_names:
+            # BF16 dequantization: all attention projections
+            for proj_name in bf16_proj_names:
                 if not hasattr(attn, proj_name):
                     continue
                 mod = getattr(attn, proj_name)
@@ -1731,8 +1730,8 @@ class DeepseekV4Model(nn.Module):
                     continue
                 if mod.weight.dtype in (torch.uint8, torch.int8):
                     E2M1_LUT = torch.tensor([0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16)
-                    self._convert_nvfp4_to_fp8(mod, E2M1_LUT, FP8_MAX)
-                    fp8_converted += 1
+                    self._dequant_nvfp4_to_bf16(mod, E2M1_LUT)
+                    bf16_converted += 1

             # Compressor: still needs BF16 reconstruction
             mla_attn = getattr(attn, "mla_attn", None)