diff --git a/CURRENT_BUG.md b/CURRENT_BUG.md
index 815e687d..eb1bd914 100644
--- a/CURRENT_BUG.md
+++ b/CURRENT_BUG.md
@@ -1,79 +1,55 @@
-# Current Bug: vLLM produces empty/garbage output with NaN logits
+# Current Bug: vLLM produces NaN from layer 0
 
-**Status:** Active debugging
+**Status:** ROOT CAUSE IDENTIFIED
 **Date:** 2026-05-18
 
 ## Symptom
-- vLLM server starts successfully, loads model, captures cudagraph
-- Chat completions return `content: ""` with `finish_reason: "length"`
-- Raw completions API returns: `Out of range float values are not JSON compliant: nan`
-- 50 completion tokens generated but all produce NaN logits
-- Model: DeepSeek-V4-Pro-NVFP4 on 8x B200 (TP=8)
+- vLLM server starts, loads model, but every inference produces NaN logits
+- Diagnostic prints show **NaN from layer 0 onward** — no layer ever produces valid output
 
-## Known Good
-- `layertest.py` passes (cosine 0.988 with BF16 reference) — MoE kernel math is correct
-- `cudagraph_test.py` passes — no CPU-GPU syncs, capture + replay works
-- Model weights load successfully (281K tensors)
-- Kernel compiles and runs without CUDA errors
+## Root Cause
 
-## Hypotheses
+**The attention NVFP4 linear layers produce NaN immediately.**
 
-### H1: Activation global scale (gs) is wrong
-- `compute_activation_global_scales` is called during init with **random data** (torch.randn)
-- Random data may produce gs values that don't represent real token distributions
-- If gs is too small: activation quantization clips, info loss
-- If gs is too large: quantization noise dominates
-- [ ] **Test:** Run layertest with the exact gs the vLLM init computes, compare vs dynamic gs
-- [ ] **Test:** Run runner on real token data outside vLLM, check for NaN/garbage
+The attention projections (`q_a_proj`, `q_b_proj`, `kv_proj`, `o_a_proj`, `o_b_proj`) go through vLLM's `FlashInferCutlassNvFp4LinearKernel`, which calls:
+```python
+x_fp4, x_blockscale = scaled_fp4_quant(x, layer.input_global_scale_inv, ...)
+```
 
-### H2: Attention layer produces bad hidden states before MoE
-- If attention output is NaN/garbage, MoE amplifies it
-- The MoE kernel may be fine but receives bad input
-- [ ] **Test:** Hook into layer 0 forward, inspect hidden_states before MoE
+`input_global_scale_inv` comes from the checkpoint `input_scale` field. For MoE, we override this with a warmup. For attention, there's **no warmup** — it uses the raw checkpoint value.
 
-### H3: Weight loading mismatch between vLLM and test runner
-- vLLM loads weights via its own pipeline (DeepseekV4ForCausalLM weight_loader)
-- Test scripts load directly from safetensors
-- The weight loading patches (model. prefix strip, CKPT_KEY_SUBST) may have bugs
-- [ ] **Test:** Compare weights loaded by vLLM vs direct safetensors load
+`CompressedTensorsW4A4Fp4.process_weights_after_loading` sets:
+```python
+input_global_scale_inv = layer.input_scale.max().to(torch.float32)  # = 0.00025141
+layer.alpha = input_global_scale * layer.weight_global_scale
+```
 
-### H4: Expert routing / topk_ids mismatch
-- vLLM uses global expert IDs, runner expects local expert IDs
-- If routing is wrong, wrong experts process tokens
-- [ ] **Test:** Log topk_ids in vLLM vs test, verify they match expected patterns
+For q_a_proj: `input_scale = 0.00025141`, so `1/input_scale ≈ 3977.6`. The activation quantization divides by 0.00025141 (equivalently, multiplies by about 3977.6). For typical activations with amax ~2-8, the scaled values reach roughly 8,000 to 32,000, far beyond the FP4 range (max 6.0), causing NaN via overflow.
 
-### H5: Residual connection scale issue
-- vLLM adds MoE output to residual: `hidden = residual + MoE(hidden)`
-- If MoE output scale is wrong, residual connection can amplify error across layers
-- [ ] **Test:** Run test_multilayer.py to check error accumulation
+## Evidence
 
-### H6: Input_scale from checkpoint is being used somewhere
-- MEMORY.md says checkpoint input_scale is wrong
-- The code comment says gs default is 1/2688, overridden by warmup
-- But maybe finalize_weights sets it to checkpoint input_scale somewhere?
-- [ ] **Test:** Verify which gs value is actually used at runtime
+1. **MoE kernel is fine** — `test_runner_vllm_style.py` with warmup gs gives cosine 0.988
+2. **NaN from layer 0** — diagnostic prints show ALL layers from 0 produce NaN
+3. **Attention weights dequantize fine** — `test_attn_weights.py` shows no NaN from dequantized BF16 matmul
+4. **The problem is in the NVFP4 activation quantization**, not the weights
 
-### H7: DeepSeek V4 attention / RoPE bug
-- The cos_sin_cache fix and float32 patch are applied
-- But maybe attention still produces garbage for real token positions
-- [ ] **Test:** Single-layer test with real token positions (not random)
+## Fix
 
-## Test Plan (ordered by ease and likelihood)
+The attention `input_scale` needs the same warmup-based override we applied to MoE, or the checkpoint `input_scale` values need to be validated and corrected.
 
-1. **Quick: Run layertest.py on B200** — baseline, confirm kernel still works
-2. **Standalone runner test with real-ish data** — use runner outside vLLM, check output
-3. **Inspect gs values** — print the gs computed by warmup, compare with dynamic gs
-4. **Multi-layer accumulation test** — test_multilayer.py
-5. **Weight loading comparison** — dump vLLM loaded weights vs direct load
-6. **Full pipeline test** — test_pipeline_real_weights.py with 48 experts
-7. **Attention output inspection** — check hidden_states before MoE in vLLM
+Options:
+1. **Add warmup for attention `input_global_scale_inv`** — same pattern as MoE: run a dummy forward, capture actual activation amax, compute correct gs
+2. **Dequantize attention weights to BF16** (like compressor weights) — avoids NVFP4 activation quantization entirely, at the cost of more memory
+3. **Fix the checkpoint input_scale** — if the values are wrong, re-calibrate
+
+Option 2 is the quickest path — dequantize attention NVFP4 weights to BF16 at load time (the `_dequant_nvfp4_to_bf16` method already exists). This trades memory for correctness.
 
 ## Progress
 
 - [x] Removed NaN check (Dynamo incompatible)
 - [x] vLLM container starts and loads model
 - [x] Confirmed NaN logits from completions API
-- [x] ~~H1: gs is wrong~~ — Warmup gs produces cosine 0.988 with BF16 ref. **gs is NOT the problem** when warmup is used.
-  - Default gs (1/2688) gives cosine 0.621, but vLLM calls warmup during init
-  - **BUT:** Does vLLM actually call warmup before every forward, or just once? If gs is computed from random data once and never updated, it may not generalize.
-- [ ] **New lead:** MoE kernel is fine, problem is upstream (attention, embeddings, or weight loading in vLLM path)
+- [x] MoE kernel: cosine 0.988 with warmup gs — NOT the problem
+- [x] NaN starts at layer 0 — attention is the source
+- [x] Root cause: attention NVFP4 `input_scale` from checkpoint produces NaN during activation quantization
+- [ ] **Next: Fix attention NVFP4 path — dequant to BF16 or add warmup**
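
The scale arithmetic in the root-cause note can be checked with a few lines of plain Python. This is a minimal sketch, not vLLM code: the helper `scaled_amax` and the constant names are hypothetical, the `input_scale` value is the one quoted for q_a_proj, and the 448 bound assumes the NVFP4 block scales are stored as FP8 E4M3, which the note does not state explicitly.

```python
# Hypothetical names for illustration; none of this is vLLM's API.
FP4_E2M1_MAX = 6.0    # largest magnitude in the E2M1 value table
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value (assumed block-scale dtype)


def scaled_amax(amax: float, input_scale: float) -> float:
    """Activation amax after dividing by input_scale, as the note describes."""
    return amax / input_scale


input_scale = 0.00025141          # q_a_proj value quoted in the note
multiplier = 1.0 / input_scale    # the "multiplies by" factor

# Largest scaled magnitude that an FP4 value times an FP8 block scale can cover.
representable_max = FP4_E2M1_MAX * FP8_E4M3_MAX

for amax in (2.0, 8.0):           # the "typical" amax range quoted in the note
    scaled = scaled_amax(amax, input_scale)
    print(
        f"amax={amax}: scaled={scaled:.0f}, "
        f"limit={representable_max:.0f}, overflow={scaled > representable_max}"
    )
```

Under these assumptions even the low end of the quoted amax range lands well above 6 × 448 = 2688, which is consistent with the overflow described in the note. It does not by itself show where in the kernel the NaN first appears.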
diff --git a/vllm/patches/deepseek_v4.py b/vllm/patches/deepseek_v4.py
index 516fd6e6..2e88e35a 100644
--- a/vllm/patches/deepseek_v4.py
+++ b/vllm/patches/deepseek_v4.py
@@ -1702,28 +1702,27 @@ class DeepseekV4Model(nn.Module):
     def _convert_nvfp4_post_load(self):
         """Post-load conversion of NVFP4 weights for vLLM compatibility.
         
-        Only wo_a needs FP8 conversion (attention forward uses fp8_einsum
-        which requires FP8 inputs). All other NVFP4 weights stay native —
-        vLLM's FlashInferCutlassNvFp4LinearKernel handles them directly.
+        All attention NVFP4 projections are dequantized to BF16 because
+        the checkpoint input_scale values cause NaN during activation
+        quantization in FlashInferCutlassNvFp4LinearKernel. BF16 bypasses
+        the broken input_scale entirely.
         
         Compressor weights are reconstructed from checkpoint sub-weights
         because the stacking weight_loader corrupts NVFP4 uint8 data.
         """
-        FP8_MAX = torch.finfo(torch.float8_e4m3fn).max
-        
-        # Only wo_a needs conversion — fp8_einsum requires FP8 weight + scale
-        fp8_proj_names = {"wo_a"}
-        fp8_converted = 0
+        # Attention projections to dequantize to BF16
+        bf16_proj_names = {"wq_a", "wq_b", "wkv", "wo_a", "wo_b"}
+        bf16_converted = 0
         compressor_converted = 0
 
         _shard_index = self._build_shard_index("/model") if os.path.isdir("/model") else None
 
         from tqdm import tqdm
-        for layer_idx, layer in tqdm(enumerate(self.layers), total=len(self.layers), desc="  (upcast)NVFP4→FP8 wo_a only", unit="layer"):
+        for layer_idx, layer in tqdm(enumerate(self.layers), total=len(self.layers), desc="  (upcast)NVFP4→BF16 attn projs", unit="layer"):
             attn = layer.attn
             
-            # FP8 conversion: only wo_a
-            for proj_name in fp8_proj_names:
+            # BF16 dequantization: all attention projections
+            for proj_name in bf16_proj_names:
                 if not hasattr(attn, proj_name):
                     continue
                 mod = getattr(attn, proj_name)
@@ -1731,8 +1730,8 @@ class DeepseekV4Model(nn.Module):
                     continue
                 if mod.weight.dtype in (torch.uint8, torch.int8):
                     E2M1_LUT = torch.tensor([0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=torch.bfloat16)
-                    self._convert_nvfp4_to_fp8(mod, E2M1_LUT, FP8_MAX)
-                    fp8_converted += 1
+                    self._dequant_nvfp4_to_bf16(mod, E2M1_LUT)
+                    bf16_converted += 1
             
             # Compressor: still needs BF16 reconstruction
             mla_attn = getattr(attn, "mla_attn", None)