diff --git a/CURRENT_BUG.md b/CURRENT_BUG.md
index e33bf5c6..04df8fbf 100644
--- a/CURRENT_BUG.md
+++ b/CURRENT_BUG.md
@@ -1,31 +1,91 @@
 # CURRENT_BUG.md — DeepSeek-V4 Blackwell NVFP4
 
-## Status: KV CACHE PIPELINE VERIFIED ✅
+## Status: NaN IN MOE — ROOT CAUSE UNKNOWN
 
-### What's Fixed
-- **Root cause identified**: vLLM's `_attention_impl_blackwell` never writes KV to the paged cache, so decode produces garbage because it can't access prior tokens' KV.
-- **Solution built and tested**: `cutedsl/blackwell_attention.py` + `vllm/patches/layers/csa_attention.py` — KV cache write/read pipeline using fp8 quantization.
+### Current Symptom
+- vLLM container starts, model loads, server accepts requests
+- **Output is empty** — model generates tokens but they decode to nothing
+- Debug logs show **NaN in hidden_states** entering the attention from the FIRST forward pass
+- NaN propagates through all 61 layers → all outputs are NaN → garbage tokens
+- Both C128A (cr=128) and C4A (cr=4) layers have NaN in their inputs
 
-### Test Results (B200 venv, all passing)
+### NaN Tracing
+```
+Layer 0 (C128A): hidden_states input → ??? → NaN in attention input
+Layer 1-59 (C4A): NaN in attention input (propagated)
+Layer 60 (SWA): NaN in attention input (propagated)
+```
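+The NaN is already present BEFORE attention runs. The suspected source is the MoE output that feeds the next layer, although layer 0 has no preceding MoE, so its NaN is still unexplained (the `???` above).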
 
-| Test | Result |
-|------|--------|
-| KV cache roundtrip (fp8 quant → dequant) | 0.999+ cosine |
-| Decode attention (1 query vs N cached KVs) | 0.9998 cosine |
-| Full pipeline (inv RoPE + o_a + o_b) | 0.996-0.999 cosine |
-| All 5 layer types (C128A, C4A, SWA) | ≥0.996 cosine |
-| E2E 61-layer model (shared experts) | Healthy logits, consistent tokens |
-| Multi-step decode (3 steps) | 0.999+ cosine each step |
+### Architecture: DeepSeek-V4 MegaMoE
+- **384 experts, top-6 routing** — this is a "MegaMoE" architecture
+- DeepGEMM has a specialized `mega_moe.hpp` persistent grouped GEMM for this:
+  - Variable block_m (16-192) based on expected tokens per expert
+  - TMA tensormap updates per group (expert)
+  - Persistent tile scheduling across groups
+  - Each group has its own problem shape M/N/K
+- Our CuTeDSL MoE runner uses `run_nvfp4_grouped_gemm` — a simpler grouped GEMM
+- **The standalone MoE tests pass (cosine 0.988) but may not exercise the same shapes/paths as vLLM**
 
-### What's Next
-1. Test in vLLM container (build_and_run.sh)
-2. Handle CSA/HCA sparse attention in the Blackwell path (currently using full attention for all layers)
-3. Add routed MoE experts (currently shared experts only)
-4. Performance optimization (vectorized paged KV, Triton kernels)
+### What's Been Verified (B200 venv, all passing)
+| Component | Test | Result |
+|-----------|------|--------|
+| NVFP4 Linear (q_a, kv, q_b, o_b) | cosine per projection | 0.998-1.0 |
+| NVFP4 MoE (L1 gate+up, L2 down) | cosine per layer | 0.988 |
+| KV cache roundtrip (fp8) | cosine | 0.999 |
+| Decode attention (1 query vs N KV) | cosine | 0.9998 |
+| Full pipeline (inv RoPE + o_a + o_b) | cosine | 0.996-0.999 |
+| All 5 layer types | cosine | ≥0.996 |
+| E2E 61-layer (shared experts) | logits std=3.16 | reasonable |
+| CSA sparse attention (C4A) | cosine | 0.974 |
+| CSA sparse attention (C128A) | cosine | 0.668 (avg-pooled KV) |
+| Multi-step decode | cosine | 0.999 |
 
-### Architecture
-- KV latent: (T, HD=512) shared across 128 Q heads
-- KV Cache: fp8_e4m3 paged cache with per-token inverse scale
-- Attention: BF16 (NVFP4 too lossy for Q×K^T)
-- Prefill: causal SDPA on raw KV
-- Decode: read all cached KV → fp8 dequant → SDPA → output
+### What's Been Fixed in vLLM Integration
+1. Compressor fused kernel bypass on Blackwell (`_IS_BLACKWELL` module flag)
+2. Double Q normalization removed (fused_qnorm only does RoPE now)
+3. RoPE sin slice bug fixed (`half:2*half` not `half:`)
+4. fp8 dequant fix (use `kv_dequantize_fp8` not `.to(bf16)`)
+5. Wrapper attribute access (`self.mla_attn.kv_cache` etc.)
+6. Paged KV decode using `decode_swa_indices` from metadata
+7. `UnboundLocalError` fix for debug prints
+
+### What's NOT Working
+- **Container produces empty/garbage output**
+- **NaN in hidden_states** from first forward pass
+- The NaN is suspected to come from the MoE (routed experts) or from the activation quantization; neither is confirmed yet
+- The CuTeDSL grouped GEMM may produce NaN for certain expert token distributions
+
+### Test Plan — Finding the NaN
+
+**Phase 1: Reproduce the NaN in the B200 venv (outside container)**
+1. Test `CuTeDSLMoERunner.run()` with the EXACT same inputs vLLM would provide:
+   - `hidden_states` from the embedding + first layer attention
+   - `topk_ids` and `topk_weights` from the router
+   - Variable token counts per expert, with vLLM's padding to 128 rows per expert
+2. Test with 1 token (decode), 8 tokens (small prefill), and padded shapes
+3. Check for NaN after L1 GEMM, after SiLU activation, after L2 GEMM
+4. Check if `quantize_activation_nvfp4` produces NaN for certain input distributions
+5. Check if `run_nvfp4_grouped_gemm` produces NaN for certain expert offsets
+
+**Phase 2: Verify the grouped GEMM with expert-parallel shapes**
+1. Test with 48 experts (EP8, 384/8), 1-8 tokens, top-6
+2. Test with padding to 128 rows per expert
+3. Check if the GEMM handles zero-token experts correctly
+4. Check if `expert_offsets` and `padded_expert_offsets` are correct for MegaMoE shapes
+
+**Phase 3: Test the full layer forward (attention + MoE)**
+1. Run layer 0 (C128A) with real weights, check output for NaN
+2. Run layer 2 (C4A) with real weights, check output for NaN
+3. If NaN appears, bisect: which component produces it?
+
+**Phase 4: Fix and verify**
+1. Fix the NaN source
+2. Run all B200 venv tests
+3. Build container, test with real inference
+4. Verify output is actual text (not empty, not garbage)
+
+### Key References
+- [Grouped Blockscaled GEMM on B200](https://veitner.bearblog.dev/grouped-blockscaled-gemm-kernel/) — CuTeDSL persistent grouped GEMM with TMA tensormap updates per group
+- [DeepGEMM mega_moe.hpp](https://github.com/deepseek-ai/DeepGEMM/blob/main/csrc/jit_kernels/heuristics/mega_moe.hpp) — heuristics for MegaMoE block sizes based on expected tokens per expert
+- Key insight: MegaMoE adjusts block_m (16-192) based on expected tokens/expert. For decode (few tokens), block_m=16-32. For prefill, block_m=192.
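Sketch for Phase 2 of the test plan in CURRENT_BUG.md (zero-token experts, padding to 128 rows). This is a pure-Python reference for cross-checking offsets, not the runner's implementation. The function name, the `pad_to` parameter, and the convention that a zero-token expert stays at zero rows are assumptions; the real `expert_offsets` / `padded_expert_offsets` layout in the runner may differ.

```python
from itertools import accumulate


def expert_offsets(topk_ids, num_experts, pad_to=128):
    """Return (counts, offsets, padded_offsets) for a routing table.

    topk_ids is a list of per-token lists of expert ids. Offsets are
    exclusive prefix sums with a trailing total, so the rows of expert e
    live in [offsets[e], offsets[e + 1]).
    """
    counts = [0] * num_experts
    for row in topk_ids:
        for e in row:
            counts[e] += 1
    # Round each count up to a multiple of pad_to (assumption: zero stays zero).
    padded = [-(-c // pad_to) * pad_to for c in counts]
    offsets = [0] + list(accumulate(counts))
    padded_offsets = [0] + list(accumulate(padded))
    return counts, offsets, padded_offsets


# 2 tokens, top-2 routing over 4 local experts; expert 2 receives no tokens.
counts, offsets, padded_offsets = expert_offsets([[0, 1], [1, 3]], num_experts=4)
print(counts)          # [1, 2, 0, 1]
print(offsets)         # [0, 1, 3, 3, 4]
print(padded_offsets)  # [0, 128, 256, 256, 384]
```

A zero-token expert shows up as two equal consecutive offsets. That is the case Phase 2 step 3 asks about, so comparing the runner's offsets against this reference for the same `topk_ids` is a cheap first check.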
diff --git a/tests/test_moe_nan_b200.py b/tests/test_moe_nan_b200.py
new file mode 100644
index 00000000..e447fac6
--- /dev/null
+++ b/tests/test_moe_nan_b200.py
@@ -0,0 +1,231 @@
+#!/usr/bin/env python3
+"""
+DeepSeek-V4 MoE NaN Reproduction Test
+
+Checks whether the MoE forward pass produces NaN for a range of token counts.
+Tests the EXACT CuTeDSLMoERunner code path used by vLLM.
+
+This test is the FIRST step: if the MoE produces NaN, the entire model
+produces garbage. We need to find the NaN source before anything else matters.
+
+Test plan:
+1. Load MoE weights for a single layer
+2. Run the CuTeDSLMoERunner with various token counts and routing patterns
+3. Check the final output for NaN; per-step bisection (quantize → L1 GEMM → SiLU → L2 GEMM → combine) is a follow-up
+4. Specifically test with MegaMoE shapes: 48 experts (EP8), padded to 128 rows
+
+Usage (on B200):
+  cd /root/nvfp4-megamoe-kernel
+  PYTHONPATH=/root/nvfp4-megamoe-kernel tests/venv/bin/python tests/test_moe_nan_b200.py
+"""
+
+import sys, os, json, torch, torch.nn.functional as F
+from safetensors import safe_open
+
+REPO = "/root/nvfp4-megamoe-kernel"
+sys.path.insert(0, REPO)
+MODEL = "/root/nvidia-meeting/DeepSeek-V4-Pro-NVFP4"
+DEV = "cuda:0"
+
+H = 7168; NH = 128; HD = 512; NOPE = 448; ROPE = 64
+QL = 1536; OL = 1024; OG = 16; HPG = NH // OG
+INTERMEDIATE = 18432  # DeepSeek-V4 MoE intermediate size
+NUM_EXPERTS = 48      # EP8: 384/8
+TOPK = 6
+EPS = 1e-6; WINDOW = 128; SCALE = HD ** -0.5
+
+_cache = {}
+def P(k, wm, md):
+    if k in _cache: return _cache[k]
+    with safe_open(os.path.join(md, wm[k]), framework="pt") as f:
+        t = f.get_tensor(k)
+    _cache[k] = t
+    return t
+
+def rms(x, w, eps=1e-6):
+    v = x.float().pow(2).mean(-1, keepdim=True)
+    return (w.float() * (x * torch.rsqrt(v+eps)).float()).to(x.dtype)
+
+
+def test_moe_layer(layer_id=2):
+    """Test the MoE forward pass for a single layer, checking for NaN at each step."""
+    from cutedsl.runner import CuTeDSLMoERunner
+    
+    torch.cuda.set_device(0)
+    torch.manual_seed(42)
+    torch.cuda.empty_cache()
+    
+    with open(os.path.join(MODEL, "model.safetensors.index.json")) as f:
+        wm = json.load(f)["weight_map"]
+    G = lambda k: P(k, wm, MODEL).to(DEV)
+    
+    p = f"model.layers.{layer_id}"
+    m = f"{p}.mlp"
+    
+    # Load embedding for input
+    emb = G("model.embed_tokens.weight")
+    fnorm = G(f"{p}.post_attention_layernorm.weight")
+    
+    # MoE weights
+    # Gate/up (w13): (E, 2*intermediate, hidden//2) uint8
+    # Down (w2): (E, hidden, intermediate//2) uint8
+    w13_w = G(f"{m}.experts.w13_weight")  # or gate_proj + up_proj
+    w13_sf = G(f"{m}.experts.w13_weight_scale")
+    w13_gs = G(f"{m}.experts.w13_weight_scale_2")
+    w2_w = G(f"{m}.experts.w2_weight")
+    w2_sf = G(f"{m}.experts.w2_weight_scale")
+    w2_gs = G(f"{m}.experts.w2_weight_scale_2")
+    swiglu_limit = None
+    
+    # Shared expert
+    se_gate_w = G(f"{m}.shared_experts.gate_proj.weight")
+    se_gate_sf = G(f"{m}.shared_experts.gate_proj.weight_scale")
+    se_gate_gs = G(f"{m}.shared_experts.gate_proj.weight_scale_2")
+    se_up_w = G(f"{m}.shared_experts.up_proj.weight")
+    se_up_sf = G(f"{m}.shared_experts.up_proj.weight_scale")
+    se_up_gs = G(f"{m}.shared_experts.up_proj.weight_scale_2")
+    se_down_w = G(f"{m}.shared_experts.down_proj.weight")
+    se_down_sf = G(f"{m}.shared_experts.down_proj.weight_scale")
+    se_down_gs = G(f"{m}.shared_experts.down_proj.weight_scale_2")
+    
+    print(f"  w13_weight shape: {w13_w.shape}, dtype: {w13_w.dtype}")
+    print(f"  w2_weight shape: {w2_w.shape}, dtype: {w2_w.dtype}")
+    print(f"  w13_gs shape: {w13_gs.shape}")
+    print(f"  w2_gs shape: {w2_gs.shape}")
+    print(f"  w13_gs sample: {w13_gs[:5].tolist()}")
+    print(f"  w2_gs sample: {w2_gs[:5].tolist()}")
+    
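+    # Check for NaN in weights. Packed uint8 FP4 (E2M1) has no NaN encoding, so the w13/w2 checks are always False; the scale checks are the informative ones.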
+    print(f"  w13 NaN: {torch.isnan(w13_w.float()).any()}")
+    print(f"  w2 NaN: {torch.isnan(w2_w.float()).any()}")
+    print(f"  w13_sf NaN: {torch.isnan(w13_sf.float()).any()}")
+    print(f"  w2_sf NaN: {torch.isnan(w2_sf.float()).any()}")
+    print(f"  w13_gs NaN: {torch.isnan(w13_gs).any()}")
+    print(f"  w2_gs NaN: {torch.isnan(w2_gs).any()}")
+    
+    # Create the MoE runner
+    num_local_experts = w13_w.shape[0]
+    hidden_size = w13_w.shape[2] * 2  # hidden//2 packed → *2 for fp4
+    intermediate_size = w13_w.shape[1] // 2  # 2*intermediate // 2
+    
+    print(f"\n  num_local_experts: {num_local_experts}")
+    print(f"  hidden_size: {hidden_size}")
+    print(f"  intermediate_size: {intermediate_size}")
+    
+    runner = CuTeDSLMoERunner(
+        num_experts=num_local_experts,
+        hidden_size=hidden_size,
+        intermediate_size=intermediate_size,
+        max_num_tokens=8192,
+        top_k=TOPK,
+        device=str(DEV),
+    )
+    
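+    # Prepare weights. Note: .to() converts values; if the block scales are stored as raw uint8 bytes, .view(torch.float8_e4m3fn) would be needed instead.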
+    l1_fp4 = w13_w.view(torch.float4_e2m1fn_x2)
+    l2_fp4 = w2_w.view(torch.float4_e2m1fn_x2)
+    l1_sf = w13_sf.to(torch.float8_e4m3fn) if w13_sf.dtype != torch.float8_e4m3fn else w13_sf
+    l2_sf = w2_sf.to(torch.float8_e4m3fn) if w2_sf.dtype != torch.float8_e4m3fn else w2_sf
+    
+    runner.prepare_weights_from_stacked(
+        l1_fp4, l1_sf, w13_gs.tolist(),
+        l2_fp4, l2_sf, w2_gs.tolist(),
+    )
+    
+    # Test with various token counts
+    test_cases = [
+        ("1 token (decode)", 1),
+        ("4 tokens", 4),
+        ("8 tokens", 8),
+        ("16 tokens", 16),
+    ]
+    
+    for desc, num_tokens in test_cases:
+        print(f"\n  --- {desc} ---")
+        token_ids = torch.randint(1, 1000, (num_tokens,), dtype=torch.long, device=DEV)
+        hidden = emb[token_ids]
+        normed = rms(hidden, fnorm, EPS)
+        
+        print(f"  Input: amax={normed.amax():.4f} NaN={torch.isnan(normed).any()}")
+        
+        # Create routing: top-6 DISTINCT experts per token, as a router's top-k would return (randint could repeat an expert)
+        topk_ids = torch.rand(num_tokens, num_local_experts, device=DEV).topk(TOPK, dim=-1).indices
+        topk_weights = torch.softmax(torch.randn(num_tokens, TOPK, device=DEV), dim=-1)
+        
+        with torch.no_grad():
+            result = runner.run(normed, topk_weights, topk_ids)
+        
+        print(f"  Output: amax={result.amax():.4f} NaN={torch.isnan(result).any()}")
+        if torch.isnan(result).any():
+            # Count NaN rows
+            nan_rows = torch.isnan(result).any(dim=1).sum().item()
+            print(f"  NaN rows: {nan_rows}/{num_tokens}")
+            
+            # Check if shared expert also produces NaN
+            from cutedsl.nvfp4_linear import CuTeDSLNvfp4Linear
+            def make_runner(w, sf, gs_t, inf, outf):
+                fp4 = w.view(torch.float4_e2m1fn_x2).permute(1,0).contiguous()
+                s = sf.to(torch.float8_e4m3fn) if sf.dtype != torch.float8_e4m3fn else sf
+                s = s.permute(1,0).contiguous()
+                gs = gs_t.max().item() if gs_t.numel() > 1 else gs_t.item()
+                r = CuTeDSLNvfp4Linear(in_features=inf, out_features=outf, max_num_tokens=8192, device=str(w.device))
+                r.fp4 = [fp4]; r.sf = [s]; r.gs = [gs]
+                r.finalize_weights(); r._ensure_initialized()
+                return r
+            
+            # Shared expert only
+            r_gate = make_runner(se_gate_w, se_gate_sf, se_gate_gs, H, se_gate_w.shape[0])
+            r_up = make_runner(se_up_w, se_up_sf, se_up_gs, H, se_up_w.shape[0])
+            r_down = make_runner(se_down_w, se_down_sf, se_down_gs, se_gate_w.shape[0], se_down_w.shape[0])  # down in_features = gate out_features
+            
+            with torch.no_grad():
+                gate_out = r_gate.run(normed)
+                up_out = r_up.run(normed)
+                activated = F.silu(gate_out) * up_out
+                se_result = r_down.run(activated)
+            
+            print(f"  Shared expert: amax={se_result.amax():.4f} NaN={torch.isnan(se_result).any()}")
+            
+            del r_gate, r_up, r_down
+    
+    # Repeat the 8-token case and dump an output sample. No padded inputs are built here; padding to 128 rows is left to runner.run().
+    print(f"\n  --- vLLM padding test (8 tokens, top-6, expert offsets) ---")
+    num_tokens = 8
+    token_ids = torch.randint(1, 1000, (num_tokens,), dtype=torch.long, device=DEV)
+    hidden = emb[token_ids]
+    normed = rms(hidden, fnorm, EPS)
+    topk_ids = torch.rand(num_tokens, num_local_experts, device=DEV).topk(TOPK, dim=-1).indices  # distinct experts per token
+    topk_weights = torch.softmax(torch.randn(num_tokens, TOPK, device=DEV), dim=-1)
+    
+    with torch.no_grad():
+        result = runner.run(normed, topk_weights, topk_ids)
+    
+    print(f"  Output: amax={result.amax():.4f} NaN={torch.isnan(result).any()}")
+    print(f"  Output sample (first 10): {result[0, :10].tolist()}")
+    
+    del runner
+    torch.cuda.empty_cache()
+    _cache.clear()
+
+
+def main():
+    print("=" * 70)
+    print("  DeepSeek-V4 MoE NaN Reproduction Test")
+    print("  Checks the MoE forward pass for NaN (final output only)")
+    print("=" * 70)
+    
+    test_moe_layer(layer_id=2)  # C4A layer
+    
+    print(f"\n{'='*70}")
+    print(f"  If NaN is found, bisect by testing each step:")
+    print(f"  1. quantize_activation_nvfp4(input)")
+    print(f"  2. run_nvfp4_grouped_gemm(L1)")
+    print(f"  3. SiLU(gate) * up")
+    print(f"  4. quantize_activation_nvfp4(activated)")
+    print(f"  5. run_nvfp4_grouped_gemm(L2)")
+    print(f"  6. scatter_add combine")
+    print(f"{'='*70}")
+
+
+if __name__ == "__main__":
+    main()
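Sketch of how the six bisection steps printed by `main()` could be driven. It runs named stages in order and reports the first one whose output contains NaN. Everything here is hypothetical: `first_nan_stage`, the `has_nan` hook, and the toy stages are illustrative stand-ins in pure Python, not the CuTeDSL functions. With tensors one would pass something like `has_nan=lambda t: bool(torch.isnan(t).any())`.

```python
import math


def first_nan_stage(x, stages, has_nan=None):
    """Run stages in order; return the name of the first stage whose
    output contains NaN, "input" if x already does, or None if clean."""
    if has_nan is None:
        has_nan = lambda values: any(math.isnan(v) for v in values)
    if has_nan(x):
        return "input"
    for name, fn in stages:
        x = fn(x)
        if has_nan(x):
            return name
    return None


# Toy stand-ins: clamp to the FP4 E2M1 range, then apply a scale.
def clamp(xs):
    return [min(max(v, -6.0), 6.0) for v in xs]


bad_scale = float("inf")  # e.g. a corrupted scale factor
stages = [
    ("quantize", clamp),
    ("l1_gemm", lambda xs: [v * bad_scale for v in xs]),  # 0.0 * inf -> NaN
    ("activation", lambda xs: xs),
]

print(first_nan_stage([0.0, 1.5], stages))  # l1_gemm
```

The toy failure (a zero multiplied by an infinite scale) is only one possible mechanism and is not a diagnosis of the actual bug. The helper is meant to narrow the search to a single stage before reading any kernel code.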