diff --git a/DEBUG_LOG.md b/DEBUG_LOG.md
new file mode 100644
index 00000000..0a604c05
--- /dev/null
+++ b/DEBUG_LOG.md
@@ -0,0 +1,130 @@
+# NVFP4 MegaMoE Debug Log
+
+## Current State (May 15, 2026)
+
+**Status:** Model produces garbage output. Deterministic prompt "The capital of France is" produces `-W'MSG173 ~SB…abych` instead of "Paris".
+
+## Symptoms
+
+- No NaN/Inf anywhere in the pipeline
+- Magnitudes at each stage look reasonable:
+  - L1 GEMM output: amax ~8-10
+  - SiLU activation: amax ~34-43
+  - L2 GEMM output: amax ~17-28
+  - Scatter output: amax ~5-15
+- FP4 activation quantization round-trip: reconstructed amax ~3.86, igs (input global scale) = 1.4e-3
+- All 8 TP ranks produce identical weight shapes after transformation
+- Experts have distinct weights and scales (not duplicated)
+
+**Key observation:** The signal is present but buried. "Paris" ranks 2075 out of 129280 with logit 9.25, so the model still scores the correct token, but the top logits point at garbage tokens. This suggests a systematic error that preserves magnitude but distorts direction.
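+
+A toy illustration of that failure mode, using made-up values rather than model data: reordering a vector leaves its amax untouched while its cosine similarity to the original collapses. This is why healthy per-stage amax values do not, on their own, clear a stage.
+
+```python
+import math
+
+def amax(v):
+    """Largest absolute value in the vector."""
+    return max(abs(x) for x in v)
+
+def cosine(a, b):
+    """Cosine similarity between two equal-length vectors."""
+    dot = sum(x * y for x, y in zip(a, b))
+    na = math.sqrt(sum(x * x for x in a))
+    nb = math.sqrt(sum(y * y for y in b))
+    return dot / (na * nb)
+
+# Hypothetical "correct" activations (illustration only).
+ref = [8.0, -1.0, 0.5, -6.0, 0.25, 3.0, -0.5, 1.0]
+# Same values in reversed order, standing in for a layout or packing error.
+bad = ref[::-1]
+
+print("amax:", amax(ref), amax(bad))          # identical by construction
+print("cosine:", round(cosine(ref, bad), 3))  # far below 1.0
+```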
+
+## Pipeline Trace (per layer, from last inference)
+
+```
+[L1-GEMM-OUT]  slots=1 N=6144 amax=8.6250e+00
+[L1-SPLIT]     gate amax=7.1250e+00 | up amax=8.6250e+00
+[SILU-ACT]     amax=3.4500e+01
+[L2-GEMM-OUT]  slots=1 N=7168 amax=1.8500e+01
+[SCATTER]      y amax=6.7500e+00 slots=1
+```
+
+## Hypotheses Investigated
+
+### 1. ❌ NaN/Inf in GEMM
+Ruled out. All outputs finite, no NaN detected at any stage.
+
+### 2. ❌ Weight shape mismatch
+Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).
+
+### 3. ❌ Global scale folding precision loss
+Previously identified (commit `da5572f`). Folding the float32 global scale into the float8 block scales (block_sf × global_sf → float8) loses ~25% precision in the low-precision zone. Fixed by passing global scales as per-expert alpha instead of folding. This fix did not resolve the garbage output.
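+
+A minimal sketch of why folding can hurt, assuming the "low-precision zone" refers to the subnormal range of FP8 E4M3. That reading, the helper `round_to_e4m3`, and the scale values are assumptions for illustration and are not taken from the commit. Below 2^-6, E4M3 values are spaced 2^-9 apart, so a small folded product is rounded coarsely, whereas a separate float32 alpha leaves the stored block scale in its well-resolved range.
+
+```python
+import math
+
+def round_to_e4m3(x):
+    """Round a non-negative float to the nearest FP8 E4M3 value (max 448)."""
+    if x <= 0.0:
+        return 0.0
+    x = min(x, 448.0)
+    # frexp returns x = m * 2**e with 0.5 <= m < 1, so floor(log2(x)) == e - 1.
+    exp = max(math.frexp(x)[1] - 1, -6)  # clamp to the smallest normal exponent
+    step = 2.0 ** (exp - 3)              # 3 mantissa bits per binade
+    return min(round(x / step) * step, 448.0)
+
+block_sf = 0.3125   # hypothetical block scale, exactly representable in E4M3
+global_sf = 0.01    # hypothetical float32 global scale
+
+exact = block_sf * global_sf                  # what alpha-based scaling applies
+folded = round_to_e4m3(block_sf * global_sf)  # what folding would store
+rel_err = abs(folded - exact) / exact
+
+print(f"exact={exact:.6f} folded={folded:.6f} rel_err={rel_err:.1%}")
+```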
+
+### 4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)
+Previously identified (May 13). The original DeepGEMM kernel crashed. It was replaced with a CUTLASS-based implementation (commit history). A standalone test shows cosine=1.0 and MSE=0.0 on random data.
+
+### 5. 🔍 E2M1 packing convention mismatch
+**Status: Open.** The CUTLASS kernel expects `nv_float4_t<float_e2m1_t>` packed as 2 nibbles per byte. Our `stage_activation` packs `(nibbles[..., 1] << 4) | nibbles[..., 0]` (even→low, odd→high). The checkpoint weights use the same convention. The standalone test showed cosine 1.0 with this packing, but both A and B were packed the same way, so a convention error shared by both operands would cancel out and the test cannot rule this hypothesis out.
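+
+The cancellation argument can be checked with a small sketch. The 4-bit codes are treated as plain integers here for simplicity, and the helper names are hypothetical. Reading both operands with the wrong nibble order swaps the same element pairs in A and B, so the dot product is unchanged, while misreading only one operand changes it.
+
+```python
+def pack_nibbles(nibbles):
+    """Pack 4-bit codes two per byte: even index -> low nibble, odd -> high."""
+    assert len(nibbles) % 2 == 0
+    return bytes((nibbles[i + 1] << 4) | nibbles[i]
+                 for i in range(0, len(nibbles), 2))
+
+def unpack_low_first(packed):
+    """Inverse of pack_nibbles (the convention described in this log)."""
+    out = []
+    for b in packed:
+        out += [b & 0xF, b >> 4]
+    return out
+
+def unpack_high_first(packed):
+    """The opposite convention: the high nibble holds the even element."""
+    out = []
+    for b in packed:
+        out += [b >> 4, b & 0xF]
+    return out
+
+def dot(a, b):
+    return sum(x * y for x, y in zip(a, b))
+
+a = [1, 2, 3, 4, 5, 6, 7, 0]  # toy 4-bit codes
+b = [7, 0, 1, 5, 2, 2, 3, 6]
+pa, pb = pack_nibbles(a), pack_nibbles(b)
+
+right = dot(unpack_low_first(pa), unpack_low_first(pb))
+both_wrong = dot(unpack_high_first(pa), unpack_high_first(pb))
+one_wrong = dot(unpack_low_first(pa), unpack_high_first(pb))
+
+print(right, both_wrong, one_wrong)
+```
+
+A test that can catch the mismatch therefore needs one operand whose unpacked values come from an independent source, such as an unquantized reference tensor.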
+
+### 6. 🔍 Attention output corruption from o_a_proj quantization
+**Status: Active investigation.** The checkpoint has `o_a_proj.weight` as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time because the model parameter is declared uint8. This is a lossy conversion of a 64M-parameter matrix that sits right before the MoE. If the quantization error here is significant, it propagates through all 61 MoE layers.
+
+The vLLM weight loader does:
+1. Compute per-block amax for the BF16 weight
+2. Compute global scale: `amax_max / (6.0 * 448.0)`
+3. Compute block scales: `amax / (6.0 * global_scale)` → float8
+4. Nearest-neighbor E2M1 quantization
+5. Pack 2 nibbles per byte: even→low, odd→high
+
+`o_a_proj` may need to stay in native BF16 and route through a BF16 matmul path instead.
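+
+The five loader steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the vLLM implementation: the function names are hypothetical, the block size of 16 is inferred from the scale-factor shapes in this log (7168 / 448), and ties in the E2M1 rounding go to the smaller magnitude for simplicity.
+
+```python
+import math
+
+E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes for codes 0..7
+
+def round_to_e4m3(x):
+    """Round a non-negative float to the nearest FP8 E4M3 value (max 448)."""
+    if x <= 0.0:
+        return 0.0
+    x = min(x, 448.0)
+    exp = max(math.frexp(x)[1] - 1, -6)  # floor(log2(x)), clamped to min normal
+    step = 2.0 ** (exp - 3)              # 3 mantissa bits
+    return min(round(x / step) * step, 448.0)
+
+def quantize_e2m1(x):
+    """Nearest E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
+    mag = min(range(8), key=lambda i: abs(abs(x) - E2M1_VALUES[i]))
+    return mag | (0x8 if x < 0 else 0)
+
+def quantize_nvfp4(weights, block=16):
+    """Quantize a flat list whose length is a multiple of `block`."""
+    # 1. Per-block amax.
+    blocks = [weights[i:i + block] for i in range(0, len(weights), block)]
+    amaxes = [max(abs(v) for v in blk) for blk in blocks]
+    # 2. Global scale (falls back to 1.0 for an all-zero tensor).
+    global_scale = max(amaxes) / (6.0 * 448.0) or 1.0
+    codes, block_sfs = [], []
+    for blk, blk_amax in zip(blocks, amaxes):
+        # 3. Block scale, stored as float8.
+        sf = round_to_e4m3(blk_amax / (6.0 * global_scale))
+        block_sfs.append(sf)
+        # 4. Nearest-neighbor E2M1 quantization of the scaled values.
+        denom = sf * global_scale
+        codes += [quantize_e2m1(v / denom if denom else 0.0) for v in blk]
+    # 5. Pack 2 nibbles per byte: even -> low, odd -> high.
+    packed = bytes((codes[i + 1] << 4) | codes[i]
+                   for i in range(0, len(codes), 2))
+    return packed, block_sfs, global_scale
+
+# Toy input: 32 values, i.e. two blocks of 16.
+packed, block_sfs, global_scale = quantize_nvfp4([0.1 * (i - 10) for i in range(32)])
+print(len(packed), block_sfs, global_scale)
+```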
+
+### 7. 🔍 BF16 reference comparison
+**Status: In progress.** Adding a diagnostic that dequantizes FP4 activation + FP4 weights back to BF16, runs a reference matmul, then compares to the NVFP4 GEMM output via cosine similarity. This will isolate whether the CUTLASS kernel is producing correct output given the same quantized inputs.
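+
+A minimal sketch of such a diagnostic on toy data, assuming one block scale per 16 elements and a single global scale per operand. The function names are hypothetical, and `kernel_out` is a placeholder where the real NVFP4 GEMM output would go.
+
+```python
+import math
+
+E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
+
+def decode_e2m1(code):
+    """4-bit code -> float: bit 3 is the sign, bits 0-2 index the magnitude."""
+    mag = E2M1_VALUES[code & 0x7]
+    return -mag if code & 0x8 else mag
+
+def dequantize(codes, block_sfs, global_scale, block=16):
+    """Flat list of codes -> floats; the global scale is applied exactly once."""
+    return [decode_e2m1(c) * block_sfs[i // block] * global_scale
+            for i, c in enumerate(codes)]
+
+def reference_gemv(x, w_columns):
+    """y[n] = sum_k x[k] * w[k][n], with the weight given as a list of columns."""
+    return [sum(a * b for a, b in zip(x, col)) for col in w_columns]
+
+def cosine(a, b):
+    dot = sum(p * q for p, q in zip(a, b))
+    return dot / (math.sqrt(sum(p * p for p in a)) *
+                  math.sqrt(sum(q * q for q in b)))
+
+# Toy operands: one 16-element activation block and two weight columns.
+x = dequantize([1, 2, 3, 4] * 4, block_sfs=[2.0], global_scale=0.5)
+w0 = dequantize([7, 9, 0, 5] * 4, block_sfs=[1.0], global_scale=0.25)
+w1 = dequantize([2, 2, 12, 1] * 4, block_sfs=[4.0], global_scale=0.25)
+ref = reference_gemv(x, [w0, w1])
+
+kernel_out = list(ref)  # placeholder: substitute the NVFP4 GEMM output here
+print(ref, cosine(ref, kernel_out))
+```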
+
+## Key Commits
+
+| Commit | Description |
+|--------|-------------|
+| `da5572f` | Stop folding global scale into float8 block scales (25% precision loss fix) |
+| `d0ed3d8` | Add L2, SiLU, and scatter pipeline prints |
+| `995589a` | Add FP4 quantization round-trip diagnostic |
+| `c421a66` | Add BF16 reference GEMM + cosine comparison for L1 |
+| `2fd55a9` | Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |
+
+## Bugs Fixed During This Debug Session
+
+### Weight nibble unpack reshape bug (commit `2fd55a9`)
+
+**Bug:** In the BF16 reference diagnostic, `torch.stack([wlo, whi], dim=-1).reshape(w_u8.shape[0], -1)` on a 2D weight of shape `(K_half, N)` = `(3584, 6144)` produced `(3584, 12288)` instead of `(7168, 6144)`. The `-1` folded the stacked nibble axis into the N dimension instead of the K dimension.
+
+**Fix:** Changed to `.reshape(w_u8.shape[0] * 2, w_u8.shape[1])` to preserve the column (N) dimension and double the row (K) dimension.
+
+**Impact:** Only affected the BF16 reference diagnostic code, not the actual NVFP4 kernel. The CUTLASS kernel receives weights already in the correct packed format.
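+
+A pure-Python sketch of the unpack on a toy 2×2 byte matrix, assuming each byte holds two K-adjacent elements (even k in the low nibble) in row-major order. The helper names are hypothetical. It also shows a point worth double-checking: this log does not show whether the stack axis changed along with the reshape, and if the stack stays on the last axis, a reshape to `(K, N)` gives the right shape but not the right element order.
+
+```python
+def unpack_along_k(w_u8):
+    """(K_half, N) bytes -> (K, N) nibbles, with K-adjacent pairs per byte."""
+    out = []
+    for row in w_u8:
+        out.append([b & 0xF for b in row])  # row 2k: low nibbles
+        out.append([b >> 4 for b in row])   # row 2k+1: high nibbles
+    return out
+
+def stack_last_then_reshape(w_u8, rows, cols):
+    """Mimic stack([lo, hi], dim=-1) followed by a row-major reshape(rows, cols)."""
+    flat = [nib for row in w_u8 for b in row for nib in (b & 0xF, b >> 4)]
+    assert len(flat) == rows * cols
+    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]
+
+w_u8 = [[0x21, 0x43],
+        [0x65, 0x87]]  # toy packed weight with K_half=2, N=2
+k_half, n = len(w_u8), len(w_u8[0])
+
+buggy = stack_last_then_reshape(w_u8, k_half, 2 * n)       # (2, 4): N doubled
+shape_only = stack_last_then_reshape(w_u8, 2 * k_half, n)  # (4, 2): order differs
+correct = unpack_along_k(w_u8)                             # (4, 2): rows interleaved
+
+print(buggy)
+print(shape_only)
+print(correct)
+```
+
+Under the stated packing assumption, stacking on `dim=1` before the reshape would produce the interleaved-row order that `unpack_along_k` returns.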
+
+### igs double-count in reference (commit `2fd55a9`)
+
+**Bug:** The BF16 reference multiplied by `igs` (input global scale) in `x_bf16` AND again in `ref_out = ref_out * igs`.
+
+**Fix:** Removed the final `ref_out * igs` — it's already included via `x_bf16 = x_deq * sf_exp * igs`.
+
+**Impact:** Only affected the BF16 reference diagnostic, not the kernel.
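+
+One property of this class of bug is worth keeping in mind: cosine similarity is scale-invariant, so an extra factor of `igs` is invisible to the cosine check and shows up only when magnitudes are compared. A small sketch with hypothetical output values:
+
+```python
+import math
+
+def cosine(a, b):
+    dot = sum(p * q for p, q in zip(a, b))
+    return dot / (math.sqrt(sum(p * p for p in a)) *
+                  math.sqrt(sum(q * q for q in b)))
+
+igs = 1.4e-3                      # value reported under Symptoms
+ref = [8.5, -2.0, 0.75, 4.0]      # hypothetical output with igs applied once
+doubled = [v * igs for v in ref]  # same output with igs applied a second time
+
+cos = cosine(ref, doubled)
+ratio = max(abs(v) for v in doubled) / max(abs(v) for v in ref)
+print(cos, ratio)  # direction unchanged, magnitude scaled by igs
+```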
+
+## Architecture Notes
+
+### DeepSeek-V4 MoE Layer Forward Pass
+```
+residual = x
+x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
+x = attn_norm(x)
+x = attn(x)                          ← uses o_a_proj, quantized BF16→NVFP4 at load time
+x = hc_post(x, residual, post, comb)
+
+residual = x
+x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
+x = ffn_norm(x)
+x = ffn(x)                           ← Our NVFP4 mega_moe kernel
+x = hc_post(x, residual, post, comb)
+```
+
+### NVFP4 MoE Pipeline
+```
+stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
+L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
+SiLU(gate) * up → activated
+stage_activation(activated) → l1_fp4, l1_sf, l1_igs
+L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
+scatter with routing weights → y
+```
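+
+The activation step between the two GEMMs can be sketched as below. The split of `gate_up` into a first-half gate and second-half up is an assumption (the real layout may be interleaved), and the function names are hypothetical.
+
+```python
+import math
+
+def silu(x):
+    """SiLU(x) = x * sigmoid(x), written to avoid overflow for large |x|."""
+    if x >= 0:
+        return x / (1.0 + math.exp(-x))
+    e = math.exp(x)
+    return x * e / (1.0 + e)
+
+def gated_activation(gate_up):
+    """Split [gate | up] halves and return SiLU(gate) * up elementwise."""
+    half = len(gate_up) // 2
+    gate, up = gate_up[:half], gate_up[half:]
+    return [silu(g) * u for g, u in zip(gate, up)]
+
+# Toy row with N=4: gate = [0.0, 2.0], up = [3.0, 4.0].
+activated = gated_activation([0.0, 2.0, 3.0, 4.0])
+print(activated)
+```
+
+Because the result is a product of two terms, its amax can exceed the amax of either input. An amax of 34.5 after the activation is therefore compatible with gate and up amax values of 7.125 and 8.625 in the trace.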
+
+### Checkpoint Layers (layer 0)
+- **MoE experts 0-210, 212-255:** gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
+- **Expert 211:** shared expert, gate_proj + up_proj only (no down_proj)
+- **o_a_proj.weight:** BF16 (16384, 4096) — NOT quantized by ModelOpt
+- **o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor:** NVFP4
+- **Gate weight, norms, sinks, position_bias:** BF16
+
+## Next Steps
+
+1. **Get BF16 reference cosine:** Determine whether the CUTLASS GEMM is correct
+2. **If cosine ≈ 1.0:** The GEMM is correct for the quantized inputs it receives, so the problem is upstream (attention, likely o_a_proj). Fix: keep o_a_proj in native BF16
+3. **If cosine << 1.0:** The problem is in the CUTLASS GEMM or the activation quantization, and the kernel itself needs debugging
+4. **Test with SKIP_ATTENTION=1:** Bypass attention and feed the raw input to the MoE. If the output improves, that confirms attention is the issue