diff --git a/DEBUG_LOG.md b/DEBUG_LOG.md
new file mode 100644
index 00000000..0a604c05
--- /dev/null
+++ b/DEBUG_LOG.md
@@ -0,0 +1,130 @@
+# NVFP4 MegaMoE Debug Log
+
+## Current State (May 15, 2026)
+
+**Status:** Model produces garbage output. Deterministic prompt "The capital of France is" produces `-W'MSG173 ~SB…abych` instead of "Paris".
+
+## Symptoms
+
+- No NaN/Inf anywhere in the pipeline
+- Magnitudes at each stage look reasonable:
+  - L1 GEMM output: amax ~8-10
+  - SiLU activation: amax ~34-43
+  - L2 GEMM output: amax ~17-28
+  - Scatter output: amax ~5-15
+- FP4 activation quantization round-trip: reconstructed amax ~3.86, igs=1.4e-3
+- All 8 TP ranks produce identical weight shapes after transformation
+- Experts have distinct weights and scales (not duplicated)
+
+**Key observation:** The signal is there but buried. "Paris" appears at rank 2075/129280: the model knows the word exists (logit 9.25) but top logits point at garbage tokens. This suggests a systematic error that preserves magnitude but distorts direction.
+
+## Pipeline Trace (per layer, from last inference)
+
+```
+[L1-GEMM-OUT] slots=1 N=6144 amax=8.6250e+00
+[L1-SPLIT] gate amax=7.1250e+00 | up amax=8.6250e+00
+[SILU-ACT] amax=3.4500e+01
+[L2-GEMM-OUT] slots=1 N=7168 amax=1.8500e+01
+[SCATTER] y amax=6.7500e+00 slots=1
+```
+
+## Hypotheses Investigated
+
+### 1. ❌ NaN/Inf in GEMM
+Ruled out. All outputs finite, no NaN detected at any stage.
+
+### 2. ❌ Weight shape mismatch
+Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).
+
+### 3. ❌ Global scale folding precision loss
+Previously identified (commit `da5572f`). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision in low-precision zone. Fixed by passing global scales as per-expert alpha instead of folding. Did not fix the garbage output.
+
+### 4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)
+Previously identified (May 13).
+The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation (commit history). Standalone test shows cosine=1.0 and MSE=0.0 for random data.
+
+### 5. 🔍 E2M1 packing convention mismatch
+**Status: Open.** The CUTLASS kernel expects `nv_float4_t` packed as 2 nibbles per byte. Our `stage_activation` packs `(nibbles[..., 1] << 4) | nibbles[..., 0]` (even→low, odd→high). The checkpoint weights use the same convention. The standalone test showed cosine 1.0 with this packing, but both A and B were packed the same way: if both are wrong in the same way, the error cancels.
+
+### 6. 🔍 Attention output corruption from o_a_proj quantization
+**Status: Active investigation.** The checkpoint has `o_a_proj.weight` as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time because the model parameter is declared uint8. This is a lossy conversion of a 64M-parameter matrix that sits right before the MoE. If the quantization error here is significant, it propagates through all 61 MoE layers.
+
+The vLLM weight loader does:
+1. Compute per-block amax for the BF16 weight
+2. Compute global scale: `amax_max / (6.0 * 448.0)`
+3. Compute block scales: `amax / (6.0 * global_scale)` → float8
+4. Nearest-neighbor E2M1 quantization
+5. Pack 2 nibbles per byte: even→low, odd→high
+
+`o_a_proj` may need to stay in native BF16 and route through a BF16 matmul path instead.
+
+### 7. 🔍 BF16 reference comparison
+**Status: In progress.** Adding a diagnostic that dequantizes FP4 activation + FP4 weights back to BF16, runs a reference matmul, then compares to the NVFP4 GEMM output via cosine similarity. This will isolate whether the CUTLASS kernel is producing correct output given the same quantized inputs.
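A minimal pure-Python sketch of the five load-time quantization steps listed under hypothesis 6, intended only as a way to estimate how lossy the `o_a_proj` conversion is. It is not the vLLM loader. Assumptions: a block size of 16 (consistent with the 7168-row K dimension mapping to 448 scale rows above), block scales kept as Python floats rather than cast to float8, ties resolved toward the smaller magnitude, and hypothetical function names (`encode_e2m1`, `quantize_nvfp4`, `dequantize_nvfp4`).

```python
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # magnitudes for codes 0..7
E2M1_MAX = 6.0
FP8_E4M3_MAX = 448.0
BLOCK = 16  # assumed block size along K


def encode_e2m1(x):
    """Nearest E2M1 code for x: bit 3 is the sign, bits 0-2 index E2M1_VALUES."""
    code = min(range(8), key=lambda i: abs(abs(x) - E2M1_VALUES[i]))
    return (code | 0x8) if x < 0 else code


def quantize_nvfp4(values):
    """Quantize a non-empty flat list (length a multiple of BLOCK), steps 1-5."""
    blocks = [values[i:i + BLOCK] for i in range(0, len(values), BLOCK)]
    block_amax = [max(abs(v) for v in b) for b in blocks]  # step 1
    tensor_amax = max(block_amax)
    # Step 2: one global scale for the whole tensor (1.0 for an all-zero tensor).
    global_scale = tensor_amax / (E2M1_MAX * FP8_E4M3_MAX) if tensor_amax > 0 else 1.0
    # Step 3: per-block scales. The float8 cast is omitted in this sketch.
    block_scales = [a / (E2M1_MAX * global_scale) for a in block_amax]
    nibbles = []
    for block, scale in zip(blocks, block_scales):
        step = scale * global_scale  # value of one E2M1 unit in this block
        nibbles.extend(encode_e2m1(v / step) if step > 0 else 0 for v in block)  # step 4
    # Step 5: even index -> low nibble, odd index -> high nibble.
    packed = bytes((nibbles[i + 1] << 4) | nibbles[i] for i in range(0, len(nibbles), 2))
    return packed, block_scales, global_scale


def dequantize_nvfp4(packed, block_scales, global_scale):
    """Inverse of quantize_nvfp4, using the same nibble order."""
    out = []
    for i, byte in enumerate(packed):
        step = block_scales[(2 * i) // BLOCK] * global_scale
        for nib in (byte & 0x0F, byte >> 4):  # low nibble first (even index)
            mag = E2M1_VALUES[nib & 0x7] * step
            out.append(-mag if nib & 0x8 else mag)
    return out


weights = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
           -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0, 0.0]
packed, block_scales, global_scale = quantize_nvfp4(weights)
restored = dequantize_nvfp4(packed, block_scales, global_scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(packed.hex(), max_err)
```

Running the same round trip on real `o_a_proj` values and looking at the relative error would give a rough upper bound on what the BF16→NVFP4 conversion costs before any GEMM is involved.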
+
+## Key Commits
+
+| Commit | Description |
+|--------|-------------|
+| `da5572f` | Stop folding global scale into float8 block scales (25% precision loss fix) |
+| `d0ed3d8` | Add L2, SiLU, and scatter pipeline prints |
+| `995589a` | Add FP4 quantization round-trip diagnostic |
+| `c421a66` | Add BF16 reference GEMM + cosine comparison for L1 |
+| `2fd55a9` | Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |
+
+## Bugs Fixed During This Debug Session
+
+### Weight nibble unpack reshape bug (commit `2fd55a9`)
+
+**Bug:** In the BF16 reference diagnostic, `torch.stack([wlo, whi], dim=-1).reshape(w_u8.shape[0], -1)` on a 2D weight of shape `(K_half, N)` = `(3584, 6144)` produced `(3584, 12288)` instead of `(7168, 6144)`. The `-1` absorbed the stacked nibble axis into N (2 × 6144 = 12288) instead of doubling K.
+
+**Fix:** Changed to `.reshape(w_u8.shape[0] * 2, w_u8.shape[1])` to preserve the column (N) dimension and double the row (K) dimension.
+
+**Impact:** Only affected the BF16 reference diagnostic code, not the actual NVFP4 kernel. The CUTLASS kernel receives weights already in the correct packed format.
+
+### igs double-count in reference (commit `2fd55a9`)
+
+**Bug:** The BF16 reference multiplied by `igs` (input global scale) in `x_bf16` AND again in `ref_out = ref_out * igs`.
+
+**Fix:** Removed the final `ref_out * igs`; it is already included via `x_bf16 = x_deq * sf_exp * igs`.
+
+**Impact:** Only affected the BF16 reference diagnostic, not the kernel.
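The reshape bug above is about index order as much as shape, so here is a small pure-Python emulation of the row-major `stack` + `reshape` behaviour on a toy `(K_half, N) = (2, 3)` weight. Assumption: the byte at `w_u8[k, n]` holds logical rows `2k` (low nibble) and `2k + 1` (high nibble), as the `(K_half, N)` shape implies. All helper names are hypothetical. Under that assumption the sketch suggests that changing only the `reshape` arguments yields the right shape, while the element order also depends on which axis the nibbles are stacked on. This may be worth checking against the actual diagnostic, which may already handle it.

```python
def reshape_rows(flat, n_cols):
    """Row-major reshape of a flat list into rows of n_cols."""
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


def stack_last_dim_flat(lo, hi):
    """Flattened equivalent of stacking on the last axis: order (k, n, nibble)."""
    return [v for lo_row, hi_row in zip(lo, hi)
            for pair in zip(lo_row, hi_row) for v in pair]


def stack_dim1_flat(lo, hi):
    """Flattened equivalent of stacking on axis 1: order (k, nibble, n)."""
    return [v for lo_row, hi_row in zip(lo, hi)
            for row in (lo_row, hi_row) for v in row]


def is_correct_layout(rows):
    """True if every element sits at the logical (row, col) it is labelled with."""
    return all(rows[r][c] == (r, c)
               for r in range(len(rows)) for c in range(len(rows[r])))


K_HALF, N = 2, 3
# Label each nibble with the logical (K row, N col) it should end up at.
lo = [[(2 * k, n) for n in range(N)] for k in range(K_HALF)]
hi = [[(2 * k + 1, n) for n in range(N)] for k in range(K_HALF)]

buggy = reshape_rows(stack_last_dim_flat(lo, hi), 2 * N)    # like reshape(K_half, -1)
shape_fixed = reshape_rows(stack_last_dim_flat(lo, hi), N)  # like reshape(2*K_half, N)
interleaved = reshape_rows(stack_dim1_flat(lo, hi), N)      # stack on axis 1, then reshape

for name, rows in (("buggy", buggy), ("shape_fixed", shape_fixed),
                   ("interleaved", interleaved)):
    print(name, (len(rows), len(rows[0])), is_correct_layout(rows))
```

If the real diagnostic behaves like `shape_fixed`, the BF16 reference would still mix K and N neighbours, and a low cosine against the kernel output would not by itself implicate the CUTLASS GEMM.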
+
+## Architecture Notes
+
+### DeepSeek-V4 MoE Layer Forward Pass
+```
+residual = x
+x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
+x = attn_norm(x)
+x = attn(x) ← o_a_proj is BF16→NVFP4 quantized here
+x = hc_post(x, residual, post, comb)
+
+residual = x
+x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
+x = ffn_norm(x)
+x = ffn(x) ← Our NVFP4 mega_moe kernel
+x = hc_post(x, residual, post, comb)
+```
+
+### NVFP4 MoE Pipeline
+```
+stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
+L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
+SiLU(gate) * up → activated
+stage_activation(activated) → l1_fp4, l1_sf, l1_igs
+L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
+scatter with routing weights → y
+```
+
+### Checkpoint Layers (layer 0)
+- **MoE experts 0-210, 212-255:** gate_proj, up_proj, down_proj, all NVFP4 (uint8 + float8 scales + float32 global scale)
+- **Expert 211:** shared expert, gate_proj + up_proj only (no down_proj)
+- **o_a_proj.weight:** BF16 (16384, 4096), NOT quantized by ModelOpt
+- **o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor:** NVFP4
+- **Gate weight, norms, sinks, position_bias:** BF16
+
+## Next Steps
+
+1. **Get BF16 reference cosine:** determine if the CUTLASS GEMM is correct
+2. **If cosine ≈ 1.0:** Problem is upstream (attention, likely o_a_proj). Fix: keep o_a_proj in native BF16
+3. **If cosine << 1.0:** Problem is in the CUTLASS GEMM or the activation quantization. Need to debug the kernel itself
+4. **Test with SKIP_ATTENTION=1:** bypass attention, feed raw input to MoE. If output improves, that confirms attention is the issue
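Steps 1 to 3 hinge on a cosine comparison, so here is a minimal stdlib sketch of that check on flat vectors. It is illustrative only: `swap_pairs` is a hypothetical stand-in for an even/odd (nibble-order) mix-up, chosen because it keeps amax unchanged while rotating the vector. The 0.99 threshold in `next_step` is an assumption, not a value from the log. The last lines reproduce the caveat from hypothesis 5, where the same mix-up applied to both operands cancels in the K-reduction.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two flat vectors; 0.0 if either is all zeros."""
    norm_a, norm_b = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def amax(v):
    return max(abs(x) for x in v)


def swap_pairs(v):
    """Swap each even/odd neighbour (illustrative model of a nibble-order mismatch)."""
    out = list(v)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def next_step(cos, threshold=0.99):  # threshold is an assumed cut-off
    if cos >= threshold:
        return "GEMM matches reference: look upstream (attention, o_a_proj)"
    return "GEMM deviates: debug the kernel or the activation quantization"


reference = [1.0, -2.0, 3.0, -4.0, 0.5, 6.0]  # stand-in for a BF16 reference row
kernel_out = swap_pairs(reference)            # stand-in for a mis-ordered output

print(amax(reference), amax(kernel_out))      # magnitudes agree
cos = cosine_similarity(reference, kernel_out)
print(cos, next_step(cos))                    # direction does not

# The same mix-up on both GEMM operands leaves the dot product unchanged.
a = [0.5, 1.5, -3.0, 2.0]
b = [4.0, -1.0, 0.5, 6.0]
print(math.isclose(dot(a, b), dot(swap_pairs(a), swap_pairs(b))))
```

This is why the reference path should unpack and dequantize independently of the kernel's own packing helpers: a shared convention error would otherwise report cosine ≈ 1.0 and wrongly point upstream.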