NVFP4 MegaMoE Debug Log
Current State (May 15, 2026)
Status: Model produces garbage output. Deterministic prompt "The capital of France is" produces -W'MSG173 ~SB…abych instead of "Paris".
Symptoms
- No NaN/Inf anywhere in the pipeline
- Magnitudes at each stage look reasonable:
  - L1 GEMM output: amax ~8-10
  - SiLU activation: amax ~34-43
  - L2 GEMM output: amax ~17-28
  - Scatter output: amax ~5-15
- FP4 activation quantization round-trip: reconstructed amax ~3.86, igs=1.4e-3
- All 8 TP ranks produce identical weight shapes after transformation
- Experts have distinct weights and scales (not duplicated)
Key observation: The signal is there but buried. "Paris" appears at rank 2075/129280 — the model knows the word exists (logit 9.25) but top logits point at garbage tokens. This suggests a systematic error that preserves magnitude but distorts direction.
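A rank figure like the one above can be reproduced with a small helper. This is a minimal sketch: `token_rank` and the toy logits are hypothetical, and it assumes a 1-based rank that counts only logits strictly greater than the target (the log does not say which convention was used).

```python
def token_rank(logits, token_id):
    """Return the 1-based rank of token_id when logits are sorted descending."""
    target = logits[token_id]
    # Count how many entries beat the target; ties do not worsen the rank.
    return 1 + sum(1 for v in logits if v > target)

# Toy vocabulary of 6 tokens; index 3 stands in for "Paris".
toy_logits = [11.0, 9.5, 12.25, 9.25, 3.0, 10.0]
paris_rank = token_rank(toy_logits, 3)
print(f"rank {paris_rank}/{len(toy_logits)}")
```

Tracking this rank per layer (for example by projecting intermediate hidden states through the output head) would show where the correct token starts to sink, if that projection is available in the harness.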
Pipeline Trace (per layer, from last inference)
[L1-GEMM-OUT] slots=1 N=6144 amax=8.6250e+00
[L1-SPLIT] gate amax=7.1250e+00 | up amax=8.6250e+00
[SILU-ACT] amax=3.4500e+01
[L2-GEMM-OUT] slots=1 N=7168 amax=1.8500e+01
[SCATTER] y amax=6.7500e+00 slots=1
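A quick plausibility check on the trace: because |SiLU(g)| never exceeds |g|, the activated amax cannot exceed what the gate and up extremes would give if they coincided. The sketch below plugs in the trace values; it assumes the three lines come from the same token and layer, which the trace suggests but does not guarantee.

```python
import math

def silu(x):
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

gate_amax = 7.125   # [L1-SPLIT] gate amax from the trace
up_amax = 8.625     # [L1-SPLIT] up amax from the trace
observed = 34.5     # [SILU-ACT] amax from the trace

# Largest value SiLU(gate) * up could take if both extremes landed
# on the same element with a positive gate.
upper_bound = silu(gate_amax) * up_amax
print(f"upper bound {upper_bound:.2f}, observed {observed:.2f}")
within_bound = observed <= upper_bound
```

The observed value sits inside the bound, which is consistent with the statement that magnitudes look reasonable; it says nothing about direction.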
Hypotheses Investigated
1. ❌ NaN/Inf in GEMM
Ruled out. All outputs finite, no NaN detected at any stage.
2. ❌ Weight shape mismatch
Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).
3. ❌ Global scale folding precision loss
Previously identified (commit da5572f). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision in low-precision zone. Fixed by passing global scales as per-expert alpha instead of folding. Did not fix the garbage output.
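The effect of folding can be illustrated with a toy rounding model. This is a sketch under assumptions: it takes the float8 format to be E4M3 (the "fn" variant with a maximum of 448), and the two scale values are hypothetical numbers chosen so the product lands in the subnormal range. It is not the code from commit da5572f.

```python
def e4m3_values():
    """All non-negative finite float8 E4M3 (fn variant) values."""
    vals = [m * 2.0 ** -9 for m in range(8)]      # subnormals: (m/8) * 2^-6
    for e in range(1, 16):
        for m in range(8):
            if e == 15 and m == 7:
                continue                          # this bit pattern is NaN
            vals.append((1 + m / 8) * 2.0 ** (e - 7))
    return vals

E4M3 = e4m3_values()

def round_e4m3(x):
    """Round a non-negative value to the nearest representable E4M3 value."""
    return min(E4M3, key=lambda v: abs(v - x))

block_sf = 0.75      # hypothetical block scale, exactly representable
global_sf = 0.004    # hypothetical float32 global scale
exact = block_sf * global_sf

folded = round_e4m3(exact)                    # fold first, then store as float8
separate = round_e4m3(block_sf) * global_sf   # keep global scale as a float alpha

folded_err = abs(folded - exact) / exact
separate_err = abs(separate - exact) / exact
print(f"folded error {folded_err:.1%}, separate error {separate_err:.1%}")
```

With these toy numbers the folded product falls between two subnormal steps and picks up a large relative error, while keeping the global scale outside the float8 value leaves the block scale exact.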
4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)
Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation (commit history). Standalone test shows cosine=1.0 and MSE=0.0 for random data.
5. 🔍 E2M1 packing convention mismatch
Status: Open. The CUTLASS kernel expects nv_float4_t<float_e2m1_t> operands packed as 2 nibbles per byte. Our stage_activation packs (nibbles[..., 1] << 4) | nibbles[..., 0] (even→low, odd→high), and the checkpoint weights use the same convention. The standalone test showed cosine 1.0 with this packing, but both A and B were packed the same way, so a convention error shared by both operands would cancel in the dot product and the test would not catch it.
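The cancellation concern can be shown on a toy dot product. The sketch below is pure Python with hypothetical helper names; it uses the 4-bit codes directly as values and ignores block scales, so it only illustrates the argument and does not model the CUTLASS kernel.

```python
def pack_low_first(nibbles):
    """Pack 4-bit codes two per byte: even index -> low nibble, odd -> high."""
    return [(nibbles[i + 1] << 4) | nibbles[i] for i in range(0, len(nibbles), 2)]

def unpack_low_first(packed):
    """Inverse of pack_low_first."""
    out = []
    for b in packed:
        out.extend([b & 0x0F, b >> 4])
    return out

def unpack_high_first(packed):
    """The opposite convention: the high nibble holds the even index."""
    out = []
    for b in packed:
        out.extend([b >> 4, b & 0x0F])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [1, 2, 3, 4, 5, 6]   # toy 4-bit codes used directly as values
b = [6, 1, 5, 2, 4, 3]
pa, pb = pack_low_first(a), pack_low_first(b)

reference = dot(a, b)
both_swapped = dot(unpack_high_first(pa), unpack_high_first(pb))
one_swapped = dot(unpack_high_first(pa), unpack_low_first(pb))
print(reference, both_swapped, one_swapped)
```

Swapping the nibble order in both operands only permutes pairs along K, so the sum is unchanged; swapping it in one operand changes the result. A test that compares the kernel against an independently unpacked reference for a single operand would distinguish the two cases.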
6. 🔍 Attention output corruption from o_a_proj quantization
Status: Active investigation. The checkpoint has o_a_proj.weight as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time because the model parameter is declared uint8. This is a lossy conversion of a 64M-parameter matrix that sits right before the MoE. If the quantization error here is significant, it propagates through all 61 MoE layers.
The vLLM weight loader does:
- Compute per-block amax for the BF16 weight
- Compute global scale: amax_max / (6.0 * 448.0)
- Compute block scales: amax / (6.0 * global_scale) → float8
- Nearest-neighbor E2M1 quantization
- Pack 2 nibbles per byte: even→low, odd→high
o_a_proj may need to stay in native BF16 and route through a BF16 matmul path instead of being quantized at load time.
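The loader steps listed above can be sketched on a single row to get a feel for the size of the quantization error. This is an illustrative pure-Python version, not the vLLM loader: the function names are hypothetical, the block size of 16 is inferred from the scale-factor shapes in this log (7168 / 448), block scales are kept as Python floats instead of being cast to float8, and ties in the nearest-neighbor step go to the smaller magnitude.

```python
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # magnitudes for codes 0..7
BLOCK = 16                                        # assumed block size

def quantize_row(row):
    """Quantize one row of floats to packed E2M1 bytes plus scales."""
    blocks = [row[i:i + BLOCK] for i in range(0, len(row), BLOCK)]
    amax = [max(abs(v) for v in blk) for blk in blocks]
    global_scale = max(amax) / (6.0 * 448.0) or 1.0   # guard an all-zero row
    # The real loader casts these to float8; that extra rounding is skipped here.
    block_scales = [a / (6.0 * global_scale) for a in amax]
    codes = []
    for blk, sf in zip(blocks, block_scales):
        step = sf * global_scale                  # value represented by E2M1 1.0
        for v in blk:
            mag = abs(v) / step if step else 0.0
            idx = min(range(8), key=lambda i: abs(E2M1[i] - mag))
            codes.append(idx | (8 if v < 0 else 0))   # bit 3 is the sign
    # Pack 2 nibbles per byte: even -> low, odd -> high.
    packed = [(codes[i + 1] << 4) | codes[i] for i in range(0, len(codes), 2)]
    return packed, block_scales, global_scale

def dequantize_row(packed, block_scales, global_scale):
    """Invert quantize_row up to rounding error."""
    out = []
    for j, b in enumerate(packed):
        for k, code in enumerate((b & 0x0F, b >> 4)):
            sf = block_scales[(2 * j + k) // BLOCK]
            sign = -1.0 if code & 8 else 1.0
            out.append(sign * E2M1[code & 7] * sf * global_scale)
    return out

row = [0.05 * i for i in range(16)] + [-1.5 * (i % 4) for i in range(16)]
packed, scales, gscale = quantize_row(row)
recon = dequantize_row(packed, scales, gscale)
worst = max(abs(r - v) for r, v in zip(recon, row))
print(f"worst absolute error {worst:.4f}")
```

Because the widest gap in the E2M1 grid is between 4 and 6, the worst-case error per element is about one sixth of the block amax. Running the same round trip on the real o_a_proj weight and comparing against the BF16 original would quantify how lossy the load-time conversion is.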
7. 🔍 BF16 reference comparison
Status: In progress. Adding a diagnostic that dequantizes FP4 activation + FP4 weights back to BF16, runs a reference matmul, then compares to the NVFP4 GEMM output via cosine similarity. This will isolate whether the CUTLASS kernel is producing correct output given the same quantized inputs.
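One caveat for this comparison is that cosine similarity is invariant to a scalar factor, so a missing or doubled global scale would still report cosine 1.0. The sketch below pairs cosine with a peak-magnitude ratio; the helper names and toy vectors are hypothetical, and the 1.4e-3 factor simply reuses the igs value quoted in the symptoms.

```python
import math

def cosine(a, b):
    """Cosine similarity between two flat sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def amax_ratio(a, b):
    """Ratio of peak magnitudes; catches scale errors that cosine cannot see."""
    return max(abs(x) for x in a) / max(abs(y) for y in b)

kernel_out = [1.0, -2.0, 3.5, 0.25]            # toy stand-in for GEMM output
reference = [v * 1.4e-3 for v in kernel_out]   # same direction, wrong scale

cos = cosine(kernel_out, reference)
ratio = amax_ratio(kernel_out, reference)
print(f"cosine {cos:.6f}, amax ratio {ratio:.1f}")
```

Reporting both numbers makes a direction error and a scale error distinguishable in the diagnostic output.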
Key Commits
| Commit | Description |
|---|---|
| da5572f | Stop folding global scale into float8 block scales (25% precision loss fix) |
| d0ed3d8 | Add L2, SiLU, and scatter pipeline prints |
| 995589a | Add FP4 quantization round-trip diagnostic |
| c421a66 | Add BF16 reference GEMM + cosine comparison for L1 |
| 2fd55a9 | Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |
Bugs Fixed During This Debug Session
Weight nibble unpack reshape bug (commit 2fd55a9)
Bug: In the BF16 reference diagnostic, torch.stack([wlo, whi], dim=-1).reshape(w_u8.shape[0], -1) on a 2D weight of shape (K_half, N) = (3584, 6144) produced (3584, 12288) instead of (7168, 6144). The -1 was consuming the N dimension.
Fix: Changed to .reshape(w_u8.shape[0] * 2, w_u8.shape[1]) to preserve the column (N) dimension and double the row (K) dimension.
Impact: Only affected the BF16 reference diagnostic code, not the actual NVFP4 kernel. The CUTLASS kernel receives weights already in the correct packed format.
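The shape change described above can be reproduced on a toy tensor. The sketch below is pure Python and emulates row-major stack and reshape semantics; the helper names are hypothetical, and it assumes the two nibbles of each byte hold consecutive K indices (rows), which the log implies but does not state outright. It also suggests that reshaping to (2*K_half, N) after stacking on the last axis yields the right shape without necessarily interleaving along K, so element order in the diagnostic may be worth checking as well as shape.

```python
def unpack_rows(packed):
    """Unpack (K_half, N) bytes into (2*K_half, N), interleaving along K."""
    out = []
    for row in packed:
        out.append([b & 0x0F for b in row])   # even K index -> low nibble
        out.append([b >> 4 for b in row])     # odd K index -> high nibble
    return out

def stack_last_then_reshape(packed, rows, cols):
    """Emulate stack([lo, hi], dim=-1).reshape(rows, cols) in row-major order."""
    flat = []
    for row in packed:
        for b in row:
            flat.extend([b & 0x0F, b >> 4])
    assert len(flat) == rows * cols
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# Toy packed weight with K_half = 2, N = 2; nibble values are 0..7.
packed = [[0x10, 0x32],
          [0x54, 0x76]]

buggy = stack_last_then_reshape(packed, 2, 4)      # reshape(K_half, -1)
reshaped = stack_last_then_reshape(packed, 4, 2)   # reshape(K_half * 2, N)
interleaved = unpack_rows(packed)                  # explicit K interleave

print("buggy      ", buggy)
print("reshaped   ", reshaped)
print("interleaved", interleaved)
```

In torch terms, the explicit K interleave corresponds to stacking on dim=1 before the reshape, for example torch.stack([wlo, whi], dim=1).reshape(2 * K_half, N). Whether the diagnostic needs that ordering depends on how the checkpoint lays out the packed axis, so this is a point to verify, not a confirmed bug.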
igs double-count in reference (commit 2fd55a9)
Bug: The BF16 reference multiplied by igs (input global scale) in x_bf16 AND again in ref_out = ref_out * igs.
Fix: Removed the final ref_out * igs — it's already included via x_bf16 = x_deq * sf_exp * igs.
Impact: Only affected the BF16 reference diagnostic, not the kernel.
Architecture Notes
DeepSeek-V4 MoE Layer Forward Pass
residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x) ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)
residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x) ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)
NVFP4 MoE Pipeline
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
Checkpoint Layers (layer 0)
- MoE experts 0-210, 212-255: gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
- Expert 211: shared expert, gate_proj + up_proj only (no down_proj)
- o_a_proj.weight: BF16 (16384, 4096) — NOT quantized by ModelOpt
- o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor: NVFP4
- Gate weight, norms, sinks, position_bias: BF16
Next Steps
- Get BF16 reference cosine — determine if the CUTLASS GEMM is correct
  - If cosine ≈ 1.0: Problem is upstream (attention, likely o_a_proj). Fix: keep o_a_proj in native BF16
  - If cosine << 1.0: Problem is in the CUTLASS GEMM or the activation quantization. Need to debug the kernel itself
- Test with SKIP_ATTENTION=1 — bypass attention, feed raw input to MoE. If output improves, confirms attention is the issue