NVFP4 MegaMoE Debug Log

Current State (May 15, 2026)

Status: Model produces garbage output. Deterministic prompt "The capital of France is" produces -W'MSG173 ~SB…abych instead of "Paris".

Symptoms

  • No NaN/Inf anywhere in the pipeline
  • Magnitudes at each stage look reasonable:
    • L1 GEMM output: amax ~8-10
    • SiLU activation: amax ~34-43
    • L2 GEMM output: amax ~17-28
    • Scatter output: amax ~5-15
  • FP4 activation quantization round-trip: reconstructed amax ~3.86, igs=1.4e-3
  • All 8 TP ranks produce identical weight shapes after transformation
  • Experts have distinct weights and scales (not duplicated)

Key observation: The signal is there but buried. "Paris" appears at rank 2075/129280 — the model knows the word exists (logit 9.25) but top logits point at garbage tokens. This suggests a systematic error that preserves magnitude but distorts direction.
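The rank quoted above can be computed as in the following sketch. The function name `token_rank` and the toy logits are hypothetical; the real check would run over the full 129280-entry logits vector.

```python
def token_rank(logits, token_id):
    """1-based rank of token_id when logits are sorted in descending order."""
    target = logits[token_id]
    # Count how many entries score strictly higher than the target token.
    return 1 + sum(1 for v in logits if v > target)

# Toy vocabulary of 6 tokens; index 3 stands in for "Paris" (logit 9.25).
toy_logits = [11.0, 9.5, 12.25, 9.25, -1.0, 10.0]
paris_rank = token_rank(toy_logits, 3)
print(paris_rank)
```

A high logit with a poor rank, as seen here, means many unrelated tokens were pushed above the correct one, which fits a direction error better than a magnitude error.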

Pipeline Trace (per layer, from last inference)

[L1-GEMM-OUT]  slots=1 N=6144 amax=8.6250e+00
[L1-SPLIT]     gate amax=7.1250e+00 | up amax=8.6250e+00
[SILU-ACT]     amax=3.4500e+01
[L2-GEMM-OUT]  slots=1 N=7168 amax=1.8500e+01
[SCATTER]      y amax=6.7500e+00 slots=1

Hypotheses Investigated

1. NaN/Inf in GEMM

Ruled out. All outputs finite, no NaN detected at any stage.

2. Weight shape mismatch

Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).

3. Global scale folding precision loss

Previously identified (commit da5572f). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision in low-precision zone. Fixed by passing global scales as per-expert alpha instead of folding. Did not fix the garbage output.
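A minimal sketch of how folding can lose about 25%, assuming "low-precision zone" refers to the float8 e4m3 subnormal range (below 2^-6), where the spacing between representable values is a fixed 2^-9. The helper `round_e4m3` is a simplified pure-Python rounding model written for this illustration, and the scale values are made up, not taken from the checkpoint.

```python
import math

E4M3_MAX = 448.0
E4M3_MIN_NORMAL_EXP = -6  # below 2**-6 the format is subnormal

def round_e4m3(x):
    """Round a non-negative float to the nearest e4m3 value (simplified model)."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    e = max(math.floor(math.log2(x)), E4M3_MIN_NORMAL_EXP)
    step = 2.0 ** (e - 3)  # 3 mantissa bits give 8 steps per binade
    return min(round(x / step) * step, E4M3_MAX)

block_sf = 0.75      # illustrative block scale, exactly representable in e4m3
global_sf = 0.0035   # illustrative float32 global scale

exact = block_sf * global_sf
folded = round_e4m3(exact)                 # folding: product stored as float8
kept = round_e4m3(block_sf) * global_sf    # separate: global scale stays float32

rel_err = abs(folded - exact) / exact
print(f"folded={folded:.6g} exact={exact:.6g} rel_err={rel_err:.1%}")
```

With these values the folded product lands between the first and second subnormal steps and rounds down, while keeping the global scale separate reproduces the product exactly.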

4. Broken kernel (CUDA_ERROR_LAUNCH_FAILED)

Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation (commit history). Standalone test shows cosine=1.0 and MSE=0.0 for random data.

5. 🔍 E2M1 packing convention mismatch

Status: Open. The CUTLASS kernel expects nv_float4_t<float_e2m1_t> packed as 2 nibbles per byte. Our stage_activation packs (nibbles[..., 1] << 4) | nibbles[..., 0] (even→low, odd→high). The checkpoint weights use the same convention. The standalone test showed cosine 1.0 with this packing, but both A and B were packed the same way — if both are wrong in the same way, the error cancels.
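The cancellation argument can be illustrated with a small pure-Python sketch. It uses the raw 4-bit codes as values instead of decoded E2M1 numbers, and the helper names are hypothetical. Only `pack_low_first` mirrors the expression quoted from stage_activation.

```python
def pack_low_first(nibbles):
    """Even index goes to the low nibble, odd index to the high nibble."""
    return [(nibbles[i + 1] << 4) | nibbles[i] for i in range(0, len(nibbles), 2)]

def unpack_low_first(packed):
    out = []
    for byte in packed:
        out += [byte & 0xF, byte >> 4]
    return out

def unpack_high_first(packed):
    """Opposite convention: the high nibble is read as the even index."""
    out = []
    for byte in packed:
        out += [byte >> 4, byte & 0xF]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = [1, 2, 3, 4, 5, 6, 7, 0]
b = [7, 1, 0, 2, 3, 3, 1, 5]
pa, pb = pack_low_first(a), pack_low_first(b)

ref = dot(a, b)
both_swapped = dot(unpack_high_first(pa), unpack_high_first(pb))
one_swapped = dot(unpack_low_first(pa), unpack_high_first(pb))
print(ref, both_swapped, one_swapped)
```

Swapping both operands permutes the K index identically on each side, so the dot product is unchanged; swapping only one side changes it. A test that packs A and B with the same helper therefore cannot detect a convention mismatch with the kernel.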

6. 🔍 Attention output corruption from o_a_proj quantization

Status: Active investigation. The checkpoint has o_a_proj.weight as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time because the model parameter is declared uint8. This is a lossy conversion of a 64M-parameter matrix that sits right before the MoE. If the quantization error here is significant, it propagates through all 61 MoE layers.

The vLLM weight loader does:

  1. Compute per-block amax for the BF16 weight
  2. Compute global scale: amax_max / (6.0 * 448.0)
  3. Compute block scales: amax / (6.0 * global_scale) → float8
  4. Nearest-neighbor E2M1 quantization
  5. Pack 2 nibbles per byte: even→low, odd→high

This may need to stay in native BF16 and route through a BF16 matmul path instead.
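The five loader steps listed above can be sketched in plain Python as follows. This follows the steps as written in this log, not the actual vLLM source. The block size of 16 is inferred from the shapes in hypothesis 2 (K=7168 with 448 scale rows). The function names, the simplified `round_e4m3` model, and the tie-breaking rule (lower code wins) are assumptions made for illustration.

```python
import math

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes for codes 0..7
BLOCK = 16  # elements per scale block (assumed from 7168 / 448)

def round_e4m3(x):
    """Round a non-negative float to the nearest e4m3 value (simplified model)."""
    if x <= 0.0:
        return 0.0
    x = min(x, 448.0)
    e = max(math.floor(math.log2(x)), -6)
    step = 2.0 ** (e - 3)
    return min(round(x / step) * step, 448.0)

def quantize_nvfp4(weights):
    """Quantize a flat list of floats; assumes at least one non-zero weight."""
    blocks = [weights[i:i + BLOCK] for i in range(0, len(weights), BLOCK)]
    block_amax = [max(abs(v) for v in blk) for blk in blocks]               # step 1
    global_scale = max(block_amax) / (6.0 * 448.0)                          # step 2
    block_sf = [round_e4m3(a / (6.0 * global_scale)) for a in block_amax]   # step 3
    codes = []
    for blk, sf in zip(blocks, block_sf):
        scale = sf * global_scale
        for v in blk:
            mag = abs(v) / scale if scale > 0 else 0.0
            idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))    # step 4
            codes.append(idx | (8 if v < 0 else 0))  # bit 3 is the sign bit
    packed = [(codes[i + 1] << 4) | codes[i] for i in range(0, len(codes), 2)]  # step 5
    return packed, block_sf, global_scale

def dequantize_nvfp4(packed, block_sf, global_scale):
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):  # low nibble is the even index
            sf = block_sf[len(out) // BLOCK]
            mag = E2M1_VALUES[code & 7] * sf * global_scale
            out.append(-mag if code & 8 else mag)
    return out

weights = [math.sin(i) * (1 + i % 5) for i in range(32)]
packed, block_sf, global_scale = quantize_nvfp4(weights)
restored = dequantize_nvfp4(packed, block_sf, global_scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"block_sf={block_sf} max_err={max_err:.3f}")
```

Running the same round trip on the real o_a_proj weight and comparing against the BF16 original would give a direct estimate of how lossy the load-time quantization is.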

7. BF16 reference comparison — COSINE ≈ 0

Status: CONFIRMED. The BF16 reference comparison ran (after fixing several bugs in the diagnostic code). Result: cosine similarity ≈ 0 between NVFP4 GEMM output and BF16 dequantized reference. This means the CUTLASS kernel is producing output that is essentially uncorrelated with the correct result.

Results from all 8 TP ranks:

[TP0] cosine=-0.001789  mse=1.0201e+01  nvfp4_amax=8.5625  ref_amax=8.0000
[TP1] cosine= 0.030470  mse=1.0157e+01  nvfp4_amax=8.0625  ref_amax=8.3125
[TP2] cosine=-0.009217  mse=9.5978e+00  nvfp4_amax=9.1875  ref_amax=7.5312
[TP3] cosine= 0.001786  mse=9.4161e+00  nvfp4_amax=8.6875  ref_amax=8.8750
[TP4] cosine= 0.007108  mse=7.5709e+00  nvfp4_amax=7.3125  ref_amax=8.8750
[TP5] cosine=-0.000572  mse=7.8290e+00  nvfp4_amax=7.5938  ref_amax=10.562
[TP6] cosine= 0.012143  mse=9.2720e+00  nvfp4_amax=8.0000  ref_amax=8.1250
[TP7] cosine=-0.010009  mse=9.0296e+00  nvfp4_amax=6.6250  ref_amax=36.500

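Key insight: On TP0 through TP6 the magnitudes are in the same ballpark (nvfp4_amax 7.3 to 9.2 vs ref_amax 7.5 to 10.6), but the direction is completely wrong. TP7 is an outlier (nvfp4_amax 6.6 vs ref_amax 36.5) and is not yet explained. This is NOT a pure scaling error; it is a systematic misalignment. The output vectors are essentially random relative to the reference.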

This points to the CUTLASS GEMM itself (or the data layout going into it), NOT the attention, weight loading, or scaling math, provided the BF16 reference is now correct. The standalone test with random data showed cosine 1.0, but real data gives cosine ≈ 0. The difference is most likely in a data layout, stride, or alignment case that the random test did not exercise.
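For reference, the two metrics in the table can be computed as in this sketch (plain Python; the vectors are toy data, not kernel output). It also shows why cosine separates a scaling error from a misalignment: rescaling keeps cosine at 1, while misplacing elements keeps amax but destroys cosine.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def amax(a):
    return max(abs(x) for x in a)

ref = [3.0, -4.0, 1.0, 2.0]
scaled = [2.0 * v for v in ref]      # wrong magnitude, right direction
shuffled = [2.0, 1.0, -4.0, 3.0]     # same magnitudes, elements misplaced

cos_scaled = cosine_similarity(ref, scaled)
cos_shuffled = cosine_similarity(ref, shuffled)
print(f"scaled:   cosine={cos_scaled:.4f} mse={mse(ref, scaled):.2f}")
print(f"shuffled: cosine={cos_shuffled:.4f} mse={mse(ref, shuffled):.2f}")
```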

8. 🔍 CUTLASS GEMM layout mismatch

Status: Active investigation. The standalone test used random data with simple row-major layout and got cosine 1.0. Real data also uses row-major layout, but cosine ≈ 0. Possible causes:

  • SF remap incorrect for specific M/N/K dimensions: the remap was verified with coordinate probes for the standalone test dimensions, but the real MoE dimensions (M=1, N=6144, K=7168) may exercise a different code path
  • Activation layout: stage_activation produces flat row-major packed E2M1, but CUTLASS may expect a different micro-tiling for the A matrix
  • Weight transpose convention: after the transform_nvfp4_weights_for_mega_moe transpose, the weight may not be in the layout CUTLASS expects for B (column-major vs row-major interpretation)
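A coordinate probe for the K dimension could look like the sketch below. `gemm_under_test` is a hypothetical stand-in that deliberately swaps adjacent K pairs in A; in practice it would be replaced by a call into the real kernel. The probe values would also have to be representable in E2M1 with suitable scale factors, which this float-only sketch ignores.

```python
def reference_gemv(a, b):
    """a: length-K vector, b: K x N matrix as a list of rows. Returns length N."""
    K, N = len(b), len(b[0])
    return [sum(a[k] * b[k][n] for k in range(K)) for n in range(N)]

def gemm_under_test(a, b):
    """Hypothetical faulty kernel: reads A with each adjacent K pair swapped."""
    swapped = list(a)
    for k in range(0, len(a) - 1, 2):
        swapped[k], swapped[k + 1] = a[k + 1], a[k]
    return reference_gemv(swapped, b)

def probe_k_mapping(gemm, K):
    """For each logical k in A, report which row of B the GEMM pairs it with."""
    # Row k of B holds the value k + 1, so the output reveals the row index.
    b = [[float(k + 1)] for k in range(K)]
    mapping = []
    for k in range(K):
        one_hot = [1.0 if i == k else 0.0 for i in range(K)]
        mapping.append(int(gemm(one_hot, b)[0]) - 1)
    return mapping

good_mapping = probe_k_mapping(reference_gemv, 8)
bad_mapping = probe_k_mapping(gemm_under_test, 8)
print(good_mapping)
print(bad_mapping)
```

An identity mapping means A and B agree on the K order. Any other permutation shows exactly which indices are misaligned, and the same idea applies to N by probing with one-hot columns of B.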

Key Commits

| Commit | Description |
| --- | --- |
| da5572f | Stop folding global scale into float8 block scales (25% precision loss fix) |
| d0ed3d8 | Add L2, SiLU, and scatter pipeline prints |
| 995589a | Add FP4 quantization round-trip diagnostic |
| c421a66 | Add BF16 reference GEMM + cosine comparison for L1 |
| 2fd55a9 | Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |
| 9159cb6 | Add DEBUG_LOG.md documentation |
| de8acc7 | Dump raw GEMM inputs + first 8 output values |
| 755f9ad | Fix per_expert_alpha ref + clean up BF16 reference scaling |
| df916b8 | Fix gs.item() for multi-element tensor |
| 7739674 | Fix gs scalar conversion with .cpu().tolist() + add traceback |

Bugs Fixed During This Debug Session

Weight nibble unpack reshape bug (commit 2fd55a9)

Bug: In the BF16 reference diagnostic, torch.stack([wlo, whi], dim=-1).reshape(w_u8.shape[0], -1) on a 2D weight of shape (K_half, N) = (3584, 6144) produced (3584, 12288) instead of (7168, 6144). The -1 was consuming the N dimension.

Fix: Changed to .reshape(w_u8.shape[0] * 2, w_u8.shape[1]) to preserve the column (N) dimension and double the row (K) dimension.

Impact: Only affected the BF16 reference diagnostic code, not the actual NVFP4 kernel. The CUTLASS kernel receives weights already in the correct packed format.
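One thing worth double-checking here: the log does not say whether the stack axis changed along with the reshape. Under the assumption that nibbles are packed along K (byte [kh, n] holds K rows 2*kh and 2*kh+1), reshaping to (K_half*2, N) after stacking on dim=-1 gives the right shape but not the right element order, whereas stacking on dim=1 gives both. The sketch below emulates the row-major memory order of both variants with plain lists, and every name in it is illustrative.

```python
K_HALF, N = 2, 3

# Each value encodes its true (k, n) position as 10*k + n.
lo = [[10 * (2 * kh) + n for n in range(N)] for kh in range(K_HALF)]
hi = [[10 * (2 * kh + 1) + n for n in range(N)] for kh in range(K_HALF)]

def reshape(flat, rows, cols):
    """Row-major reshape of a flat list into rows x cols."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

# stack([lo, hi], dim=-1) has memory order [kh][n][lo/hi].
flat_last = [v for kh in range(K_HALF) for n in range(N)
             for v in (lo[kh][n], hi[kh][n])]
# stack([lo, hi], dim=1) has memory order [kh][lo/hi][n].
flat_mid = [v for kh in range(K_HALF) for half in (lo, hi) for v in half[kh]]

buggy = reshape(flat_last, K_HALF, 2 * N)        # original bug: (K_half, 2N)
shape_only = reshape(flat_last, 2 * K_HALF, N)   # right shape, scrambled rows
row_exact = reshape(flat_mid, 2 * K_HALF, N)     # row k holds K index k

expected = [[10 * k + n for n in range(N)] for k in range(2 * K_HALF)]
print(shape_only)
print(row_exact)
```

If the diagnostic still stacks on the last axis, the reference weight would be scrambled even with the corrected shape, and a scrambled reference alone would also yield cosine ≈ 0.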

igs double-count in reference (commit 2fd55a9)

Bug: The BF16 reference multiplied by igs (input global scale) in x_bf16 AND again in ref_out = ref_out * igs.

Fix: Removed the final ref_out * igs — it's already included via x_bf16 = x_deq * sf_exp * igs.

Impact: Only affected the BF16 reference diagnostic, not the kernel.

BF16 reference diagnostic: multiple bugs (commits c421a66 through 7739674)

The BF16 reference comparison had a cascade of bugs that took 4 iterations to fix:

  1. Weight reshape bug (commit 2fd55a9): reshape(K_half, -1) on 2D weight flattened N dimension. Fixed: reshape(K_half*2, N).
  2. per_expert_alpha not defined (commit 755f9ad): The reference code ran before per_expert_alpha was computed. Fixed: use l1_alpha * l1_global_sf[e0] directly.
  3. gs.item() on multi-element tensor (commits df916b8, 7739674): gs is shape (2,) — gs[0].item() should work but didn't in context. Fixed: gs.detach().cpu().tolist().
  4. igs double-count (commit 2fd55a9): Multiplying by igs in both x_bf16 and the final output. Fixed: apply igs once in x, apply gs per-half separately.

Impact: All bugs only in diagnostic code. The actual NVFP4 kernel was never affected.

Architecture Notes

DeepSeek-V4 MoE Layer Forward Pass

residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x)                          ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)

residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x)                           ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)

NVFP4 MoE Pipeline

stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
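The gating step in the middle of the pipeline can be sketched in float arithmetic as below. The split into a contiguous gate half followed by an up half is an assumption (the log only shows that both halves exist), and `gated_activation` is a hypothetical name. The last lines check that the traced SiLU amax is consistent with the traced gate and up amax values.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def gated_activation(gate_up):
    """Split one L1 output row into gate and up halves, return SiLU(gate) * up."""
    half = len(gate_up) // 2
    gate, up = gate_up[:half], gate_up[half:]  # assumed layout: [gate | up]
    return [silu(g) * u for g, u in zip(gate, up)]

out = gated_activation([0.0, 2.0, 5.0, -3.0])
print(out)

# Sanity bound from the trace: gate amax 7.125, up amax 8.625.
# |SiLU(g)| <= SiLU(7.125) for |g| <= 7.125, so the product cannot exceed this.
bound = silu(7.125) * 8.625
print(f"upper bound on SILU-ACT amax: {bound:.2f}")
```

The traced [SILU-ACT] amax of 34.5 sits below this bound, so the activation stage is at least consistent with its inputs.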

Checkpoint Layers (layer 0)

  • MoE experts 0-210, 212-255: gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
  • Expert 211: shared expert, gate_proj + up_proj only (no down_proj)
  • o_a_proj.weight: BF16 (16384, 4096) — NOT quantized by ModelOpt
  • o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor: NVFP4
  • Gate weight, norms, sinks, position_bias: BF16

Next Steps

  1. Get BF16 reference cosine to determine if the CUTLASS GEMM is correct. Done (hypothesis 7): cosine ≈ 0 on all 8 TP ranks
  2. If cosine ≈ 1.0: Problem is upstream (attention, likely o_a_proj). Fix: keep o_a_proj in native BF16. Not the observed case
  3. If cosine << 1.0: Problem is in the CUTLASS GEMM or the activation quantization. Need to debug the kernel itself. This is the observed case; continue with hypothesis 8
  4. Test with SKIP_ATTENTION=1 — bypass attention, feed raw input to MoE. If output improves, confirms attention is the issue