NVFP4 MegaMoE Debug Log
Current State (May 15, 2026)
Status: Root cause identified and fixed. Awaiting rebuild and test.
Root cause: The SF (scale factor) remap kernel in cutlass_nvfp4_gemm.cu used cute::size(layout_sf) as the iteration bound instead of cute::cosize(layout_sf). size returns the logical element count; cosize returns the physical extent, including tile padding. The destination buffer was allocated with cosize elements (correct) and zero-initialized, but the kernel only iterated over size elements (incorrect), leaving tile-padding positions as zero instead of their actual SF values.
Why it was invisible in the all-ones test: When all SF values are identical (uniform data), missing writes don't matter — every position should have the same value, and the ones that got written have the right one. The standalone test from the previous session used a single global scale for all blocks, producing uniform SF, which is why it showed cosine 1.0.
Why it broke with real data: Different blocks have different SF values. The tile-padding positions in the CUTLASS interleaved SF layout need specific SF values, but they were left as zero. CUTLASS reads those positions during the GEMM, getting zero scales instead of the correct values, which scrambles the output direction while preserving approximate magnitude.
Fix: One-line change in cutlass_nvfp4_gemm.cu line 128: cute::size → cute::cosize (commit c384198).
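The effect of the wrong bound can be reproduced with a toy model in plain Python. This is a sketch under stated assumptions, not the actual kernel: it assumes one thread per physical offset of the destination buffer, and the padded row-major layout, the `remap` and `sf_value` helpers, and all sizes are hypothetical. Only the size/cosize distinction comes from this log.

```python
# Toy model of the size-vs-cosize bug. The layout, helper names, and the
# "one thread per physical offset" scheme are illustrative assumptions; the
# real CUTLASS interleaved SF layout is more involved.

ROWS, COLS, PADDED_COLS = 4, 3, 4  # each row is padded from 3 to 4 slots


def size() -> int:
    """Logical element count of the layout (analogous to cute::size)."""
    return ROWS * COLS


def cosize() -> int:
    """Physical extent: largest offset + 1 (analogous to cute::cosize)."""
    return (ROWS - 1) * PADDED_COLS + (COLS - 1) + 1


def sf_value(offset: int) -> float:
    """Hypothetical non-zero scale factor expected at a physical offset."""
    return 1.0 + 0.25 * offset


def remap(total: int) -> list:
    """Simulate the remap kernel: thread i fills physical offset i."""
    dst = [0.0] * cosize()      # allocated with cosize, zero-initialized
    for i in range(total):      # `total` is the iteration bound under test
        dst[i] = sf_value(i)
    return dst


buggy = remap(size())     # bound = size: the tail is never written
fixed = remap(cosize())   # bound = cosize: every slot is written

stale_buggy = [i for i, v in enumerate(buggy) if v == 0.0]
stale_fixed = [i for i, v in enumerate(fixed) if v == 0.0]

print("size =", size(), "cosize =", cosize())
print("offsets left at zero with the size bound:  ", stale_buggy)
print("offsets left at zero with the cosize bound:", stale_fixed)
```

In this model the buffer is the right size in both cases; only the loop bound differs, which is why an audit of allocation sites alone would not catch it.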
Original symptoms:
- Deterministic prompt "The capital of France is" → "-W'MSG173 ~SB…abych" instead of "Paris"
- No NaN/Inf, magnitudes reasonable, but cosine similarity ≈ 0 between NVFP4 GEMM and BF16 reference
How We Found It
Step 1: Pipeline trace
Added debug prints at every stage (L1 GEMM, SiLU, L2 GEMM, scatter). All magnitudes reasonable, no NaN. The signal was present but buried.
Step 2: BF16 reference comparison
Built a reference path that dequantizes FP4→BF16 and runs a standard matmul. Compared to the CUTLASS GEMM output. Result: cosine ≈ 0 across all 8 TP ranks — the GEMM output was essentially uncorrelated with the correct answer.
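The metrics used in this comparison (cosine, mse, amax) are simple to state. The following is a minimal pure-Python sketch of them; the function names and sample vectors are hypothetical, and the real diagnostic presumably computed the same quantities on tensors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for degenerate inputs
    return dot / (norm_a * norm_b)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def amax(a):
    """Largest absolute value in the vector."""
    return max(abs(x) for x in a)


# Hypothetical data: `scaled` has the right direction but the wrong magnitude,
# `scrambled` has the same magnitudes as `ref` but an unrelated direction.
ref = [1.0, -2.0, 3.0, -4.0]
scaled = [2.0 * x for x in ref]
scrambled = [2.0, 1.0, 4.0, 3.0]

for name, out in (("scaled", scaled), ("scrambled", scrambled)):
    print(f"[{name}] cosine={cosine_similarity(out, ref): .6f} "
          f"mse={mse(out, ref):.4e} out_amax={amax(out)} ref_amax={amax(ref)}")
```

The `scrambled` case mirrors the symptom pattern in this log: amax matches the reference while cosine is zero, so magnitude checks alone would not flag it.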
Step 3: Standalone GEMM tests
- All-ones data (M=1, N=32, K=32): cosine = 1.0 ✅
- Random data (M=1, N=32, K=32): cosine ≈ 0.2 ❌
- Random data (M=128, N=6144, K=7168): cosine ≈ 0 ❌
The all-ones test passing proved the GEMM math and data layout were correct. Random data failing proved the SF handling was broken for non-uniform values.
Step 4: Found the bug
The .cu file had a comment on lines 114-115 explicitly warning: "Allocation must use cute::cosize() (physical size including tile padding), not cute::size() (logical size)." All allocation sites used cosize correctly, but the iteration bound in the remap kernel (line 128) still used size. That was the one line we missed in the earlier size→cosize audit.
Hypotheses Investigated
1. ❌ NaN/Inf in GEMM
Ruled out. All outputs finite, no NaN detected at any stage.
2. ❌ Weight shape mismatch
Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).
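The shape check can be written down explicitly. This sketch assumes the middle dimension of each weight tensor is K packed two E2M1 values per byte, and that there is one scale factor per 16-element block along K; the block size of 16 is an assumption that the numbers above happen to satisfy, and `check_layer` is a hypothetical helper.

```python
# Shapes copied from the log: weights are (experts, packed_K, N),
# scale factors are (experts, K_blocks, N).
SHAPES = {
    "L1": {"w": (48, 3584, 6144), "sf": (48, 448, 6144)},
    "L2": {"w": (48, 1536, 7168), "sf": (48, 192, 7168)},
}

VALUES_PER_BYTE = 2   # two 4-bit E2M1 codes per uint8
SF_BLOCK = 16         # assumed number of K elements sharing one scale factor


def check_layer(name, w_shape, sf_shape):
    """Return (K, N) if the scale-factor shape matches the weight shape."""
    experts, packed_k, n = w_shape
    k = packed_k * VALUES_PER_BYTE
    if k % SF_BLOCK != 0:
        raise ValueError(f"{name}: K={k} is not a multiple of {SF_BLOCK}")
    expected_sf = (experts, k // SF_BLOCK, n)
    if sf_shape != expected_sf:
        raise ValueError(f"{name}: sf shape {sf_shape} != expected {expected_sf}")
    return k, n


l1_k, l1_n = check_layer("L1", SHAPES["L1"]["w"], SHAPES["L1"]["sf"])
l2_k, l2_n = check_layer("L2", SHAPES["L2"]["w"], SHAPES["L2"]["sf"])

# Cross-layer checks: L1 emits gate and up side by side, L2 maps back to hidden.
if l1_n != 2 * l2_k:
    raise ValueError("L1 output width is not twice the L2 input width")
if l2_n != l1_k:
    raise ValueError("L2 output width does not match the L1 input width")

print(f"L1: K={l1_k}, N={l1_n}; L2: K={l2_k}, N={l2_n}")
```

The derived L1 dimensions (K=7168, N=6144) match the large standalone GEMM test listed under Step 3.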
3. ❌ Global scale folding precision loss
Previously identified (commit da5572f). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision. Fixed by passing global scales as per-expert alpha. Did not fix the garbage output (wrong root cause).
4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)
Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation. Standalone test showed cosine=1.0 but only with uniform SF data.
5. ❌ E2M1 packing convention mismatch
Investigated but ruled out. Both stage_activation and checkpoint weights use the same packing (even→low nibble, odd→high nibble). The all-ones test proved packing is correct.
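For reference, the packing convention described above (even index in the low nibble, odd index in the high nibble) can be sketched as follows. The helper names are hypothetical; the value table is the standard E2M1 encoding (1 sign bit, 2 exponent bits, 1 mantissa bit).

```python
# E2M1 magnitudes indexed by the low 3 bits of the code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def pack_nibbles(codes) -> bytes:
    """Pack 4-bit codes: even index -> low nibble, odd index -> high nibble."""
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


def unpack_nibbles(packed: bytes) -> list:
    """Inverse of pack_nibbles: low nibble first, then high nibble."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes


codes = [0x2, 0xA, 0x7, 0x1]            # 1.0, -1.0, 6.0, 0.5
packed = pack_nibbles(codes)
decoded = [decode_e2m1(c) for c in unpack_nibbles(packed)]

print(packed.hex(), decoded)
```

One property of this encoding is worth keeping in mind: 1.0 encodes as 0x2, so if every element is 1.0 each byte is 0x22 under either nibble order. A test with distinct neighboring values, like the one above, is what distinguishes the two conventions.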
6. 🔍 Attention output corruption from o_a_proj quantization
Status: Deferred. The checkpoint has o_a_proj.weight as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time. This is a potential source of quality loss but is NOT the cause of the garbage output (the GEMM bug was). May revisit for quality optimization after the kernel fix is confirmed.
7. ✅ BF16 reference comparison — COSINE ≈ 0
Status: CONFIRMED. Cosine similarity ≈ 0 between NVFP4 GEMM and BF16 dequantized reference across all 8 TP ranks. This proved the problem was in the CUTLASS GEMM itself, not upstream.
[TP0] cosine=-0.001789 mse=1.0201e+01 nvfp4_amax=8.5625 ref_amax=8.0000
[TP1] cosine= 0.030470 mse=1.0157e+01 nvfp4_amax=8.0625 ref_amax=8.3125
[TP2] cosine=-0.009217 mse=9.5978e+00 nvfp4_amax=9.1875 ref_amax=7.5312
[TP3] cosine= 0.001786 mse=9.4161e+00 nvfp4_amax=8.6875 ref_amax=8.8750
[TP4] cosine= 0.007108 mse=7.5709e+00 nvfp4_amax=7.3125 ref_amax=8.8750
[TP5] cosine=-0.000572 mse=7.8290e+00 nvfp4_amax=7.5938 ref_amax=10.562
[TP6] cosine= 0.012143 mse=9.2720e+00 nvfp4_amax=8.0000 ref_amax=8.1250
[TP7] cosine=-0.010009 mse=9.0296e+00 nvfp4_amax=6.6250 ref_amax=36.500
8. ✅ CUTLASS SF remap size vs cosize bug — ROOT CAUSE
Status: FIXED (commit c384198). The SF remap kernel iterated over cute::size() (logical) instead of cute::cosize() (physical with tile padding). Tile-padding positions in the CUTLASS interleaved SF layout were never written and stayed zero. With uniform SF (all-ones test) the bug was invisible. With non-uniform SF (real data) it produced cosine ≈ 0.
How we proved it:
- All-ones GEMM test (M=1, N=32, K=32): cosine = 1.0
- Random data GEMM test (M=1, N=32, K=32): cosine ≈ 0.2
- Random data sweep (multiple dimensions): cosine ≈ 0 everywhere
- The only difference: uniform vs non-uniform SF values → SF remap is the culprit
- Found cute::size on line 128 even though the comment explicitly said to use cute::cosize
Key Commits
| Commit | Description |
|---|---|
| da5572f | Stop folding global scale into float8 block scales (precision loss fix) |
| d0ed3d8 | Add L2, SiLU, and scatter pipeline prints |
| 995589a | Add FP4 quantization round-trip diagnostic |
| c421a66 | Add BF16 reference GEMM + cosine comparison for L1 |
| 2fd55a9 | Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |
| 9159cb6 | Add DEBUG_LOG.md documentation |
| de8acc7 | Dump raw GEMM inputs + first 8 output values |
| 755f9ad | Fix per_expert_alpha ref + clean up BF16 reference scaling |
| df916b8 | Fix gs.item() for multi-element tensor |
| 7739674 | Fix gs scalar conversion with .cpu().tolist() + add traceback |
| 1b63a46 | Update DEBUG_LOG with cosine≈0 finding |
| fee5a97 | Fix cosine_similarity dim for M>0 |
| f9330a1 | Standalone M=1 GEMM test with deterministic data |
| 363dd89 | Dimension sweep to isolate GEMM bug |
| 60f7f60 | Ultra-minimal GEMM with all-ones (cosine=1.0!) |
| 67dcfa8 | Random data at small dims + alpha sweep |
| c384198 | FIX: SF remap uses cute::cosize() instead of cute::size() |
Bugs Fixed During This Debug Session
🔴 ROOT CAUSE: SF remap size vs cosize (commit c384198)
Bug: In cutlass_nvfp4_gemm.cu line 128, the SF remap kernel used cute::size(layout_sf) as the iteration bound instead of cute::cosize(layout_sf). size returns the logical element count; cosize returns the physical extent, including tile padding. The destination buffer was correctly allocated with cosize elements and zero-initialized, but the kernel only wrote to size positions, leaving tile-padding positions as zero.
Why it was missed in the previous audit: We changed all allocation sites from size to cosize (lines 179, 180, 232, 246, 287). The comment on lines 114-115 explicitly warned about this. But the iteration bound in the remap kernel itself (line 128) was overlooked — it was a different context (kernel launch parameter, not buffer allocation).
Why the standalone test passed: The previous standalone test used a single global scale for all blocks, producing uniform SF values. When all SF values are identical, missing writes don't matter — every position gets the same value regardless of which positions are written. The all-ones test in this session (M=1, N=32, K=32, cosine=1.0) confirmed this.
Fix: int total = cute::size(layout_sf); → int total = cute::cosize(layout_sf);
Impact: This was the root cause of all garbage output. Every GEMM call with non-uniform SF values was producing scrambled results.
Weight nibble unpack reshape bug (commit 2fd55a9)
Bug: In the BF16 reference diagnostic, calling reshape(K_half, -1) on the unpacked 2D weight flattened the N dimension.
Fix: Use reshape(K_half*2, N) instead.
Impact: Only diagnostic code.
BF16 reference diagnostic: multiple bugs (commits c421a66→7739674)
- Weight reshape: reshape(K_half, -1) → reshape(K_half*2, N)
- per_expert_alpha not defined: reference code ran before alpha was computed
- gs.item() on multi-element tensor: gs is shape (2,); fixed with .cpu().tolist()
- igs double-count: multiplying by igs in both x_bf16 and final output
Impact: All bugs only in diagnostic code.
Architecture Notes
DeepSeek-V4 MoE Layer Forward Pass
residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x) ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)
residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x) ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)
NVFP4 MoE Pipeline
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
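The two scalar pieces of this pipeline, the per-GEMM alpha and the SiLU(gate) * up activation, can be sketched in plain Python. The function names are hypothetical, the sample numbers are made up, and gate and up are passed separately because this log does not say how the two halves are laid out inside gate_up.

```python
import math


def gemm_alpha(input_global_scale: float, weight_global_scale: float) -> float:
    """Alpha applied to a GEMM output: igs * global_sf, as in the pipeline above."""
    return input_global_scale * weight_global_scale


def silu(x: float) -> float:
    """SiLU(x) = x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def silu_mul(gate, up):
    """Elementwise SiLU(gate) * up for two equal-length vectors."""
    if len(gate) != len(up):
        raise ValueError("gate and up must have the same length")
    return [silu(g) * u for g, u in zip(gate, up)]


# Hypothetical values for illustration only.
alpha = gemm_alpha(0.5, 0.25)
raw_gate = [0.0, 8.0, -8.0]          # unscaled accumulator values
raw_up = [40.0, 16.0, 16.0]
gate = [alpha * g for g in raw_gate]
up = [alpha * u for u in raw_up]
activated = silu_mul(gate, up)

print("alpha =", alpha)
print("activated =", activated)
```

Passing the global scales as alpha keeps them in float32 outside the block scales, which is the arrangement introduced by commit da5572f.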
Checkpoint Layers (layer 0)
- MoE experts 0-210, 212-255: gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
- Expert 211: shared expert, gate_proj + up_proj only (no down_proj)
- o_a_proj.weight: BF16 (16384, 4096) — NOT quantized by ModelOpt
- o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor: NVFP4
- Gate weight, norms, sinks, position_bias: BF16
Next Steps
- Rebuild container with cosize fix: Mike rebuilds with commit c384198
- Run deterministic prompt: "The capital of France is" should produce "Paris"
- Run standalone random GEMM test — should now show cosine ≈ 1.0 with random data
- If output is still off: investigate o_a_proj BF16→NVFP4 quantization (hypothesis #6)
- Once working: clean up debug prints from production code