12 KiB
NVFP4 MegaMoE Debug Log
Current State (May 15, 2026)
Status: Second root cause identified — SF remap coordinate extraction has M/K swapped. Awaiting rebuild and test.
Root Cause #1 (partially fixed): cute::size vs cute::cosize (commit c384198)
The SF remap kernel used cute::size(layout_sf) as the iteration bound instead of cute::cosize(layout_sf). This left tile-padding positions unwritten (zero). Fix: one-line change size → cosize. However, this fix alone did NOT resolve the cosine ≈ 0 problem — random data still produced garbage.
Root Cause #2 (current): M/K coordinates swapped in SF remap (commit deb6b32)
After the cosize fix failed to resolve the issue, we ran deeper diagnostics:
- All-ones test (M=1, N=32, K=32): cosine = 1.0 ✅ (uniform SF masks any coordinate bug)
- Random data (same dimensions): cosine ≈ 0.2 ❌
- Isolated SFA and SFB remap: both broken (cosine 0.16 and 0.21 respectively)
The remap kernel's coordinate extraction assumed get<0..2> = M group and get<4..5> = K group. But analysis of the CUTLASS Sm1xxBlockScaledConfig layout reveals the opposite: the SfAtom is K-major with Step<_2,_1>, meaning the first atom dimension tiles along K (problem dim 1) and the second tiles along M (problem dim 0). So get<0..2> = K group, get<3..5> = M group.
Previous (wrong):
m = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
k_sf = get<4>(flat) + get<5>(flat) * 4;
Fixed (commit deb6b32):
k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
m = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);
Also added printf diagnostics in the remap kernel to print the first 10 coordinate mappings, so we can verify the extraction at runtime.
Why the M/K swap produces cosine ≈ 0 instead of just a permuted output: The source SF data is row-major (M, K_sf) for SFA. If we read src[wrong_m * K_sf + wrong_k_sf] instead of src[m * K_sf + k_sf], and the wrong indices don't correspond to valid source positions, we get completely unrelated SF values. This corrupts the per-block scaling, making the GEMM output essentially random relative to the correct answer.
How We Found It
Step 1: Pipeline trace
Added debug prints at every stage (L1 GEMM, SiLU, L2 GEMM, scatter). All magnitudes reasonable, no NaN. The signal was present but buried.
Step 2: BF16 reference comparison
Built a reference path that dequantizes FP4→BF16 and runs a standard matmul. Compared to the CUTLASS GEMM output. Result: cosine ≈ 0 across all 8 TP ranks — the GEMM output was essentially uncorrelated with the correct answer.
Step 3: Standalone GEMM tests
- All-ones data (M=1, N=32, K=32): cosine = 1.0 ✅
- Random data (M=1, N=32, K=32): cosine ≈ 0.2 ❌
- Random data (M=128, N=6144, K=7168): cosine ≈ 0 ❌
The all-ones test passing proved the GEMM math and data layout were correct. Random data failing proved the SF handling was broken for non-uniform values.
Step 4: Found the bug
The CU file had a comment on lines 114-115 explicitly warning: "Allocation must use cute::cosize() (physical size including tile padding), not cute::size() (logical size)." All allocation sites used cosize correctly. But the iteration bound in the remap kernel (line 128) used size. One line we missed when we previously audited size→cosize.
Hypotheses Investigated
1. ❌ NaN/Inf in GEMM
Ruled out. All outputs finite, no NaN detected at any stage.
2. ❌ Weight shape mismatch
Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).
3. ❌ Global scale folding precision loss
Previously identified (commit da5572f). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision. Fixed by passing global scales as per-expert alpha. Did not fix the garbage output (wrong root cause).
4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)
Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation. Standalone test showed cosine=1.0 but only with uniform SF data.
5. ❌ E2M1 packing convention mismatch
Investigated but ruled out. Both stage_activation and checkpoint weights use the same packing (even→low nibble, odd→high nibble). The all-ones test proved packing is correct.
6. 🔍 Attention output corruption from o_a_proj quantization
Status: Deferred. The checkpoint has o_a_proj.weight as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time. This is a potential source of quality loss but is NOT the cause of the garbage output (the GEMM bug was). May revisit for quality optimization after the kernel fix is confirmed.
7. ✅ BF16 reference comparison — COSINE ≈ 0
Status: CONFIRMED. Cosine similarity ≈ 0 between NVFP4 GEMM and BF16 dequantized reference across all 8 TP ranks. This proved the problem was in the CUTLASS GEMM itself, not upstream.
[TP0] cosine=-0.001789 mse=1.0201e+01 nvfp4_amax=8.5625 ref_amax=8.0000
[TP1] cosine= 0.030470 mse=1.0157e+01 nvfp4_amax=8.0625 ref_amax=8.3125
[TP2] cosine=-0.009217 mse=9.5978e+00 nvfp4_amax=9.1875 ref_amax=7.5312
[TP3] cosine= 0.001786 mse=9.4161e+00 nvfp4_amax=8.6875 ref_amax=8.8750
[TP4] cosine= 0.007108 mse=7.5709e+00 nvfp4_amax=7.3125 ref_amax=8.8750
[TP5] cosine=-0.000572 mse=7.8290e+00 nvfp4_amax=7.5938 ref_amax=10.562
[TP6] cosine= 0.012143 mse=9.2720e+00 nvfp4_amax=8.0000 ref_amax=8.1250
[TP7] cosine=-0.010009 mse=9.0296e+00 nvfp4_amax=6.6250 ref_amax=36.500
8. ✅ CUTLASS SF remap size vs cosize bug (commit c384198) — partial fix
Status: Fixed but insufficient. Changing size to cosize was necessary (tile-padding positions were unwritten) but did NOT resolve the cosine ≈ 0 problem. The real issue was the M/K swap in coordinate extraction (hypothesis #9).
9. ✅ SF remap M/K coordinate swap — ROOT CAUSE (commit deb6b32)
Status: FIXED, awaiting rebuild verification. The SF remap kernel had M and K coordinates swapped in the flattened coordinate extraction. The CUTLASS Sm1xxBlockScaledConfig uses a K-major SfAtom with Step<_2,_1>, meaning get<0..2> maps to the K dimension and get<3..5> maps to the M dimension. Our code had it backwards.
How we proved it:
cosizefix alone didn't resolve cosine ≈ 0- All-ones test (uniform SF) still passed — coordinate bugs are invisible with uniform data
- Isolated SFA vs SFB: both broken (cosine 0.16, 0.21)
- Analyzed CUTLASS source:
Sm1xxBlockScaledBasicChunkusesSfKMajorAtomwhere first group = K, second = M - Added printf diagnostics to verify at runtime
Key Commits
| Commit | Description |
|---|---|
da5572f |
Stop folding global scale into float8 block scales (precision loss fix) |
d0ed3d8 |
Add L2, SiLU, and scatter pipeline prints |
995589a |
Add FP4 quantization round-trip diagnostic |
c421a66 |
Add BF16 reference GEMM + cosine comparison for L1 |
2fd55a9 |
Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |
9159cb6 |
Add DEBUG_LOG.md documentation |
de8acc7 |
Dump raw GEMM inputs + first 8 output values |
755f9ad |
Fix per_expert_alpha ref + clean up BF16 reference scaling |
df916b8 |
Fix gs.item() for multi-element tensor |
7739674 |
Fix gs scalar conversion with .cpu().tolist() + add traceback |
1b63a46 |
Update DEBUG_LOG with cosine≈0 finding |
fee5a97 |
Fix cosine_similarity dim for M>0 |
f9330a1 |
Standalone M=1 GEMM test with deterministic data |
363dd89 |
Dimension sweep to isolate GEMM bug |
60f7f60 |
Ultra-minimal GEMM with all-ones (cosine=1.0!) |
67dcfa8 |
Random data at small dims + alpha sweep |
c384198 |
Fix: SF remap uses cute::cosize() instead of cute::size() |
deb6b32 |
FIX: Swap M/K in SF remap coordinate extraction + add printf diagnostics |
Bugs Fixed During This Debug Session
🔴 ROOT CAUSE: SF remap M/K coordinate swap (commit deb6b32)
Bug: The SF remap kernel in cutlass_nvfp4_gemm.cu had M and K coordinates swapped in the flattened coordinate extraction. The code assumed get<0..2> = M group and get<4..5> = K group, but the CUTLASS SfKMajorAtom layout has K first and M second (K-major, with Step<_2,_1> tiling).
Previous (wrong):
m = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
k_sf = get<4>(flat) + get<5>(flat) * 4;
Fixed:
k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
m = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);
Why the original code looked correct: The comment said ((32, 4, n_m_tiles), (16, 4, n_k_tiles)) — M first, K second. But this is the logical M/K assignment, not the physical flattened order. The actual CUTE layout for K-major SF puts the K group first in the flattened coordinate.
Impact: Every SF value was read from src[wrong_m * K_sf + wrong_k_sf] instead of src[m * K_sf + k_sf], producing completely unrelated scale factors. The GEMM output was essentially random (cosine ≈ 0).
SF remap size vs cosize (commit c384198) — necessary but insufficient
Bug: Iteration bound used cute::size (logical) instead of cute::cosize (physical). Tile-padding positions were never written.
Impact: With uniform SF, invisible. With non-uniform SF, additional corruption on top of the M/K swap bug. Both fixes are needed.
Weight nibble unpack reshape bug (commit 2fd55a9)
Bug: In the BF16 reference diagnostic, reshape(K_half, -1) on 2D weight flattened N dimension.
Fix: reshape(K_half*2, N).
Impact: Only diagnostic code.
BF16 reference diagnostic: multiple bugs (commits c421a66→7739674)
- Weight reshape:
reshape(K_half, -1)→reshape(K_half*2, N) - per_expert_alpha not defined: reference code ran before alpha was computed
- gs.item() on multi-element tensor:
gsis shape (2,); fixed with.cpu().tolist() - igs double-count: multiplying by igs in both x_bf16 and final output
Impact: All bugs only in diagnostic code.
Architecture Notes
DeepSeek-V4 MoE Layer Forward Pass
residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x) ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)
residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x) ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)
NVFP4 MoE Pipeline
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
Checkpoint Layers (layer 0)
- MoE experts 0-210, 212-255: gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
- Expert 211: shared expert, gate_proj + up_proj only (no down_proj)
- o_a_proj.weight: BF16 (16384, 4096) — NOT quantized by ModelOpt
- o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor: NVFP4
- Gate weight, norms, sinks, position_bias: BF16
Next Steps
- Rebuild container with M/K swap fix — Mike rebuilds with commit
deb6b32 - Run standalone random GEMM test — should now show cosine ≈ 1.0 with random data
- Check printf diagnostics — verify the coordinate mapping is correct
- Run deterministic prompt — "The capital of France is" should produce "Paris"
- If output is still off: the M/K swap fix may need refinement — the
mstride calculation (InputSFVectorSize * 4) may not be correct for all cases - Once working: remove printf diagnostics from production code, clean up debug prints
- Quality optimization: investigate o_a_proj BF16→NVFP4 quantization (hypothesis #6)