# NVFP4 MegaMoE Debug Log

## Current State (May 15, 2026)

**Status:** Second root cause identified: SF remap coordinate extraction has M/K swapped. Awaiting rebuild and test.

### Root Cause #1 (partially fixed): `cute::size` vs `cute::cosize` (commit `c384198`)

The SF remap kernel used `cute::size(layout_sf)` as the iteration bound instead of `cute::cosize(layout_sf)`. This left tile-padding positions unwritten (zero). Fix: one-line change `size` → `cosize`.

However, this fix alone did NOT resolve the cosine ≈ 0 problem: random data still produced garbage.

### Root Cause #2 (current): M/K coordinates swapped in SF remap (commit `deb6b32`)

After the cosize fix failed to resolve the issue, we ran deeper diagnostics:

- **All-ones test (M=1, N=32, K=32):** cosine = 1.0 ✅ (uniform SF masks any coordinate bug)
- **Random data (same dimensions):** cosine ≈ 0.2 ❌
- **Isolated SFA and SFB remap:** both broken (cosine 0.16 and 0.21 respectively)

The remap kernel's coordinate extraction assumed `get<0..2>` = M group and `get<4..5>` = K group. But analysis of the CUTLASS `Sm1xxBlockScaledConfig` layout reveals the opposite: the SfAtom is K-major with `Step<_2,_1>`, meaning the first atom dimension tiles along K (problem dim 1) and the second tiles along M (problem dim 0). So `get<0..2>` = K group, `get<3..5>` = M group.

**Previous (wrong):**

```cpp
m    = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
k_sf = get<4>(flat) + get<5>(flat) * 4;
```

**Fixed (commit `deb6b32`):**

```cpp
k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
m    = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);
```

Also added printf diagnostics in the remap kernel to print the first 10 coordinate mappings, so we can verify the extraction at runtime.

**Why the M/K swap produces cosine ≈ 0 instead of just a permuted output:** The source SF data is row-major `(M, K_sf)` for SFA. If we read `src[wrong_m * K_sf + wrong_k_sf]` instead of `src[m * K_sf + k_sf]`, and the wrong indices don't correspond to valid source positions, we get completely unrelated SF values. This corrupts the per-block scaling, making the GEMM output essentially random relative to the correct answer.

## How We Found It

### Step 1: Pipeline trace

Added debug prints at every stage (L1 GEMM, SiLU, L2 GEMM, scatter). All magnitudes reasonable, no NaN. The signal was present but buried.

### Step 2: BF16 reference comparison

Built a reference path that dequantizes FP4→BF16 and runs a standard matmul. Compared to the CUTLASS GEMM output.

**Result: cosine ≈ 0** across all 8 TP ranks. The GEMM output was essentially uncorrelated with the correct answer.

### Step 3: Standalone GEMM tests

- **All-ones data** (M=1, N=32, K=32): cosine = 1.0 ✅
- **Random data** (M=1, N=32, K=32): cosine ≈ 0.2 ❌
- **Random data** (M=128, N=6144, K=7168): cosine ≈ 0 ❌

The all-ones test passing proved the GEMM math and data layout were correct. Random data failing proved the SF handling was broken for non-uniform values.

### Step 4: Found the first bug (`size` vs `cosize`)

The CU file had a comment on lines 114-115 explicitly warning: "Allocation must use cute::cosize() (physical size including tile padding), not cute::size() (logical size)." All allocation sites used `cosize` correctly. But the **iteration bound** in the remap kernel (line 128) used `size`. One line we missed when we previously audited size→cosize.

This is Root Cause #1. Fixing it was not enough; the M/K swap (Root Cause #2) was found afterwards through the diagnostics listed under Current State.

## Hypotheses Investigated

### 1. ❌ NaN/Inf in GEMM

Ruled out. All outputs finite, no NaN detected at any stage.

### 2. ❌ Weight shape mismatch

Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).

### 3. ❌ Global scale folding precision loss

Previously identified (commit `da5572f`). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision. Fixed by passing global scales as per-expert alpha. Did not fix the garbage output (wrong root cause).

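
The folding loss described in hypothesis #3 can be illustrated with a small stdlib-only sketch. It assumes the block scales use the float8 E4M3 "fn" format (3 mantissa bits, largest finite value 448); `round_to_e4m3`, `folding_error`, and the value `global_sf = 0.0123` are hypothetical names and numbers chosen for illustration, not taken from the project. It does not attempt to reproduce the ~25% figure, which depends on the real scale distribution.

```python
import math

E4M3_MAX = 448.0      # largest finite float8 E4M3 (fn) value
E4M3_MIN_EXP = -6     # exponent of the smallest normal value


def round_to_e4m3(x: float) -> float:
    """Round a non-negative float to the nearest E4M3 value (saturating)."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    # frexp gives x = m * 2**ex with 0.5 <= m < 1, so the binade exponent is ex - 1.
    exp = max(math.frexp(x)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - 3)   # 3 mantissa bits: 8 steps per binade
    return min(round(x / step) * step, E4M3_MAX)


def folding_error(block_sf: float, global_sf: float) -> float:
    """Relative error from storing block_sf * global_sf as one E4M3 value."""
    exact = block_sf * global_sf     # block scale kept in float8, global scale applied as alpha
    folded = round_to_e4m3(exact)    # product squeezed back into float8
    return abs(folded - exact) / exact


if __name__ == "__main__":
    global_sf = 0.0123  # hypothetical float32 global scale
    # Block scales below are all exactly representable in E4M3.
    for block_sf in (0.5, 0.8125, 1.0, 1.75, 6.0):
        err = folding_error(block_sf, global_sf)
        print(f"block_sf={block_sf:<7} rel_err={err:.4f}")
```

Keeping the global scale out of the float8 value, as the per-expert alpha fix does, avoids this second rounding step entirely: the block scale is stored exactly and the float32 factor is applied once at the end.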
### 4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)

Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation. Standalone test showed cosine=1.0 but only with uniform SF data.

### 5. ❌ E2M1 packing convention mismatch

Investigated but ruled out. Both `stage_activation` and checkpoint weights use the same packing (even→low nibble, odd→high nibble). The all-ones test proved packing is correct.

### 6. 🔍 Attention output corruption from o_a_proj quantization

**Status: Deferred.** The checkpoint has `o_a_proj.weight` as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time. This is a potential source of quality loss but is NOT the cause of the garbage output (the GEMM bug was). May revisit for quality optimization after the kernel fix is confirmed.

### 7. ✅ BF16 reference comparison: COSINE ≈ 0

**Status: CONFIRMED.** Cosine similarity ≈ 0 between NVFP4 GEMM and BF16 dequantized reference across all 8 TP ranks. This proved the problem was in the CUTLASS GEMM itself, not upstream.

```
[TP0] cosine=-0.001789 mse=1.0201e+01 nvfp4_amax=8.5625 ref_amax=8.0000
[TP1] cosine= 0.030470 mse=1.0157e+01 nvfp4_amax=8.0625 ref_amax=8.3125
[TP2] cosine=-0.009217 mse=9.5978e+00 nvfp4_amax=9.1875 ref_amax=7.5312
[TP3] cosine= 0.001786 mse=9.4161e+00 nvfp4_amax=8.6875 ref_amax=8.8750
[TP4] cosine= 0.007108 mse=7.5709e+00 nvfp4_amax=7.3125 ref_amax=8.8750
[TP5] cosine=-0.000572 mse=7.8290e+00 nvfp4_amax=7.5938 ref_amax=10.562
[TP6] cosine= 0.012143 mse=9.2720e+00 nvfp4_amax=8.0000 ref_amax=8.1250
[TP7] cosine=-0.010009 mse=9.0296e+00 nvfp4_amax=6.6250 ref_amax=36.500
```

### 8. ✅ CUTLASS SF remap `size` vs `cosize` bug (commit `c384198`): partial fix

**Status: Fixed but insufficient.** Changing `size` to `cosize` was necessary (tile-padding positions were unwritten) but did NOT resolve the cosine ≈ 0 problem. The real issue was the M/K swap in coordinate extraction (hypothesis #9).

### 9. ✅ SF remap M/K coordinate swap: ROOT CAUSE (commit `deb6b32`)

**Status: FIXED, awaiting rebuild verification.** The SF remap kernel had M and K coordinates swapped in the flattened coordinate extraction. The CUTLASS `Sm1xxBlockScaledConfig` uses a K-major SfAtom with `Step<_2,_1>`, meaning `get<0..2>` maps to the K dimension and `get<3..5>` maps to the M dimension. Our code had it backwards.

**How we proved it:**

1. `cosize` fix alone didn't resolve cosine ≈ 0
2. All-ones test (uniform SF) still passed: coordinate bugs are invisible with uniform data
3. Isolated SFA vs SFB: both broken (cosine 0.16, 0.21)
4. Analyzed CUTLASS source: `Sm1xxBlockScaledBasicChunk` uses `SfKMajorAtom` where first group = K, second = M
5. Added printf diagnostics to verify at runtime

## Key Commits

| Commit | Description |
|--------|-------------|
| `da5572f` | Stop folding global scale into float8 block scales (precision loss fix) |
| `d0ed3d8` | Add L2, SiLU, and scatter pipeline prints |
| `995589a` | Add FP4 quantization round-trip diagnostic |
| `c421a66` | Add BF16 reference GEMM + cosine comparison for L1 |
| `2fd55a9` | Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |
| `9159cb6` | Add DEBUG_LOG.md documentation |
| `de8acc7` | Dump raw GEMM inputs + first 8 output values |
| `755f9ad` | Fix per_expert_alpha ref + clean up BF16 reference scaling |
| `df916b8` | Fix gs.item() for multi-element tensor |
| `7739674` | Fix gs scalar conversion with .cpu().tolist() + add traceback |
| `1b63a46` | Update DEBUG_LOG with cosine≈0 finding |
| `fee5a97` | Fix cosine_similarity dim for M>0 |
| `f9330a1` | Standalone M=1 GEMM test with deterministic data |
| `363dd89` | Dimension sweep to isolate GEMM bug |
| `60f7f60` | Ultra-minimal GEMM with all-ones (cosine=1.0!) |
| `67dcfa8` | Random data at small dims + alpha sweep |
| `c384198` | Fix: SF remap uses cute::cosize() instead of cute::size() |
| `deb6b32` | **FIX: Swap M/K in SF remap coordinate extraction + add printf diagnostics** |

## Bugs Fixed During This Debug Session

### 🔴 ROOT CAUSE: SF remap M/K coordinate swap (commit `deb6b32`)

**Bug:** The SF remap kernel in `cutlass_nvfp4_gemm.cu` had M and K coordinates swapped in the flattened coordinate extraction. The code assumed `get<0..2>` = M group and `get<4..5>` = K group, but the CUTLASS `SfKMajorAtom` layout has K first and M second (K-major, with `Step<_2,_1>` tiling).

**Previous (wrong):**

```cpp
m    = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
k_sf = get<4>(flat) + get<5>(flat) * 4;
```

**Fixed:**

```cpp
k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
m    = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);
```

**Why the original code looked correct:** The comment said `((32, 4, n_m_tiles), (16, 4, n_k_tiles))`, with M first and K second. But this is the *logical* M/K assignment, not the *physical* flattened order. The actual CUTE layout for K-major SF puts the K group first in the flattened coordinate.

**Impact:** Every SF value was read from `src[wrong_m * K_sf + wrong_k_sf]` instead of `src[m * K_sf + k_sf]`, producing completely unrelated scale factors. The GEMM output was essentially random (cosine ≈ 0).

### SF remap `size` vs `cosize` (commit `c384198`): necessary but insufficient

**Bug:** Iteration bound used `cute::size` (logical) instead of `cute::cosize` (physical). Tile-padding positions were never written.

**Impact:** With uniform SF, invisible. With non-uniform SF, additional corruption on top of the M/K swap bug. Both fixes are needed.

### Weight nibble unpack reshape bug (commit `2fd55a9`)

**Bug:** In the BF16 reference diagnostic, `reshape(K_half, -1)` on 2D weight flattened N dimension.

**Fix:** `reshape(K_half*2, N)`.

**Impact:** Only diagnostic code.

### BF16 reference diagnostic: multiple bugs (commits `c421a66`→`7739674`)

1. **Weight reshape:** `reshape(K_half, -1)` → `reshape(K_half*2, N)`
2. **per_expert_alpha not defined:** reference code ran before alpha was computed
3. **gs.item() on multi-element tensor:** `gs` is shape (2,); fixed with `.cpu().tolist()`
4. **igs double-count:** multiplying by igs in both x_bf16 and final output

**Impact:** All bugs only in diagnostic code.

## Architecture Notes

### DeepSeek-V4 MoE Layer Forward Pass

```
residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x)  ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)

residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x)  ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)
```

### NVFP4 MoE Pipeline

```
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
```

### Checkpoint Layers (layer 0)

- **MoE experts 0-210, 212-255:** gate_proj, up_proj, down_proj, all NVFP4 (uint8 + float8 scales + float32 global scale)
- **Expert 211:** shared expert, gate_proj + up_proj only (no down_proj)
- **o_a_proj.weight:** BF16 (16384, 4096), NOT quantized by ModelOpt
- **o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor:** NVFP4
- **Gate weight, norms, sinks, position_bias:** BF16

## Next Steps

1. **Rebuild container with M/K swap fix:** Mike rebuilds with commit `deb6b32`
2. **Run standalone random GEMM test:** should now show cosine ≈ 1.0 with random data
3. **Check printf diagnostics:** verify the coordinate mapping is correct
4. **Run deterministic prompt:** "The capital of France is" should produce "Paris"
5. **If output is still off:** the M/K swap fix may need refinement, since the `m` stride calculation (`InputSFVectorSize * 4`) may not be correct for all cases
6. **Once working:** remove printf diagnostics from production code, clean up debug prints
7. **Quality optimization:** investigate o_a_proj BF16→NVFP4 quantization (hypothesis #6)
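
For the standalone random GEMM test in step 2, it helps to remember why the all-ones test could not catch the swap. The sketch below is a toy model only, in stdlib Python rather than the real CUTLASS layout: scale factors live in a flat buffer that should be read row-major as `(N, K_sf)`, and the "swapped" path reads the same buffer as if it were `(K_sf, N)`. The names `scaled_gemv` and `swap_test`, the block size of 16, and the power-of-two scale values are all assumptions made for illustration.

```python
import math
import random

BLOCK = 16  # elements along K that share one scale factor (assumed)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scaled_gemv(x, w, sf_flat, n_rows, k, swapped=False):
    """y[n] = sum_k x[k] * w[n][k] * sf(n, k // BLOCK), with sf in a flat buffer.

    swapped=False reads the buffer row-major as (N, K_sf). swapped=True indexes
    the same buffer as (K_sf, N): a toy stand-in for an M/K coordinate mix-up.
    """
    k_sf = k // BLOCK
    y = []
    for n in range(n_rows):
        acc = 0.0
        for kk in range(k):
            ks = kk // BLOCK
            idx = ks * n_rows + n if swapped else n * k_sf + ks
            acc += x[kk] * w[n][kk] * sf_flat[idx]
        y.append(acc)
    return y


def swap_test(sf_flat, n_rows=32, k=32, seed=0):
    """Cosine between correctly indexed and swapped-index outputs."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(k)]
    w = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n_rows)]
    good = scaled_gemv(x, w, sf_flat, n_rows, k)
    bad = scaled_gemv(x, w, sf_flat, n_rows, k, swapped=True)
    return cosine(good, bad)


if __name__ == "__main__":
    n_sf = 32 * (32 // BLOCK)
    rng = random.Random(1)
    uniform_sf = [1.0] * n_sf
    random_sf = [2.0 ** rng.randint(-3, 3) for _ in range(n_sf)]
    print("uniform SF:", swap_test(uniform_sf))  # index bug is invisible
    print("random SF: ", swap_test(random_sf))   # index bug shows up
```

With uniform scales both indexing schemes read the same value everywhere, so the outputs match exactly; with non-uniform scales they diverge. Any post-fix verification should therefore use non-uniform scale factors, as the random-data test already does.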