nvfp4-megamoe-kernel/DEBUG_LOG.md

# NVFP4 MegaMoE Debug Log

## Current State (May 15, 2026)

**Status:** Second root cause identified — SF remap coordinate extraction has M/K swapped. Awaiting rebuild and test.

### Root Cause #1 (partially fixed): `cute::size` vs `cute::cosize` (commit `c384198`)

The SF remap kernel used `cute::size(layout_sf)` as the iteration bound instead of `cute::cosize(layout_sf)`. This left tile-padding positions unwritten (zero). Fix: one-line change `size` → `cosize`. However, this fix alone did NOT resolve the cosine ≈ 0 problem — random data still produced garbage.

### Root Cause #2 (current): M/K coordinates swapped in SF remap (commit `deb6b32`)

After the cosize fix failed to resolve the issue, we ran deeper diagnostics:
- **All-ones test (M=1, N=32, K=32):** cosine = 1.0 ✅ (uniform SF masks any coordinate bug)
- **Random data (same dimensions):** cosine ≈ 0.2 ❌
- **Isolated SFA and SFB remap:** both broken (cosine 0.16 and 0.21 respectively)

The remap kernel's coordinate extraction assumed `get<0..2>` = M group and `get<4..5>` = K group. But analysis of the CUTLASS `Sm1xxBlockScaledConfig` layout reveals the opposite: the SfAtom is K-major with `Step<_2,_1>`, meaning the first atom dimension tiles along K (problem dim 1) and the second tiles along M (problem dim 0). So `get<0..2>` = K group, `get<3..5>` = M group.

**Previous (wrong):**
```cpp
m = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
k_sf = get<4>(flat) + get<5>(flat) * 4;
```

**Fixed (commit `deb6b32`):**
```cpp
k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
m = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);
```

Also added printf diagnostics in the remap kernel to print the first 10 coordinate mappings, so we can verify the extraction at runtime.

**Why the M/K swap produces cosine ≈ 0 instead of just a permuted output:** The source SF data is row-major `(M, K_sf)` for SFA. If we read `src[wrong_m * K_sf + wrong_k_sf]` instead of `src[m * K_sf + k_sf]`, and the wrong indices don't correspond to valid source positions, we get completely unrelated SF values. This corrupts the per-block scaling, making the GEMM output essentially random relative to the correct answer.

## How We Found It

### Step 1: Pipeline trace
Added debug prints at every stage (L1 GEMM, SiLU, L2 GEMM, scatter). All magnitudes reasonable, no NaN. The signal was present but buried.

### Step 2: BF16 reference comparison
Built a reference path that dequantizes FP4→BF16 and runs a standard matmul. Compared to the CUTLASS GEMM output. **Result: cosine ≈ 0** across all 8 TP ranks — the GEMM output was essentially uncorrelated with the correct answer.

### Step 3: Standalone GEMM tests
- **All-ones data** (M=1, N=32, K=32): cosine = 1.0 ✅
- **Random data** (M=1, N=32, K=32): cosine ≈ 0.2 ❌
- **Random data** (M=128, N=6144, K=7168): cosine ≈ 0 ❌

The all-ones test passing proved the GEMM math and data layout were correct. Random data failing proved the SF handling was broken for non-uniform values.

### Step 4: Found the bug
The CU file had a comment on lines 114-115 explicitly warning: "Allocation must use cute::cosize() (physical size including tile padding), not cute::size() (logical size)." All allocation sites used `cosize` correctly. But the **iteration bound** in the remap kernel (line 128) used `size`. One line we missed when we previously audited size→cosize.

## Hypotheses Investigated

### 1. ❌ NaN/Inf in GEMM
Ruled out. All outputs finite, no NaN detected at any stage.

### 2. ❌ Weight shape mismatch
Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).

### 3. ❌ Global scale folding precision loss
Previously identified (commit `da5572f`). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision. Fixed by passing global scales as per-expert alpha. Did not fix the garbage output (wrong root cause).

### 4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)
Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation. Standalone test showed cosine=1.0 but only with uniform SF data.

### 5. ❌ E2M1 packing convention mismatch
Investigated but ruled out. Both `stage_activation` and checkpoint weights use the same packing (even→low nibble, odd→high nibble). The all-ones test proved packing is correct.

### 6. 🔍 Attention output corruption from o_a_proj quantization
**Status: Deferred.** The checkpoint has `o_a_proj.weight` as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time. This is a potential source of quality loss but is NOT the cause of the garbage output (the GEMM bug was). May revisit for quality optimization after the kernel fix is confirmed.

### 7. ✅ BF16 reference comparison — COSINE ≈ 0
**Status: CONFIRMED.** Cosine similarity ≈ 0 between NVFP4 GEMM and BF16 dequantized reference across all 8 TP ranks. This proved the problem was in the CUTLASS GEMM itself, not upstream.

```
[TP0] cosine=-0.001789  mse=1.0201e+01  nvfp4_amax=8.5625  ref_amax=8.0000
[TP1] cosine= 0.030470  mse=1.0157e+01  nvfp4_amax=8.0625  ref_amax=8.3125
[TP2] cosine=-0.009217  mse=9.5978e+00  nvfp4_amax=9.1875  ref_amax=7.5312
[TP3] cosine= 0.001786  mse=9.4161e+00  nvfp4_amax=8.6875  ref_amax=8.8750
[TP4] cosine= 0.007108  mse=7.5709e+00  nvfp4_amax=7.3125  ref_amax=8.8750
[TP5] cosine=-0.000572  mse=7.8290e+00  nvfp4_amax=7.5938  ref_amax=10.562
[TP6] cosine= 0.012143  mse=9.2720e+00  nvfp4_amax=8.0000  ref_amax=8.1250
[TP7] cosine=-0.010009  mse=9.0296e+00  nvfp4_amax=6.6250  ref_amax=36.500
```

### 8. ✅ CUTLASS SF remap `size` vs `cosize` bug (commit `c384198`) — partial fix
**Status: Fixed but insufficient.** Changing `size` to `cosize` was necessary (tile-padding positions were unwritten) but did NOT resolve the cosine ≈ 0 problem. The real issue was the M/K swap in coordinate extraction (hypothesis #9).

### 9. ✅ SF remap M/K coordinate swap — ROOT CAUSE (commit `deb6b32`)
**Status: FIXED, awaiting rebuild verification.** The SF remap kernel had M and K coordinates swapped in the flattened coordinate extraction. The CUTLASS `Sm1xxBlockScaledConfig` uses a K-major SfAtom with `Step<_2,_1>`, meaning `get<0..2>` maps to the K dimension and `get<3..5>` maps to the M dimension. Our code had it backwards.

**How we proved it:**
1. `cosize` fix alone didn't resolve cosine ≈ 0
2. All-ones test (uniform SF) still passed — coordinate bugs are invisible with uniform data
3. Isolated SFA vs SFB: both broken (cosine 0.16, 0.21)
4. Analyzed CUTLASS source: `Sm1xxBlockScaledBasicChunk` uses `SfKMajorAtom` where first group = K, second = M
5. Added printf diagnostics to verify at runtime

## Key Commits

| Commit | Description |
|--------|-------------|
| `da5572f` | Stop folding global scale into float8 block scales (precision loss fix) |
| `d0ed3d8` | Add L2, SiLU, and scatter pipeline prints |
| `995589a` | Add FP4 quantization round-trip diagnostic |
| `c421a66` | Add BF16 reference GEMM + cosine comparison for L1 |
| `2fd55a9` | Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |
| `9159cb6` | Add DEBUG_LOG.md documentation |
| `de8acc7` | Dump raw GEMM inputs + first 8 output values |
| `755f9ad` | Fix per_expert_alpha ref + clean up BF16 reference scaling |
| `df916b8` | Fix gs.item() for multi-element tensor |
| `7739674` | Fix gs scalar conversion with .cpu().tolist() + add traceback |
| `1b63a46` | Update DEBUG_LOG with cosine≈0 finding |
| `fee5a97` | Fix cosine_similarity dim for M>0 |
| `f9330a1` | Standalone M=1 GEMM test with deterministic data |
| `363dd89` | Dimension sweep to isolate GEMM bug |
| `60f7f60` | Ultra-minimal GEMM with all-ones (cosine=1.0!) |
| `67dcfa8` | Random data at small dims + alpha sweep |
| `c384198` | Fix: SF remap uses cute::cosize() instead of cute::size() |
| `deb6b32` | **FIX: Swap M/K in SF remap coordinate extraction + add printf diagnostics** |

## Bugs Fixed During This Debug Session

### 🔴 ROOT CAUSE: SF remap M/K coordinate swap (commit `deb6b32`)

**Bug:** The SF remap kernel in `cutlass_nvfp4_gemm.cu` had M and K coordinates swapped in the flattened coordinate extraction. The code assumed `get<0..2>` = M group and `get<4..5>` = K group, but the CUTLASS `SfKMajorAtom` layout has K first and M second (K-major, with `Step<_2,_1>` tiling).

**Previous (wrong):**
```cpp
m = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
k_sf = get<4>(flat) + get<5>(flat) * 4;
```

**Fixed:**
```cpp
k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
m = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);
```

**Why the original code looked correct:** The comment said `((32, 4, n_m_tiles), (16, 4, n_k_tiles))` — M first, K second. But this is the *logical* M/K assignment, not the *physical* flattened order. The actual CUTE layout for K-major SF puts the K group first in the flattened coordinate.

**Impact:** Every SF value was read from `src[wrong_m * K_sf + wrong_k_sf]` instead of `src[m * K_sf + k_sf]`, producing completely unrelated scale factors. The GEMM output was essentially random (cosine ≈ 0).

### SF remap `size` vs `cosize` (commit `c384198`) — necessary but insufficient

**Bug:** Iteration bound used `cute::size` (logical) instead of `cute::cosize` (physical). Tile-padding positions were never written.

**Impact:** With uniform SF, invisible. With non-uniform SF, additional corruption on top of the M/K swap bug. Both fixes are needed.

### Weight nibble unpack reshape bug (commit `2fd55a9`)

**Bug:** In the BF16 reference diagnostic, `reshape(K_half, -1)` on 2D weight flattened N dimension.

**Fix:** `reshape(K_half*2, N)`.

**Impact:** Only diagnostic code.

### BF16 reference diagnostic: multiple bugs (commits `c421a66`→`7739674`)

1. **Weight reshape:** `reshape(K_half, -1)` → `reshape(K_half*2, N)`
2. **per_expert_alpha not defined:** reference code ran before alpha was computed
3. **gs.item() on multi-element tensor:** `gs` is shape (2,); fixed with `.cpu().tolist()`
4. **igs double-count:** multiplying by igs in both x_bf16 and final output

**Impact:** All bugs only in diagnostic code.

## Architecture Notes

### DeepSeek-V4 MoE Layer Forward Pass
```
residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x)                          ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)

residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x)                           ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)
```

### NVFP4 MoE Pipeline
```
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
```

### Checkpoint Layers (layer 0)
- **MoE experts 0-210, 212-255:** gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
- **Expert 211:** shared expert, gate_proj + up_proj only (no down_proj)
- **o_a_proj.weight:** BF16 (16384, 4096) — NOT quantized by ModelOpt
- **o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor:** NVFP4
- **Gate weight, norms, sinks, position_bias:** BF16

## Next Steps

1. **Rebuild container with M/K swap fix** — Mike rebuilds with commit `deb6b32`
2. **Run standalone random GEMM test** — should now show cosine ≈ 1.0 with random data
3. **Check printf diagnostics** — verify the coordinate mapping is correct
4. **Run deterministic prompt** — "The capital of France is" should produce "Paris"
5. **If output is still off:** the M/K swap fix may need refinement — the `m` stride calculation (`InputSFVectorSize * 4`) may not be correct for all cases
6. **Once working:** remove printf diagnostics from production code, clean up debug prints
7. **Quality optimization:** investigate o_a_proj BF16→NVFP4 quantization (hypothesis #6)