Files
nvfp4-megamoe-kernel/DEBUG_LOG.md

212 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# NVFP4 MegaMoE Debug Log
## Current State (May 15, 2026)
**Status:** Second root cause identified — SF remap coordinate extraction has M/K swapped. Awaiting rebuild and test.
### Root Cause #1 (partially fixed): `cute::size` vs `cute::cosize` (commit `c384198`)
The SF remap kernel used `cute::size(layout_sf)` as the iteration bound instead of `cute::cosize(layout_sf)`. This left tile-padding positions unwritten (zero). Fix: one-line change `size``cosize`. However, this fix alone did NOT resolve the cosine ≈ 0 problem — random data still produced garbage.
### Root Cause #2 (current): M/K coordinates swapped in SF remap (commit `deb6b32`)
After the cosize fix failed to resolve the issue, we ran deeper diagnostics:
- **All-ones test (M=1, N=32, K=32):** cosine = 1.0 ✅ (uniform SF masks any coordinate bug)
- **Random data (same dimensions):** cosine ≈ 0.2 ❌
- **Isolated SFA and SFB remap:** both broken (cosine 0.16 and 0.21 respectively)
The remap kernel's coordinate extraction assumed `get<0..2>` = M group and `get<4..5>` = K group. But analysis of the CUTLASS `Sm1xxBlockScaledConfig` layout reveals the opposite: the SfAtom is K-major with `Step<_2,_1>`, meaning the first atom dimension tiles along K (problem dim 1) and the second tiles along M (problem dim 0). So `get<0..2>` = K group, `get<3..5>` = M group.
**Previous (wrong):**
```cpp
m = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
k_sf = get<4>(flat) + get<5>(flat) * 4;
```
**Fixed (commit `deb6b32`):**
```cpp
k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
m = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);
```
Also added printf diagnostics in the remap kernel to print the first 10 coordinate mappings, so we can verify the extraction at runtime.
**Why the M/K swap produces cosine ≈ 0 instead of just a permuted output:** The source SF data is row-major `(M, K_sf)` for SFA. If we read `src[wrong_m * K_sf + wrong_k_sf]` instead of `src[m * K_sf + k_sf]`, and the wrong indices don't correspond to valid source positions, we get completely unrelated SF values. This corrupts the per-block scaling, making the GEMM output essentially random relative to the correct answer.
## How We Found It
### Step 1: Pipeline trace
Added debug prints at every stage (L1 GEMM, SiLU, L2 GEMM, scatter). All magnitudes reasonable, no NaN. The signal was present but buried.
### Step 2: BF16 reference comparison
Built a reference path that dequantizes FP4→BF16 and runs a standard matmul. Compared to the CUTLASS GEMM output. **Result: cosine ≈ 0** across all 8 TP ranks — the GEMM output was essentially uncorrelated with the correct answer.
### Step 3: Standalone GEMM tests
- **All-ones data** (M=1, N=32, K=32): cosine = 1.0 ✅
- **Random data** (M=1, N=32, K=32): cosine ≈ 0.2 ❌
- **Random data** (M=128, N=6144, K=7168): cosine ≈ 0 ❌
The all-ones test passing proved the GEMM math and data layout were correct. Random data failing proved the SF handling was broken for non-uniform values.
### Step 4: Found the bug
The CU file had a comment on lines 114-115 explicitly warning: "Allocation must use cute::cosize() (physical size including tile padding), not cute::size() (logical size)." All allocation sites used `cosize` correctly. But the **iteration bound** in the remap kernel (line 128) used `size`. One line we missed when we previously audited size→cosize.
## Hypotheses Investigated
### 1. ❌ NaN/Inf in GEMM
Ruled out. All outputs finite, no NaN detected at any stage.
### 2. ❌ Weight shape mismatch
Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).
### 3. ❌ Global scale folding precision loss
Previously identified (commit `da5572f`). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision. Fixed by passing global scales as per-expert alpha. Did not fix the garbage output (wrong root cause).
### 4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)
Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation. Standalone test showed cosine=1.0 but only with uniform SF data.
### 5. ❌ E2M1 packing convention mismatch
Investigated but ruled out. Both `stage_activation` and checkpoint weights use the same packing (even→low nibble, odd→high nibble). The all-ones test proved packing is correct.
### 6. 🔍 Attention output corruption from o_a_proj quantization
**Status: Deferred.** The checkpoint has `o_a_proj.weight` as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time. This is a potential source of quality loss but is NOT the cause of the garbage output (the GEMM bug was). May revisit for quality optimization after the kernel fix is confirmed.
### 7. ✅ BF16 reference comparison — COSINE ≈ 0
**Status: CONFIRMED.** Cosine similarity ≈ 0 between NVFP4 GEMM and BF16 dequantized reference across all 8 TP ranks. This proved the problem was in the CUTLASS GEMM itself, not upstream.
```
[TP0] cosine=-0.001789 mse=1.0201e+01 nvfp4_amax=8.5625 ref_amax=8.0000
[TP1] cosine= 0.030470 mse=1.0157e+01 nvfp4_amax=8.0625 ref_amax=8.3125
[TP2] cosine=-0.009217 mse=9.5978e+00 nvfp4_amax=9.1875 ref_amax=7.5312
[TP3] cosine= 0.001786 mse=9.4161e+00 nvfp4_amax=8.6875 ref_amax=8.8750
[TP4] cosine= 0.007108 mse=7.5709e+00 nvfp4_amax=7.3125 ref_amax=8.8750
[TP5] cosine=-0.000572 mse=7.8290e+00 nvfp4_amax=7.5938 ref_amax=10.562
[TP6] cosine= 0.012143 mse=9.2720e+00 nvfp4_amax=8.0000 ref_amax=8.1250
[TP7] cosine=-0.010009 mse=9.0296e+00 nvfp4_amax=6.6250 ref_amax=36.500
```
### 8. ✅ CUTLASS SF remap `size` vs `cosize` bug (commit `c384198`) — partial fix
**Status: Fixed but insufficient.** Changing `size` to `cosize` was necessary (tile-padding positions were unwritten) but did NOT resolve the cosine ≈ 0 problem. The real issue was the M/K swap in coordinate extraction (hypothesis #9).
### 9. ✅ SF remap M/K coordinate swap — ROOT CAUSE (commit `deb6b32`)
**Status: FIXED, awaiting rebuild verification.** The SF remap kernel had M and K coordinates swapped in the flattened coordinate extraction. The CUTLASS `Sm1xxBlockScaledConfig` uses a K-major SfAtom with `Step<_2,_1>`, meaning `get<0..2>` maps to the K dimension and `get<3..5>` maps to the M dimension. Our code had it backwards.
**How we proved it:**
1. `cosize` fix alone didn't resolve cosine ≈ 0
2. All-ones test (uniform SF) still passed — coordinate bugs are invisible with uniform data
3. Isolated SFA vs SFB: both broken (cosine 0.16, 0.21)
4. Analyzed CUTLASS source: `Sm1xxBlockScaledBasicChunk` uses `SfKMajorAtom` where first group = K, second = M
5. Added printf diagnostics to verify at runtime
## Key Commits
| Commit | Description |
|--------|-------------|
| `da5572f` | Stop folding global scale into float8 block scales (precision loss fix) |
| `d0ed3d8` | Add L2, SiLU, and scatter pipeline prints |
| `995589a` | Add FP4 quantization round-trip diagnostic |
| `c421a66` | Add BF16 reference GEMM + cosine comparison for L1 |
| `2fd55a9` | Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |
| `9159cb6` | Add DEBUG_LOG.md documentation |
| `de8acc7` | Dump raw GEMM inputs + first 8 output values |
| `755f9ad` | Fix per_expert_alpha ref + clean up BF16 reference scaling |
| `df916b8` | Fix gs.item() for multi-element tensor |
| `7739674` | Fix gs scalar conversion with .cpu().tolist() + add traceback |
| `1b63a46` | Update DEBUG_LOG with cosine≈0 finding |
| `fee5a97` | Fix cosine_similarity dim for M>0 |
| `f9330a1` | Standalone M=1 GEMM test with deterministic data |
| `363dd89` | Dimension sweep to isolate GEMM bug |
| `60f7f60` | Ultra-minimal GEMM with all-ones (cosine=1.0!) |
| `67dcfa8` | Random data at small dims + alpha sweep |
| `c384198` | Fix: SF remap uses cute::cosize() instead of cute::size() |
| `deb6b32` | **FIX: Swap M/K in SF remap coordinate extraction + add printf diagnostics** |
## Bugs Fixed During This Debug Session
### 🔴 ROOT CAUSE: SF remap M/K coordinate swap (commit `deb6b32`)
**Bug:** The SF remap kernel in `cutlass_nvfp4_gemm.cu` had M and K coordinates swapped in the flattened coordinate extraction. The code assumed `get<0..2>` = M group and `get<4..5>` = K group, but the CUTLASS `SfKMajorAtom` layout has K first and M second (K-major, with `Step<_2,_1>` tiling).
**Previous (wrong):**
```cpp
m = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
k_sf = get<4>(flat) + get<5>(flat) * 4;
```
**Fixed:**
```cpp
k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
m = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);
```
**Why the original code looked correct:** The comment said `((32, 4, n_m_tiles), (16, 4, n_k_tiles))` — M first, K second. But this is the *logical* M/K assignment, not the *physical* flattened order. The actual CUTE layout for K-major SF puts the K group first in the flattened coordinate.
**Impact:** Every SF value was read from `src[wrong_m * K_sf + wrong_k_sf]` instead of `src[m * K_sf + k_sf]`, producing completely unrelated scale factors. The GEMM output was essentially random (cosine ≈ 0).
### SF remap `size` vs `cosize` (commit `c384198`) — necessary but insufficient
**Bug:** Iteration bound used `cute::size` (logical) instead of `cute::cosize` (physical). Tile-padding positions were never written.
**Impact:** With uniform SF, invisible. With non-uniform SF, additional corruption on top of the M/K swap bug. Both fixes are needed.
### Weight nibble unpack reshape bug (commit `2fd55a9`)
**Bug:** In the BF16 reference diagnostic, `reshape(K_half, -1)` on 2D weight flattened N dimension.
**Fix:** `reshape(K_half*2, N)`.
**Impact:** Only diagnostic code.
### BF16 reference diagnostic: multiple bugs (commits `c421a66`→`7739674`)
1. **Weight reshape:** `reshape(K_half, -1)``reshape(K_half*2, N)`
2. **per_expert_alpha not defined:** reference code ran before alpha was computed
3. **gs.item() on multi-element tensor:** `gs` is shape (2,); fixed with `.cpu().tolist()`
4. **igs double-count:** multiplying by igs in both x_bf16 and final output
**Impact:** All bugs only in diagnostic code.
## Architecture Notes
### DeepSeek-V4 MoE Layer Forward Pass
```
residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x) ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)
residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x) ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)
```
### NVFP4 MoE Pipeline
```
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
```
### Checkpoint Layers (layer 0)
- **MoE experts 0-210, 212-255:** gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
- **Expert 211:** shared expert, gate_proj + up_proj only (no down_proj)
- **o_a_proj.weight:** BF16 (16384, 4096) — NOT quantized by ModelOpt
- **o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor:** NVFP4
- **Gate weight, norms, sinks, position_bias:** BF16
## Next Steps
1. **Rebuild container with M/K swap fix** — Mike rebuilds with commit `deb6b32`
2. **Run standalone random GEMM test** — should now show cosine ≈ 1.0 with random data
3. **Check printf diagnostics** — verify the coordinate mapping is correct
4. **Run deterministic prompt** — "The capital of France is" should produce "Paris"
5. **If output is still off:** the M/K swap fix may need refinement — the `m` stride calculation (`InputSFVectorSize * 4`) may not be correct for all cases
6. **Once working:** remove printf diagnostics from production code, clean up debug prints
7. **Quality optimization:** investigate o_a_proj BF16→NVFP4 quantization (hypothesis #6)