212 lines
12 KiB
Markdown
212 lines
12 KiB
Markdown
# NVFP4 MegaMoE Debug Log
|
||
|
||
## Current State (May 15, 2026)
|
||
|
||
**Status:** Second root cause identified — SF remap coordinate extraction has M/K swapped. Awaiting rebuild and test.
|
||
|
||
### Root Cause #1 (partially fixed): `cute::size` vs `cute::cosize` (commit `c384198`)
|
||
|
||
The SF remap kernel used `cute::size(layout_sf)` as the iteration bound instead of `cute::cosize(layout_sf)`. This left tile-padding positions unwritten (zero). Fix: one-line change `size` → `cosize`. However, this fix alone did NOT resolve the cosine ≈ 0 problem — random data still produced garbage.
|
||
|
||
### Root Cause #2 (current): M/K coordinates swapped in SF remap (commit `deb6b32`)
|
||
|
||
After the cosize fix failed to resolve the issue, we ran deeper diagnostics:
|
||
- **All-ones test (M=1, N=32, K=32):** cosine = 1.0 ✅ (uniform SF masks any coordinate bug)
|
||
- **Random data (same dimensions):** cosine ≈ 0.2 ❌
|
||
- **Isolated SFA and SFB remap:** both broken (cosine 0.16 and 0.21 respectively)
|
||
|
||
The remap kernel's coordinate extraction assumed `get<0..2>` = M group and `get<4..5>` = K group. But analysis of the CUTLASS `Sm1xxBlockScaledConfig` layout reveals the opposite: the SfAtom is K-major with `Step<_2,_1>`, meaning the first atom dimension tiles along K (problem dim 1) and the second tiles along M (problem dim 0). So `get<0..2>` = K group, `get<3..5>` = M group.
|
||
|
||
**Previous (wrong):**
|
||
```cpp
|
||
m = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
|
||
k_sf = get<4>(flat) + get<5>(flat) * 4;
|
||
```
|
||
|
||
**Fixed (commit `deb6b32`):**
|
||
```cpp
|
||
k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
|
||
m = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);
|
||
```
|
||
|
||
Also added printf diagnostics in the remap kernel to print the first 10 coordinate mappings, so we can verify the extraction at runtime.
|
||
|
||
**Why the M/K swap produces cosine ≈ 0 instead of just a permuted output:** The source SF data is row-major `(M, K_sf)` for SFA. If we read `src[wrong_m * K_sf + wrong_k_sf]` instead of `src[m * K_sf + k_sf]`, and the wrong indices don't correspond to valid source positions, we get completely unrelated SF values. This corrupts the per-block scaling, making the GEMM output essentially random relative to the correct answer.
|
||
|
||
## How We Found It
|
||
|
||
### Step 1: Pipeline trace
|
||
Added debug prints at every stage (L1 GEMM, SiLU, L2 GEMM, scatter). All magnitudes reasonable, no NaN. The signal was present but buried.
|
||
|
||
### Step 2: BF16 reference comparison
|
||
Built a reference path that dequantizes FP4→BF16 and runs a standard matmul. Compared to the CUTLASS GEMM output. **Result: cosine ≈ 0** across all 8 TP ranks — the GEMM output was essentially uncorrelated with the correct answer.
|
||
|
||
### Step 3: Standalone GEMM tests
|
||
- **All-ones data** (M=1, N=32, K=32): cosine = 1.0 ✅
|
||
- **Random data** (M=1, N=32, K=32): cosine ≈ 0.2 ❌
|
||
- **Random data** (M=128, N=6144, K=7168): cosine ≈ 0 ❌
|
||
|
||
The all-ones test passing proved the GEMM math and data layout were correct. Random data failing proved the SF handling was broken for non-uniform values.
|
||
|
||
### Step 4: Found the bug
|
||
The CU file had a comment on lines 114-115 explicitly warning: "Allocation must use cute::cosize() (physical size including tile padding), not cute::size() (logical size)." All allocation sites used `cosize` correctly. But the **iteration bound** in the remap kernel (line 128) used `size`. One line we missed when we previously audited size→cosize.
|
||
|
||
## Hypotheses Investigated
|
||
|
||
### 1. ❌ NaN/Inf in GEMM
|
||
Ruled out. All outputs finite, no NaN detected at any stage.
|
||
|
||
### 2. ❌ Weight shape mismatch
|
||
Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).
|
||
|
||
### 3. ❌ Global scale folding precision loss
|
||
Previously identified (commit `da5572f`). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision. Fixed by passing global scales as per-expert alpha. Did not fix the garbage output (wrong root cause).
|
||
|
||
### 4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)
|
||
Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation. Standalone test showed cosine=1.0 but only with uniform SF data.
|
||
|
||
### 5. ❌ E2M1 packing convention mismatch
|
||
Investigated but ruled out. Both `stage_activation` and checkpoint weights use the same packing (even→low nibble, odd→high nibble). The all-ones test proved packing is correct.
|
||
|
||
### 6. 🔍 Attention output corruption from o_a_proj quantization
|
||
**Status: Deferred.** The checkpoint has `o_a_proj.weight` as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time. This is a potential source of quality loss but is NOT the cause of the garbage output (the GEMM bug was). May revisit for quality optimization after the kernel fix is confirmed.
|
||
|
||
### 7. ✅ BF16 reference comparison — COSINE ≈ 0
|
||
**Status: CONFIRMED.** Cosine similarity ≈ 0 between NVFP4 GEMM and BF16 dequantized reference across all 8 TP ranks. This proved the problem was in the CUTLASS GEMM itself, not upstream.
|
||
|
||
```
|
||
[TP0] cosine=-0.001789 mse=1.0201e+01 nvfp4_amax=8.5625 ref_amax=8.0000
|
||
[TP1] cosine= 0.030470 mse=1.0157e+01 nvfp4_amax=8.0625 ref_amax=8.3125
|
||
[TP2] cosine=-0.009217 mse=9.5978e+00 nvfp4_amax=9.1875 ref_amax=7.5312
|
||
[TP3] cosine= 0.001786 mse=9.4161e+00 nvfp4_amax=8.6875 ref_amax=8.8750
|
||
[TP4] cosine= 0.007108 mse=7.5709e+00 nvfp4_amax=7.3125 ref_amax=8.8750
|
||
[TP5] cosine=-0.000572 mse=7.8290e+00 nvfp4_amax=7.5938 ref_amax=10.562
|
||
[TP6] cosine= 0.012143 mse=9.2720e+00 nvfp4_amax=8.0000 ref_amax=8.1250
|
||
[TP7] cosine=-0.010009 mse=9.0296e+00 nvfp4_amax=6.6250 ref_amax=36.500
|
||
```
|
||
|
||
### 8. ✅ CUTLASS SF remap `size` vs `cosize` bug (commit `c384198`) — partial fix
|
||
**Status: Fixed but insufficient.** Changing `size` to `cosize` was necessary (tile-padding positions were unwritten) but did NOT resolve the cosine ≈ 0 problem. The real issue was the M/K swap in coordinate extraction (hypothesis #9).
|
||
|
||
### 9. ✅ SF remap M/K coordinate swap — ROOT CAUSE (commit `deb6b32`)
|
||
**Status: FIXED, awaiting rebuild verification.** The SF remap kernel had M and K coordinates swapped in the flattened coordinate extraction. The CUTLASS `Sm1xxBlockScaledConfig` uses a K-major SfAtom with `Step<_2,_1>`, meaning `get<0..2>` maps to the K dimension and `get<3..5>` maps to the M dimension. Our code had it backwards.
|
||
|
||
**How we proved it:**
|
||
1. `cosize` fix alone didn't resolve cosine ≈ 0
|
||
2. All-ones test (uniform SF) still passed — coordinate bugs are invisible with uniform data
|
||
3. Isolated SFA vs SFB: both broken (cosine 0.16, 0.21)
|
||
4. Analyzed CUTLASS source: `Sm1xxBlockScaledBasicChunk` uses `SfKMajorAtom` where first group = K, second = M
|
||
5. Added printf diagnostics to verify at runtime
|
||
|
||
## Key Commits
|
||
|
||
| Commit | Description |
|
||
|--------|-------------|
|
||
| `da5572f` | Stop folding global scale into float8 block scales (precision loss fix) |
|
||
| `d0ed3d8` | Add L2, SiLU, and scatter pipeline prints |
|
||
| `995589a` | Add FP4 quantization round-trip diagnostic |
|
||
| `c421a66` | Add BF16 reference GEMM + cosine comparison for L1 |
|
||
| `2fd55a9` | Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |
|
||
| `9159cb6` | Add DEBUG_LOG.md documentation |
|
||
| `de8acc7` | Dump raw GEMM inputs + first 8 output values |
|
||
| `755f9ad` | Fix per_expert_alpha ref + clean up BF16 reference scaling |
|
||
| `df916b8` | Fix gs.item() for multi-element tensor |
|
||
| `7739674` | Fix gs scalar conversion with .cpu().tolist() + add traceback |
|
||
| `1b63a46` | Update DEBUG_LOG with cosine≈0 finding |
|
||
| `fee5a97` | Fix cosine_similarity dim for M>0 |
|
||
| `f9330a1` | Standalone M=1 GEMM test with deterministic data |
|
||
| `363dd89` | Dimension sweep to isolate GEMM bug |
|
||
| `60f7f60` | Ultra-minimal GEMM with all-ones (cosine=1.0!) |
|
||
| `67dcfa8` | Random data at small dims + alpha sweep |
|
||
| `c384198` | Fix: SF remap uses cute::cosize() instead of cute::size() |
|
||
| `deb6b32` | **FIX: Swap M/K in SF remap coordinate extraction + add printf diagnostics** |
|
||
|
||
## Bugs Fixed During This Debug Session
|
||
|
||
### 🔴 ROOT CAUSE: SF remap M/K coordinate swap (commit `deb6b32`)
|
||
|
||
**Bug:** The SF remap kernel in `cutlass_nvfp4_gemm.cu` had M and K coordinates swapped in the flattened coordinate extraction. The code assumed `get<0..2>` = M group and `get<4..5>` = K group, but the CUTLASS `SfKMajorAtom` layout has K first and M second (K-major, with `Step<_2,_1>` tiling).
|
||
|
||
**Previous (wrong):**
|
||
```cpp
|
||
m = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
|
||
k_sf = get<4>(flat) + get<5>(flat) * 4;
|
||
```
|
||
|
||
**Fixed:**
|
||
```cpp
|
||
k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
|
||
m = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);
|
||
```
|
||
|
||
**Why the original code looked correct:** The comment said `((32, 4, n_m_tiles), (16, 4, n_k_tiles))` — M first, K second. But this is the *logical* M/K assignment, not the *physical* flattened order. The actual CUTE layout for K-major SF puts the K group first in the flattened coordinate.
|
||
|
||
**Impact:** Every SF value was read from `src[wrong_m * K_sf + wrong_k_sf]` instead of `src[m * K_sf + k_sf]`, producing completely unrelated scale factors. The GEMM output was essentially random (cosine ≈ 0).
|
||
|
||
### SF remap `size` vs `cosize` (commit `c384198`) — necessary but insufficient
|
||
|
||
**Bug:** Iteration bound used `cute::size` (logical) instead of `cute::cosize` (physical). Tile-padding positions were never written.
|
||
|
||
**Impact:** With uniform SF, invisible. With non-uniform SF, additional corruption on top of the M/K swap bug. Both fixes are needed.
|
||
|
||
### Weight nibble unpack reshape bug (commit `2fd55a9`)
|
||
|
||
**Bug:** In the BF16 reference diagnostic, `reshape(K_half, -1)` on 2D weight flattened N dimension.
|
||
|
||
**Fix:** `reshape(K_half*2, N)`.
|
||
|
||
**Impact:** Only diagnostic code.
|
||
|
||
### BF16 reference diagnostic: multiple bugs (commits `c421a66`→`7739674`)
|
||
|
||
1. **Weight reshape:** `reshape(K_half, -1)` → `reshape(K_half*2, N)`
|
||
2. **per_expert_alpha not defined:** reference code ran before alpha was computed
|
||
3. **gs.item() on multi-element tensor:** `gs` is shape (2,); fixed with `.cpu().tolist()`
|
||
4. **igs double-count:** multiplying by igs in both x_bf16 and final output
|
||
|
||
**Impact:** All bugs only in diagnostic code.
|
||
|
||
## Architecture Notes
|
||
|
||
### DeepSeek-V4 MoE Layer Forward Pass
|
||
```
|
||
residual = x
|
||
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
|
||
x = attn_norm(x)
|
||
x = attn(x) ← o_a_proj is BF16→NVFP4 quantized here
|
||
x = hc_post(x, residual, post, comb)
|
||
|
||
residual = x
|
||
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
|
||
x = ffn_norm(x)
|
||
x = ffn(x) ← Our NVFP4 mega_moe kernel
|
||
x = hc_post(x, residual, post, comb)
|
||
```
|
||
|
||
### NVFP4 MoE Pipeline
|
||
```
|
||
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
|
||
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
|
||
SiLU(gate) * up → activated
|
||
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
|
||
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
|
||
scatter with routing weights → y
|
||
```
|
||
|
||
### Checkpoint Layers (layer 0)
|
||
- **MoE experts 0-210, 212-255:** gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
|
||
- **Expert 211:** shared expert, gate_proj + up_proj only (no down_proj)
|
||
- **o_a_proj.weight:** BF16 (16384, 4096) — NOT quantized by ModelOpt
|
||
- **o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor:** NVFP4
|
||
- **Gate weight, norms, sinks, position_bias:** BF16
|
||
|
||
## Next Steps
|
||
|
||
1. **Rebuild container with M/K swap fix** — Mike rebuilds with commit `deb6b32`
|
||
2. **Run standalone random GEMM test** — should now show cosine ≈ 1.0 with random data
|
||
3. **Check printf diagnostics** — verify the coordinate mapping is correct
|
||
4. **Run deterministic prompt** — "The capital of France is" should produce "Paris"
|
||
5. **If output is still off:** the M/K swap fix may need refinement — the `m` stride calculation (`InputSFVectorSize * 4`) may not be correct for all cases
|
||
6. **Once working:** remove printf diagnostics from production code, clean up debug prints
|
||
7. **Quality optimization:** investigate o_a_proj BF16→NVFP4 quantization (hypothesis #6)
|