docs: update DEBUG_LOG with root cause (size→cosize) and full debug timeline

This commit is contained in:
2026-05-15 18:56:09 +00:00
parent c3841983a0
commit fa7b394571

View File

@@ -2,31 +2,37 @@
## Current State (May 15, 2026)
**Status:** Model produces garbage output. Deterministic prompt "The capital of France is" produces `-W'MSG173 ~SB…abych` instead of "Paris".
**Status:** Root cause identified and fixed. Awaiting rebuild and test.
## Symptoms
**Root cause:** The SF (scale factor) remap kernel in `cutlass_nvfp4_gemm.cu` used `cute::size(layout_sf)` as the iteration bound instead of `cute::cosize(layout_sf)`. The `size` returns the logical size; `cosize` returns the physical size including tile padding. The destination buffer was allocated with `cosize` elements (correct) and zero-initialized, but the kernel only iterated over `size` elements (incorrect), leaving tile-padding positions as zero instead of their actual SF values.
- No NaN/Inf anywhere in the pipeline
- Magnitudes at each stage look reasonable:
- L1 GEMM output: amax ~8-10
- SiLU activation: amax ~34-43
- L2 GEMM output: amax ~17-28
- Scatter output: amax ~5-15
- FP4 activation quantization round-trip: reconstructed amax ~3.86, igs=1.4e-3
- All 8 TP ranks produce identical weight shapes after transformation
- Experts have distinct weights and scales (not duplicated)
**Why it was invisible in the all-ones test:** When all SF values are identical (uniform data), missing writes don't matter — every position should have the same value, and the ones that got written have the right one. The standalone test from the previous session used a single global scale for all blocks, producing uniform SF, which is why it showed cosine 1.0.
**Key observation:** The signal is there but buried. "Paris" appears at rank 2075/129280 — the model knows the word exists (logit 9.25) but top logits point at garbage tokens. This suggests a systematic error that preserves magnitude but distorts direction.
**Why it broke with real data:** Different blocks have different SF values. The tile-padding positions in the CUTLASS interleaved SF layout need specific SF values, but they were left as zero. CUTLASS reads those positions during the GEMM, getting zero scales instead of the correct values, which scrambles the output direction while preserving approximate magnitude.
## Pipeline Trace (per layer, from last inference)
**Fix:** One-line change in `cutlass_nvfp4_gemm.cu` line 128: `cute::size``cute::cosize` (commit `c384198`).
```
[L1-GEMM-OUT] slots=1 N=6144 amax=8.6250e+00
[L1-SPLIT] gate amax=7.1250e+00 | up amax=8.6250e+00
[SILU-ACT] amax=3.4500e+01
[L2-GEMM-OUT] slots=1 N=7168 amax=1.8500e+01
[SCATTER] y amax=6.7500e+00 slots=1
```
**Original symptoms:**
- Deterministic prompt "The capital of France is" → `-W'MSG173 ~SB…abych` instead of "Paris"
- No NaN/Inf, magnitudes reasonable, but cosine similarity ≈ 0 between NVFP4 GEMM and BF16 reference
## How We Found It
### Step 1: Pipeline trace
Added debug prints at every stage (L1 GEMM, SiLU, L2 GEMM, scatter). All magnitudes reasonable, no NaN. The signal was present but buried.
### Step 2: BF16 reference comparison
Built a reference path that dequantizes FP4→BF16 and runs a standard matmul. Compared to the CUTLASS GEMM output. **Result: cosine ≈ 0** across all 8 TP ranks — the GEMM output was essentially uncorrelated with the correct answer.
### Step 3: Standalone GEMM tests
- **All-ones data** (M=1, N=32, K=32): cosine = 1.0 ✅
- **Random data** (M=1, N=32, K=32): cosine ≈ 0.2 ❌
- **Random data** (M=128, N=6144, K=7168): cosine ≈ 0 ❌
The all-ones test passing proved the GEMM math and data layout were correct. Random data failing proved the SF handling was broken for non-uniform values.
### Step 4: Found the bug
The CU file had a comment on lines 114-115 explicitly warning: "Allocation must use cute::cosize() (physical size including tile padding), not cute::size() (logical size)." All allocation sites used `cosize` correctly. But the **iteration bound** in the remap kernel (line 128) used `size`. One line we missed when we previously audited size→cosize.
## Hypotheses Investigated
@@ -37,30 +43,20 @@ Ruled out. All outputs finite, no NaN detected at any stage.
Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).
### 3. ❌ Global scale folding precision loss
Previously identified (commit `da5572f`). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision in low-precision zone. Fixed by passing global scales as per-expert alpha instead of folding. Did not fix the garbage output.
Previously identified (commit `da5572f`). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision. Fixed by passing global scales as per-expert alpha. Did not fix the garbage output (wrong root cause).
### 4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)
Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation (commit history). Standalone test shows cosine=1.0 and MSE=0.0 for random data.
Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation. Standalone test showed cosine=1.0 but only with uniform SF data.
### 5. 🔍 E2M1 packing convention mismatch
**Status: Open.** The CUTLASS kernel expects `nv_float4_t<float_e2m1_t>` packed as 2 nibbles per byte. Our `stage_activation` packs `(nibbles[..., 1] << 4) | nibbles[..., 0]` (even→low, odd→high). The checkpoint weights use the same convention. The standalone test showed cosine 1.0 with this packing, but both A and B were packed the same way — if both are wrong in the same way, the error cancels.
### 5. E2M1 packing convention mismatch
Investigated but ruled out. Both `stage_activation` and checkpoint weights use the same packing (even→low nibble, odd→high nibble). The all-ones test proved packing is correct.
### 6. 🔍 Attention output corruption from o_a_proj quantization
**Status: Active investigation.** The checkpoint has `o_a_proj.weight` as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time because the model parameter is declared uint8. This is a lossy conversion of a 64M-parameter matrix that sits right before the MoE. If the quantization error here is significant, it propagates through all 61 MoE layers.
The vLLM weight loader does:
1. Compute per-block amax for the BF16 weight
2. Compute global scale: `amax_max / (6.0 * 448.0)`
3. Compute block scales: `amax / (6.0 * global_scale)` → float8
4. Nearest-neighbor E2M1 quantization
5. Pack 2 nibbles per byte: even→low, odd→high
This may need to stay in native BF16 and route through a BF16 matmul path instead.
**Status: Deferred.** The checkpoint has `o_a_proj.weight` as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time. This is a potential source of quality loss but is NOT the cause of the garbage output (the GEMM bug was). May revisit for quality optimization after the kernel fix is confirmed.
### 7. ✅ BF16 reference comparison — COSINE ≈ 0
**Status: CONFIRMED.** The BF16 reference comparison ran (after fixing several bugs in the diagnostic code). Result: **cosine similarity ≈ 0** between NVFP4 GEMM output and BF16 dequantized reference. This means the CUTLASS kernel is producing output that is essentially uncorrelated with the correct result.
**Status: CONFIRMED.** Cosine similarity ≈ 0 between NVFP4 GEMM and BF16 dequantized reference across all 8 TP ranks. This proved the problem was in the CUTLASS GEMM itself, not upstream.
Results from all 8 TP ranks:
```
[TP0] cosine=-0.001789 mse=1.0201e+01 nvfp4_amax=8.5625 ref_amax=8.0000
[TP1] cosine= 0.030470 mse=1.0157e+01 nvfp4_amax=8.0625 ref_amax=8.3125
@@ -72,21 +68,21 @@ Results from all 8 TP ranks:
[TP7] cosine=-0.010009 mse=9.0296e+00 nvfp4_amax=6.6250 ref_amax=36.500
```
**Key insight:** The magnitudes are in the same ballpark (amax 7-10 vs 8-10), but the *direction* is completely wrong. This is NOT a scaling error — it's a systematic misalignment. The output vectors are essentially random relative to the correct answer.
### 8. ✅ CUTLASS SF remap `size` vs `cosize` bug — ROOT CAUSE
**Status: FIXED (commit `c384198`).** The SF remap kernel iterated over `cute::size()` (logical) instead of `cute::cosize()` (physical with tile padding). Tile-padding positions in the CUTLASS interleaved SF layout were never written and stayed zero. With uniform SF (all-ones test) the bug was invisible. With non-uniform SF (real data) it produced cosine ≈ 0.
**This proves the problem is in the CUTLASS GEMM itself** (or the data layout going into it), NOT in the attention, weight loading, or scaling math. The standalone test with random data showed cosine 1.0, but real data gives cosine ≈ 0. The difference must be in data layout/stride/alignment that the random test didn't exercise.
### 8. 🔍 CUTLASS GEMM layout mismatch
**Status: Active investigation.** The standalone test used random data with simple row-major layout and got cosine 1.0. Real data also uses row-major layout, but cosine ≈ 0. Possible causes:
- **SF remap incorrect for specific M/N/K dimensions** — the remap was verified with coordinate probes for the standalone test dimensions, but real MoE dimensions (M=1, N=6144, K=7168) may expose a different code path
- **Activation layout** — `stage_activation` produces flat row-major packed E2M1, but CUTLASS may expect a different micro-tiling for the A matrix
- **Weight transpose convention** — after `transform_nvfp4_weights_for_mega_moe` transpose, the weight may not be in the layout CUTLASS expects for B (column-major vs row-major interpretation)
**How we proved it:**
1. All-ones GEMM test (M=1, N=32, K=32): cosine = 1.0
2. Random data GEMM test (M=1, N=32, K=32): cosine ≈ 0.2
3. Random data sweep (multiple dimensions): cosine ≈ 0 everywhere
4. The only difference: uniform vs non-uniform SF values → SF remap is the culprit
5. Found `cute::size` on line 128 when comment explicitly said use `cute::cosize`
## Key Commits
| Commit | Description |
|--------|-------------|
| `da5572f` | Stop folding global scale into float8 block scales (25% precision loss fix) |
| `da5572f` | Stop folding global scale into float8 block scales (precision loss fix) |
| `d0ed3d8` | Add L2, SiLU, and scatter pipeline prints |
| `995589a` | Add FP4 quantization round-trip diagnostic |
| `c421a66` | Add BF16 reference GEMM + cosine comparison for L1 |
@@ -96,35 +92,44 @@ Results from all 8 TP ranks:
| `755f9ad` | Fix per_expert_alpha ref + clean up BF16 reference scaling |
| `df916b8` | Fix gs.item() for multi-element tensor |
| `7739674` | Fix gs scalar conversion with .cpu().tolist() + add traceback |
| `1b63a46` | Update DEBUG_LOG with cosine≈0 finding |
| `fee5a97` | Fix cosine_similarity dim for M>0 |
| `f9330a1` | Standalone M=1 GEMM test with deterministic data |
| `363dd89` | Dimension sweep to isolate GEMM bug |
| `60f7f60` | Ultra-minimal GEMM with all-ones (cosine=1.0!) |
| `67dcfa8` | Random data at small dims + alpha sweep |
| `c384198` | **FIX: SF remap uses cute::cosize() instead of cute::size()** |
## Bugs Fixed During This Debug Session
### 🔴 ROOT CAUSE: SF remap `size` vs `cosize` (commit `c384198`)
**Bug:** In `cutlass_nvfp4_gemm.cu` line 128, the SF remap kernel used `cute::size(layout_sf)` as the iteration bound instead of `cute::cosize(layout_sf)`. The `size` returns the logical element count; `cosize` returns the physical size including tile padding. The destination buffer was correctly allocated with `cosize` elements and zero-initialized, but the kernel only wrote to `size` positions, leaving tile-padding positions as zero.
**Why it was missed in the previous audit:** We changed all *allocation* sites from `size` to `cosize` (lines 179, 180, 232, 246, 287). The comment on lines 114-115 explicitly warned about this. But the *iteration bound* in the remap kernel itself (line 128) was overlooked — it was a different context (kernel launch parameter, not buffer allocation).
**Why the standalone test passed:** The previous standalone test used a single global scale for all blocks, producing uniform SF values. When all SF values are identical, missing writes don't matter — every position gets the same value regardless of which positions are written. The all-ones test in this session (M=1, N=32, K=32, cosine=1.0) confirmed this.
**Fix:** `int total = cute::size(layout_sf);``int total = cute::cosize(layout_sf);`
**Impact:** This was the root cause of all garbage output. Every GEMM call with non-uniform SF values was producing scrambled results.
### Weight nibble unpack reshape bug (commit `2fd55a9`)
**Bug:** In the BF16 reference diagnostic, `torch.stack([wlo, whi], dim=-1).reshape(w_u8.shape[0], -1)` on a 2D weight of shape `(K_half, N)` = `(3584, 6144)` produced `(3584, 12288)` instead of `(7168, 6144)`. The `-1` was consuming the N dimension.
**Bug:** In the BF16 reference diagnostic, `reshape(K_half, -1)` on 2D weight flattened N dimension.
**Fix:** Changed to `.reshape(w_u8.shape[0] * 2, w_u8.shape[1])` to preserve the column (N) dimension and double the row (K) dimension.
**Fix:** `reshape(K_half*2, N)`.
**Impact:** Only affected the BF16 reference diagnostic code, not the actual NVFP4 kernel. The CUTLASS kernel receives weights already in the correct packed format.
### igs double-count in reference (commit `2fd55a9`)
**Bug:** The BF16 reference multiplied by `igs` (input global scale) in `x_bf16` AND again in `ref_out = ref_out * igs`.
**Fix:** Removed the final `ref_out * igs` — it's already included via `x_bf16 = x_deq * sf_exp * igs`.
**Impact:** Only affected the BF16 reference diagnostic, not the kernel.
**Impact:** Only diagnostic code.
### BF16 reference diagnostic: multiple bugs (commits `c421a66`→`7739674`)
The BF16 reference comparison had a cascade of bugs that took 4 iterations to fix:
1. **Weight reshape:** `reshape(K_half, -1)``reshape(K_half*2, N)`
2. **per_expert_alpha not defined:** reference code ran before alpha was computed
3. **gs.item() on multi-element tensor:** `gs` is shape (2,); fixed with `.cpu().tolist()`
4. **igs double-count:** multiplying by igs in both x_bf16 and final output
1. **Weight reshape bug (commit `2fd55a9`):** `reshape(K_half, -1)` on 2D weight flattened N dimension. Fixed: `reshape(K_half*2, N)`.
2. **per_expert_alpha not defined (commit `755f9ad`):** The reference code ran before `per_expert_alpha` was computed. Fixed: use `l1_alpha * l1_global_sf[e0]` directly.
3. **gs.item() on multi-element tensor (commits `df916b8`, `7739674`):** `gs` is shape (2,) — `gs[0].item()` should work but didn't in context. Fixed: `gs.detach().cpu().tolist()`.
4. **igs double-count (commit `2fd55a9`):** Multiplying by igs in both x_bf16 and the final output. Fixed: apply igs once in x, apply gs per-half separately.
**Impact:** All bugs only in diagnostic code. The actual NVFP4 kernel was never affected.
**Impact:** All bugs only in diagnostic code.
## Architecture Notes
@@ -162,7 +167,8 @@ scatter with routing weights → y
## Next Steps
1. **Get BF16 reference cosine** — determine if the CUTLASS GEMM is correct
2. **If cosine ≈ 1.0:** Problem is upstream (attention, likely o_a_proj). Fix: keep o_a_proj in native BF16
3. **If cosine << 1.0:** Problem is in the CUTLASS GEMM or the activation quantization. Need to debug the kernel itself
4. **Test with SKIP_ATTENTION=1** — bypass attention, feed raw input to MoE. If output improves, confirms attention is the issue
1. **Rebuild container with cosize fix** — Mike rebuilds with commit `c384198`
2. **Run deterministic prompt** — "The capital of France is" should produce "Paris"
3. **Run standalone random GEMM test** — should now show cosine ≈ 1.0 with random data
4. **If output is still off:** investigate o_a_proj BF16→NVFP4 quantization (hypothesis #6)
5. **Once working:** clean up debug prints from production code