# NVFP4 MegaMoE Debug Log
## Current State (May 15, 2026)
**Status:** Model produces garbage output. Deterministic prompt "The capital of France is" produces `-W'MSG173 ~SB…abych` instead of "Paris".
## Symptoms
- No NaN/Inf anywhere in the pipeline
- Magnitudes at each stage look reasonable:
  - L1 GEMM output: amax ~8-10
  - SiLU activation: amax ~34-43
  - L2 GEMM output: amax ~17-28
  - Scatter output: amax ~5-15
- FP4 activation quantization round-trip: reconstructed amax ~3.86, igs=1.4e-3
- All 8 TP ranks produce identical weight shapes after transformation
- Experts have distinct weights and scales (not duplicated)
**Key observation:** The signal is there but buried. "Paris" appears at rank 2075/129280 — the model knows the word exists (logit 9.25) but top logits point at garbage tokens. This suggests a systematic error that preserves magnitude but distorts direction.
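For reference, a rank like the one quoted above can be computed as one plus the number of logits that score strictly higher than the target token. This is a minimal sketch; `token_rank` and the toy logits are illustrative and not taken from the model code.

```python
def token_rank(logits, target_id):
    """Return the 1-based rank of target_id (1 = highest logit)."""
    target = logits[target_id]
    return 1 + sum(1 for v in logits if v > target)

# Toy vocabulary of four tokens; index 1 plays the role of "Paris".
toy_logits = [1.0, 9.25, 12.0, 10.5]
paris_rank = token_rank(toy_logits, 1)
print(f"rank {paris_rank}/{len(toy_logits)}")
```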
## Pipeline Trace (per layer, from last inference)
```
[L1-GEMM-OUT] slots=1 N=6144 amax=8.6250e+00
[L1-SPLIT] gate amax=7.1250e+00 | up amax=8.6250e+00
[SILU-ACT] amax=3.4500e+01
[L2-GEMM-OUT] slots=1 N=7168 amax=1.8500e+01
[SCATTER] y amax=6.7500e+00 slots=1
```
## Hypotheses Investigated
### 1. ❌ NaN/Inf in GEMM
Ruled out. All outputs finite, no NaN detected at any stage.
### 2. ❌ Weight shape mismatch
Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).
### 3. ❌ Global scale folding precision loss
Previously identified (commit `da5572f`). Folding the float8 block_sf × float32 global_sf product back into float8 loses ~25% precision in float8's low-precision range. Fixed by passing the global scales as a per-expert alpha instead of folding them. This did not fix the garbage output.
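The effect can be reproduced in isolation. The sketch below assumes the float8 format is E4M3 (3 mantissa bits, smallest normal 2^-6, maximum 448) and uses made-up scale values; `round_to_e4m3` is a hypothetical helper, not a function from the kernel or from vLLM.

```python
import math

def round_to_e4m3(x: float) -> float:
    """Round a non-negative float to the nearest FP8 E4M3 value (saturating at 448)."""
    if x <= 0.0:
        return 0.0
    x = min(x, 448.0)
    _, exp = math.frexp(x)      # x = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)        # values below 2**-6 are subnormal
    step = 2.0 ** (e - 3)       # 3 mantissa bits -> 8 steps per binade
    return round(x / step) * step

block_sf = 0.5       # hypothetical block scale, exactly representable in E4M3
global_sf = 0.005    # hypothetical float32 global scale

exact = block_sf * global_sf
folded = round_to_e4m3(exact)                    # fold first, then store as float8
separate = round_to_e4m3(block_sf) * global_sf   # store block scale, apply global scale as alpha

folded_err = abs(folded - exact) / exact
separate_err = abs(separate - exact) / exact
print(f"folded error:   {folded_err:.1%}")
print(f"separate error: {separate_err:.1%}")
```

In this toy case the folded product lands in the E4M3 subnormal range and loses roughly 22% of its value, while keeping the global scale in float32 is exact. The exact loss depends on the actual scale values.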
### 4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)
Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation (commit history). Standalone test shows cosine=1.0 and MSE=0.0 for random data.
### 5. 🔍 E2M1 packing convention mismatch
**Status: Open.** The CUTLASS kernel expects `nv_float4_t<float_e2m1_t>` packed as 2 nibbles per byte. Our `stage_activation` packs `(nibbles[..., 1] << 4) | nibbles[..., 0]` (even→low, odd→high). The checkpoint weights use the same convention. The standalone test showed cosine 1.0 with this packing, but both A and B were packed the same way, so a shared packing error would cancel out and the test cannot rule it out.
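The cancellation argument can be checked on a toy dot product. The sketch below uses the standard E2M1 value table and the even→low, odd→high packing described above; the helper names and the sample codes are illustrative. Swapping the nibble order on both operands permutes K identically for A and B, so the dot product is unchanged, whereas swapping only one operand changes it.

```python
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitude per 3-bit code

def decode(code: int) -> float:
    """Decode a 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    mag = E2M1_LEVELS[code & 0x7]
    return -mag if code & 0x8 else mag

def pack(nibbles):
    """Pack pairs of 4-bit codes: even index -> low nibble, odd index -> high nibble."""
    return bytes((nibbles[i + 1] << 4) | nibbles[i] for i in range(0, len(nibbles), 2))

def unpack(packed, low_first=True):
    """Unpack bytes to 4-bit codes; low_first=False simulates the opposite convention."""
    out = []
    for byte in packed:
        lo, hi = byte & 0xF, byte >> 4
        out.extend((lo, hi) if low_first else (hi, lo))
    return out

def dot(a_codes, b_codes):
    return sum(decode(x) * decode(y) for x, y in zip(a_codes, b_codes))

a = [1, 2, 3, 4, 9, 6, 7, 0]    # arbitrary sample codes along K
b = [2, 2, 5, 1, 3, 12, 4, 7]

reference = dot(a, b)
both_swapped = dot(unpack(pack(a), False), unpack(pack(b), False))
one_swapped = dot(unpack(pack(a), True), unpack(pack(b), False))
print(reference, both_swapped, one_swapped)
```

A test that can detect a convention mismatch therefore needs one operand produced by an independent path, for example a reference that never goes through the same pack routine.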
### 6. 🔍 Attention output corruption from o_a_proj quantization
**Status: Active investigation.** The checkpoint has `o_a_proj.weight` as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time because the model parameter is declared uint8. This is a lossy conversion of a 64M-parameter matrix that sits right before the MoE. If the quantization error here is significant, it propagates through all 61 MoE layers.
The vLLM weight loader does:
1. Compute per-block amax for the BF16 weight
2. Compute global scale: `amax_max / (6.0 * 448.0)`
3. Compute block scales: `amax / (6.0 * global_scale)` → float8
4. Nearest-neighbor E2M1 quantization
5. Pack 2 nibbles per byte: even→low, odd→high
`o_a_proj` may need to stay in native BF16 and route through a BF16 matmul path instead.
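The five loader steps listed above can be sketched for a single row as follows. This is not vLLM's code: it is a pure-Python illustration that assumes a block size of 16 (consistent with K=7168 and 448 scale rows in the shapes above) and E4M3 block scales. All function names are hypothetical.

```python
import math

E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed NVFP4 block size

def round_to_e4m3(x: float) -> float:
    """Round a non-negative float to the nearest FP8 E4M3 value (saturating at 448)."""
    if x <= 0.0:
        return 0.0
    x = min(x, 448.0)
    _, exp = math.frexp(x)
    step = 2.0 ** (max(exp - 1, -6) - 3)
    return round(x / step) * step

def quantize_row(row):
    blocks = [row[i:i + BLOCK] for i in range(0, len(row), BLOCK)]
    amax = [max(abs(v) for v in blk) for blk in blocks]                  # step 1
    global_scale = max(amax) / (6.0 * 448.0) or 1.0                      # step 2 (guard all-zero rows)
    block_sf = [round_to_e4m3(a / (6.0 * global_scale)) for a in amax]   # step 3
    codes = []
    for blk, sf in zip(blocks, block_sf):
        denom = sf * global_scale
        for v in blk:
            t = abs(v) / denom if denom > 0 else 0.0
            idx = min(range(8), key=lambda i: abs(E2M1_LEVELS[i] - t))   # step 4
            codes.append(idx | (0x8 if v < 0 else 0))
    packed = bytes((codes[i + 1] << 4) | codes[i]                        # step 5
                   for i in range(0, len(codes), 2))
    return packed, block_sf, global_scale

def dequantize_row(packed, block_sf, global_scale):
    out = []
    for i, byte in enumerate(packed):
        sf = block_sf[(2 * i) // BLOCK]          # both nibbles of a byte share a block
        for code in (byte & 0xF, byte >> 4):     # low nibble is the even element
            mag = E2M1_LEVELS[code & 0x7]
            out.append((-mag if code & 0x8 else mag) * sf * global_scale)
    return out

row = [math.sin(0.37 * i) * (1 + i % 5) for i in range(64)]   # synthetic data
packed, block_sf, global_scale = quantize_row(row)
restored = dequantize_row(packed, block_sf, global_scale)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"max abs error {max_err:.3f} vs amax {max(abs(v) for v in row):.3f}")
```

Running the same round trip on the real `o_a_proj` weight and comparing `restored` against the BF16 original would give a direct measure of how lossy the load-time quantization is.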
### 7. 🔍 BF16 reference comparison
**Status: In progress.** Adding a diagnostic that dequantizes FP4 activation + FP4 weights back to BF16, runs a reference matmul, then compares to the NVFP4 GEMM output via cosine similarity. This will isolate whether the CUTLASS kernel is producing correct output given the same quantized inputs.
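Cosine similarity is the right metric here because the amax checks already pass: an error that permutes or mixes elements keeps the magnitude and only changes direction. A minimal sketch of the comparison, with toy vectors standing in for the reference and kernel outputs (the `cosine` helper is illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

ref_out = [1.0, 2.0, 3.0, 4.0]      # stand-in for the BF16 reference output
kernel_out = [2.0, 1.0, 4.0, 3.0]   # same values with neighbouring elements swapped

same_amax = max(map(abs, ref_out)) == max(map(abs, kernel_out))
cos = cosine(ref_out, kernel_out)
print(f"same amax: {same_amax}, cosine: {cos:.4f}")
```

Cosine is also insensitive to a uniform scale factor, so it isolates layout and packing errors from global-scale errors. Those need a separate magnitude comparison.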
## Key Commits
| Commit | Description |
|--------|-------------|
| `da5572f` | Stop folding global scale into float8 block scales (25% precision loss fix) |
| `d0ed3d8` | Add L2, SiLU, and scatter pipeline prints |
| `995589a` | Add FP4 quantization round-trip diagnostic |
| `c421a66` | Add BF16 reference GEMM + cosine comparison for L1 |
| `2fd55a9` | Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |
## Bugs Fixed During This Debug Session
### Weight nibble unpack reshape bug (commit `2fd55a9`)
**Bug:** In the BF16 reference diagnostic, `torch.stack([wlo, whi], dim=-1).reshape(w_u8.shape[0], -1)` on a 2D weight of shape `(K_half, N)` = `(3584, 6144)` produced `(3584, 12288)` instead of `(7168, 6144)`. The `-1` absorbed both the N dimension and the stacked nibble dimension, doubling the columns instead of the rows.
**Fix:** Changed to `.reshape(w_u8.shape[0] * 2, w_u8.shape[1])` to preserve the column (N) dimension and double the row (K) dimension.
**Impact:** Only affected the BF16 reference diagnostic code, not the actual NVFP4 kernel. The CUTLASS kernel receives weights already in the correct packed format.
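A correct shape does not by itself confirm element order, so it may be worth checking the unpacked weights against known codes. The sketch below uses NumPy as a stand-in for torch (`np.stack(..., axis=...)` mirrors `torch.stack(..., dim=...)`) and assumes each byte holds two consecutive K values of the same column. That layout is an assumption, not something the log confirms. Under it, stacking on the last axis and reshaping to `(K, N)` interleaves nibbles along N, while stacking on axis 1 restores the rows.

```python
import numpy as np

K_half, N = 3, 4
# Known 4-bit codes of the logical (K, N) weight.
true_codes = (np.arange(2 * K_half * N) % 16).astype(np.uint8).reshape(2 * K_half, N)

# Assumed layout: byte [kh, n] = code[2*kh, n] (low nibble) | code[2*kh + 1, n] (high nibble).
w_u8 = true_codes[0::2] | (true_codes[1::2] << 4)

wlo, whi = w_u8 & 0xF, w_u8 >> 4

buggy = np.stack([wlo, whi], axis=-1).reshape(w_u8.shape[0], -1)
shape_fixed = np.stack([wlo, whi], axis=-1).reshape(w_u8.shape[0] * 2, w_u8.shape[1])
order_checked = np.stack([wlo, whi], axis=1).reshape(w_u8.shape[0] * 2, w_u8.shape[1])

print(buggy.shape, shape_fixed.shape, order_checked.shape)
print("shape_fixed matches:", np.array_equal(shape_fixed, true_codes))
print("order_checked matches:", np.array_equal(order_checked, true_codes))
```

If the checkpoint packs along a different axis, the right stacking axis changes accordingly. The point is only that the reference should be validated on a tensor with known contents before its cosine is trusted.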
### igs double-count in reference (commit `2fd55a9`)
**Bug:** The BF16 reference multiplied by `igs` (input global scale) in `x_bf16` AND again in `ref_out = ref_out * igs`.
**Fix:** Removed the final `ref_out * igs` — it's already included via `x_bf16 = x_deq * sf_exp * igs`.
**Impact:** Only affected the BF16 reference diagnostic, not the kernel.
## Architecture Notes
### DeepSeek-V4 MoE Layer Forward Pass
```
residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x) ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)
residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x) ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)
```
### NVFP4 MoE Pipeline
```
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
```
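The activation step also gives a cheap consistency check on the trace: since |SiLU(g)| ≤ |g|, the activated amax can never exceed the product of the gate and up amax values. The sketch below assumes the L1 output is laid out as `[gate | up]` halves, which the log does not state; the helper names and the toy row are illustrative.

```python
import math

def silu(x: float) -> float:
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return x * ex / (1.0 + ex)

def swiglu(gate_up):
    """Assumes the first half of gate_up is gate and the second half is up."""
    half = len(gate_up) // 2
    gate, up = gate_up[:half], gate_up[half:]
    return [silu(g) * u for g, u in zip(gate, up)]

def amax(values):
    return max(abs(v) for v in values)

# Values from the pipeline trace: gate amax 7.125, up amax 8.625, SiLU output amax 34.5.
bound = 7.125 * 8.625
trace_ok = 34.5 <= bound

toy = [7.125, -2.0, 0.5, 8.625, 3.0, -1.0]   # hypothetical [gate | up] row
activated = swiglu(toy)
print(f"trace within bound: {trace_ok}, toy amax: {amax(activated):.3f}")
```

The traced values satisfy the bound, which is consistent with the activation itself behaving sensibly. It says nothing about whether gate and up are split in the order the kernel expects.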
### Checkpoint Layers (layer 0)
- **MoE experts 0-210, 212-255:** gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
- **Expert 211:** shared expert, gate_proj + up_proj only (no down_proj)
- **o_a_proj.weight:** BF16 (16384, 4096) — NOT quantized by ModelOpt
- **o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor:** NVFP4
- **Gate weight, norms, sinks, position_bias:** BF16
## Next Steps
1. **Get BF16 reference cosine** — determine if the CUTLASS GEMM is correct
2. **If cosine ≈ 1.0:** Problem is upstream (attention, likely o_a_proj). Fix: keep o_a_proj in native BF16
3. **If cosine << 1.0:** Problem is in the CUTLASS GEMM or the activation quantization. Need to debug the kernel itself
4. **Test with SKIP_ATTENTION=1** — bypass attention, feed raw input to MoE. If output improves, confirms attention is the issue