nvfp4-megamoe-kernel/DEBUG_LOG.md

# NVFP4 MegaMoE Debug Log

## Current State (May 15, 2026)

**Status:** The model produces garbage output. The deterministic prompt "The capital of France is" yields `-W'MSG173 ~SB…abych` instead of "Paris".

## Symptoms

- No NaN/Inf anywhere in the pipeline
- Magnitudes at each stage look reasonable:
  - L1 GEMM output: amax ~8-10
  - SiLU activation: amax ~34-43
  - L2 GEMM output: amax ~17-28
  - Scatter output: amax ~5-15
- FP4 activation quantization round-trip: reconstructed amax ~3.86, igs (input global scale) = 1.4e-3
- All 8 TP ranks produce identical weight shapes after transformation
- Experts have distinct weights and scales (not duplicated)

**Key observation:** The signal is there but buried. "Paris" ranks 2075 of 129280 with logit 9.25, so the model still scores the correct token, but the top logits point at garbage tokens. This suggests a systematic error that preserves magnitude but distorts direction.
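A small sketch of the two measurements behind this observation, in plain Python with toy numbers. The helper names `token_rank`, `amax`, and `cosine` are illustrative and not taken from the codebase. It shows how a rank is read off a logit vector, and that a simple permutation leaves amax unchanged while cosine similarity collapses, which is why the amax checks in the symptom list cannot rule out a layout error.

```python
import math

def token_rank(logits, token_id):
    """1-based rank of token_id when logits are sorted in descending order."""
    target = logits[token_id]
    return 1 + sum(1 for v in logits if v > target)

def amax(xs):
    return max(abs(v) for v in xs)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy logits: the "right" token (id 3) has a healthy logit but is outranked.
logits = [11.0, 10.5, 12.0, 9.25, 3.0]
rank = token_rank(logits, 3)

# A permutation keeps every magnitude statistic but changes the direction.
ref = [8.0, -1.0, 0.5, 2.0]
permuted = [2.0, 0.5, -1.0, 8.0]
same_amax = amax(ref) == amax(permuted)
cos = cosine(ref, permuted)
```

In this toy case amax is identical for both vectors while the cosine drops well below 1, so a per-stage cosine against a trusted reference is the more informative check.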

## Pipeline Trace (per layer, from last inference)

```
[L1-GEMM-OUT]  slots=1 N=6144 amax=8.6250e+00
[L1-SPLIT]     gate amax=7.1250e+00 | up amax=8.6250e+00
[SILU-ACT]     amax=3.4500e+01
[L2-GEMM-OUT]  slots=1 N=7168 amax=1.8500e+01
[SCATTER]      y amax=6.7500e+00 slots=1
```

## Hypotheses Investigated

### 1. ❌ NaN/Inf in GEMM
Ruled out. All outputs finite, no NaN detected at any stage.

### 2. ❌ Weight shape mismatch
Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).

### 3. ❌ Global scale folding precision loss
Previously identified and fixed in commit `da5572f`. Folding the float8 block scale (`block_sf`) and the float32 global scale (`global_sf`) into a single float8 value loses ~25% precision in the low-precision zone of float8. The fix passes the global scales as a per-expert alpha instead of folding them. This did not fix the garbage output.
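The effect can be illustrated with a minimal sketch, assuming "float8" here means E4M3 (the `e4m3fn` variant with maximum 448, consistent with the `448.0` constant used elsewhere in this log). The scale values are toy numbers chosen so that the folded product lands in the subnormal range, and `round_e4m3` is a hypothetical helper, not the project's code.

```python
def e4m3_values():
    """All non-negative finite E4M3 (e4m3fn) values: 3 mantissa bits, bias 7."""
    vals = [(m / 8) * 2.0 ** -6 for m in range(8)]        # zero and subnormals
    for e in range(1, 16):
        for m in range(8):
            if e == 15 and m == 7:
                continue                                   # this encoding is NaN
            vals.append((1 + m / 8) * 2.0 ** (e - 7))
    return vals

E4M3 = e4m3_values()

def round_e4m3(x):
    """Nearest representable value (ties go to the smaller value, a simplification)."""
    x = min(max(x, 0.0), 448.0)
    return min(E4M3, key=lambda v: abs(v - x))

block_sf = 0.8125      # exactly representable in E4M3
global_sf = 0.011      # toy float32 global scale
exact = block_sf * global_sf

folded = round_e4m3(exact)                    # fold first, then store as float8
separate = round_e4m3(block_sf) * global_sf   # keep float8 block scale, apply global as alpha

err_folded = abs(folded - exact) / exact
err_separate = abs(separate - exact) / exact
```

With these toy inputs the folded scale picks up a relative error of several percent, while the separate path is exact. The size of the error depends on where the product falls in the float8 range.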

### 4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)
Previously identified (May 13). The original DeepGEMM kernel crashed and was replaced with a CUTLASS-based implementation (commit history). A standalone test shows cosine=1.0 and MSE=0.0 on random data.

### 5. 🔍 E2M1 packing convention mismatch
**Status: Open.** The CUTLASS kernel expects `nv_float4_t<float_e2m1_t>` packed as 2 nibbles per byte. Our `stage_activation` packs `(nibbles[..., 1] << 4) | nibbles[..., 0]` (even→low, odd→high). The checkpoint weights use the same convention. The standalone test showed cosine 1.0 with this packing, but both A and B were packed the same way, so a packing error shared by both operands would cancel and go undetected.
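The cancellation can be shown with a minimal sketch in plain Python. It uses the standard E2M1 value table (sign in bit 3, magnitudes 0 to 6) and toy nibble codes. The `pack`, `unpack`, and `dot` helpers are illustrative stand-ins, not the project's `stage_activation`.

```python
E2M1_MAGS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode(code):
    """4-bit E2M1 code -> float. Bit 3 is the sign, bits 0-2 index the magnitude."""
    mag = E2M1_MAGS[code & 0x7]
    return -mag if code & 0x8 else mag

def pack(codes):
    """Even index -> low nibble, odd index -> high nibble."""
    return bytes((codes[i + 1] << 4) | codes[i] for i in range(0, len(codes), 2))

def unpack(packed, low_first=True):
    out = []
    for byte in packed:
        lo, hi = byte & 0xF, byte >> 4
        out += [lo, hi] if low_first else [hi, lo]
    return [decode(c) for c in out]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = pack([1, 2, 3, 4, 5, 6, 7, 9])
b = pack([2, 2, 4, 1, 9, 3, 6, 5])

correct = dot(unpack(a, True), unpack(b, True))
both_wrong = dot(unpack(a, False), unpack(b, False))   # same swap applied to A and B
one_wrong = dot(unpack(a, False), unpack(b, True))     # only A is misread
```

Swapping nibbles in both operands only permutes K consistently, so the dot product is unchanged. A test that can catch the mismatch needs one operand decoded by an independent, known-good path.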

### 6. 🔍 Attention output corruption from o_a_proj quantization
**Status: Active investigation.** The checkpoint has `o_a_proj.weight` as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time because the model parameter is declared uint8. This is a lossy conversion of a 64M-parameter matrix that sits right before the MoE. If the quantization error here is significant, it propagates through all 61 MoE layers.

The vLLM weight loader does:
1. Compute per-block amax for the BF16 weight
2. Compute global scale: `amax_max / (6.0 * 448.0)`
3. Compute block scales: `amax / (6.0 * global_scale)` → float8
4. Nearest-neighbor E2M1 quantization
5. Pack 2 nibbles per byte: even→low, odd→high

`o_a_proj` may need to stay in native BF16 and route through a BF16 matmul path instead.
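The five loader steps can be sketched for a single row in plain Python. This is an illustration under stated assumptions, not the vLLM loader: the block size of 16 is inferred from the expert shapes in this log (7168 / 448), the float8 cast of the block scales is omitted, and `quantize_row` / `dequantize_row` are hypothetical names.

```python
E2M1_MAGS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed block size

def quantize_row(row):
    blocks = [row[i:i + BLOCK] for i in range(0, len(row), BLOCK)]
    amax = [max(abs(v) for v in blk) for blk in blocks]        # step 1
    global_scale = max(amax) / (6.0 * 448.0)                   # step 2 (assumes a non-zero row)
    block_sf = [a / (6.0 * global_scale) for a in amax]        # step 3 (float8 cast omitted)
    packed = bytearray()
    for blk, sf in zip(blocks, block_sf):
        scale = sf * global_scale
        codes = []
        for v in blk:
            target = abs(v) / scale if scale > 0 else 0.0
            idx = min(range(8), key=lambda i: abs(E2M1_MAGS[i] - target))  # step 4
            codes.append(idx | (0x8 if v < 0 else 0))
        for i in range(0, BLOCK, 2):
            packed.append((codes[i + 1] << 4) | codes[i])      # step 5: even->low, odd->high
    return bytes(packed), block_sf, global_scale

def dequantize_row(packed, block_sf, global_scale):
    out = []
    for n, byte in enumerate(packed):
        scale = block_sf[(2 * n) // BLOCK] * global_scale
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_MAGS[code & 0x7] * scale
            out.append(-mag if code & 0x8 else mag)
    return out

# Toy data: two blocks, the second with a 4x smaller range.
row = [((i * 37) % 23 - 11) / 7.0 * (1.0 if i < 16 else 0.25) for i in range(32)]
packed, block_sf, global_scale = quantize_row(row)
recon = dequantize_row(packed, block_sf, global_scale)
max_err = max(abs(a - b) for a, b in zip(row, recon))
```

Running the same round trip on the real `o_a_proj` weight and comparing `recon` against the BF16 original would give a direct measure of how lossy the load-time conversion is.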

### 7. 🔍 BF16 reference comparison
**Status: In progress.** Adding a diagnostic that dequantizes FP4 activation + FP4 weights back to BF16, runs a reference matmul, then compares to the NVFP4 GEMM output via cosine similarity. This will isolate whether the CUTLASS kernel is producing correct output given the same quantized inputs.
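A minimal sketch of the comparison step, using nested lists in place of tensors. The operands are toy stand-ins for the dequantized activation and weight, and `matmul` / `cosine` are illustrative helpers. The "bad" output simulates a kernel that reads K in the wrong order, which is only one of several possible failure modes.

```python
import math

def matmul(a, b):
    """Reference (M,K) @ (K,N) on nested lists, accumulated in Python floats."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def cosine(x, y):
    fx = [v for row in x for v in row]
    fy = [v for row in y for v in row]
    dot = sum(p * q for p, q in zip(fx, fy))
    return dot / (math.sqrt(sum(p * p for p in fx)) * math.sqrt(sum(q * q for q in fy)))

# Stand-ins for dequantized operands: x_deq is (M,K), w_deq is (K,N).
x_deq = [[0.5, -1.0, 3.0, 2.0]]
w_deq = [[1.0, 0.5], [2.0, -1.5], [-0.5, 4.0], [6.0, 1.0]]
ref = matmul(x_deq, w_deq)

good_out = [[v * 1.001 for v in row] for row in ref]   # output with a tiny uniform error
bad_out = matmul(x_deq, w_deq[::-1])                   # K traversed in the wrong order

cos_good = cosine(ref, good_out)
cos_bad = cosine(ref, bad_out)
```

Cosine similarity is insensitive to a uniform scale factor, so it would not flag an error such as a doubled global scale. Comparing the amax ratio of the two outputs alongside the cosine covers that case.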

## Key Commits

| Commit | Description |
|--------|-------------|
| `da5572f` | Stop folding global scale into float8 block scales (25% precision loss fix) |
| `d0ed3d8` | Add L2, SiLU, and scatter pipeline prints |
| `995589a` | Add FP4 quantization round-trip diagnostic |
| `c421a66` | Add BF16 reference GEMM + cosine comparison for L1 |
| `2fd55a9` | Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count |

## Bugs Fixed During This Debug Session

### Weight nibble unpack reshape bug (commit `2fd55a9`)

**Bug:** In the BF16 reference diagnostic, `torch.stack([wlo, whi], dim=-1).reshape(w_u8.shape[0], -1)` on a 2D weight of shape `(K_half, N)` = `(3584, 6144)` produced `(3584, 12288)` instead of `(7168, 6144)`. The `-1` folded the stacked nibble axis into the column dimension, doubling N instead of K.

**Fix:** Changed to `.reshape(w_u8.shape[0] * 2, w_u8.shape[1])` to preserve the column (N) dimension and double the row (K) dimension.

**Impact:** Only affected the BF16 reference diagnostic code, not the actual NVFP4 kernel. The CUTLASS kernel receives weights already in the correct packed format.
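One check worth running on the fixed diagnostic: a correct output shape does not by itself guarantee the correct element order. The sketch below emulates the row-major `stack` + `reshape` on small nested lists, assuming packed row `i` holds `k=2i` in the low nibble and `k=2i+1` in the high nibble. The log does not show the final `stack` call, so this is a hypothetical scenario to verify, not a confirmed bug.

```python
def stack_last_then_reshape(lo, hi, rows, cols):
    """Emulates torch.stack([lo, hi], dim=-1).reshape(rows, cols) for 2D lists."""
    flat = [v for r_lo, r_hi in zip(lo, hi) for pair in zip(r_lo, r_hi) for v in pair]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def stack_dim1_then_reshape(lo, hi, rows, cols):
    """Emulates torch.stack([lo, hi], dim=1).reshape(rows, cols) for 2D lists."""
    flat = [v for r_lo, r_hi in zip(lo, hi) for row in (r_lo, r_hi) for v in row]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

K_half, N = 2, 3
# Label each element with the logical position it should land on: lo -> even k, hi -> odd k.
lo = [[f"k{2 * i}n{j}" for j in range(N)] for i in range(K_half)]
hi = [[f"k{2 * i + 1}n{j}" for j in range(N)] for i in range(K_half)]

buggy = stack_last_then_reshape(lo, hi, K_half, 2 * N)        # original reshape(K_half, -1)
shape_only = stack_last_then_reshape(lo, hi, 2 * K_half, N)   # right shape, dim=-1 stack kept
k_major = stack_dim1_then_reshape(lo, hi, 2 * K_half, N)      # rows ordered k0, k1, k2, k3
```

In this emulation, `shape_only` has the expected `(K, N)` shape but mixes K and N positions within each row, while stacking on `dim=1` yields rows in K order. If the reference cosine stays low after the reshape fix, the stack axis is a cheap thing to rule out.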

### igs double-count in reference (commit `2fd55a9`)

**Bug:** The BF16 reference multiplied by `igs` (input global scale) in `x_bf16` AND again in `ref_out = ref_out * igs`.

**Fix:** Removed the final `ref_out * igs`, since `igs` is already included via `x_bf16 = x_deq * sf_exp * igs`.

**Impact:** Only affected the BF16 reference diagnostic, not the kernel.

## Architecture Notes

### DeepSeek-V4 MoE Layer Forward Pass
```
residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x)                          ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)

residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x)                           ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)
```

### NVFP4 MoE Pipeline
```
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
```
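The activation step between the two GEMMs can be sketched in plain Python. The split order (gate in the first half of the L1 output, up in the second) is an assumption made for illustration; the log does not state whether the real layout is concatenated or interleaved. The amax values are taken from the pipeline trace earlier in this log.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def swiglu(gate_up):
    """Assumes gate occupies the first half and up the second half of the L1 output."""
    half = len(gate_up) // 2
    gate, up = gate_up[:half], gate_up[half:]
    return [silu(g) * u for g, u in zip(gate, up)]

activated = swiglu([0.0, 1.0, 2.0, 3.0])   # gate=[0, 1], up=[2, 3]

# Sanity bound: |SiLU(g)| <= |g|, so amax(activated) <= amax(gate) * amax(up).
gate_amax, up_amax, act_amax = 7.125, 8.625, 34.5   # values from the pipeline trace
bound = gate_amax * up_amax
within_bound = act_amax <= bound
```

The traced activation amax sits inside this bound, so the magnitudes are consistent with the formula. The bound says nothing about whether gate and up were split in the right order, which would again preserve magnitude while changing direction.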

### Checkpoint Layers (layer 0)
- **MoE experts 0-210, 212-255:** gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
- **Expert 211:** shared expert, gate_proj + up_proj only (no down_proj)
- **o_a_proj.weight:** BF16 (16384, 4096) — NOT quantized by ModelOpt
- **o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor:** NVFP4
- **Gate weight, norms, sinks, position_bias:** BF16

## Next Steps

1. **Get BF16 reference cosine** — determine if the CUTLASS GEMM is correct
2. **If cosine ≈ 1.0:** Problem is upstream (attention, likely o_a_proj). Fix: keep o_a_proj in native BF16
3. **If cosine << 1.0:** Problem is in the CUTLASS GEMM or the activation quantization. Need to debug the kernel itself
4. **Test with SKIP_ATTENTION=1** — bypass attention, feed raw input to MoE. If output improves, confirms attention is the issue