nvfp4-megamoe-kernel/CURRENT_BUG.md

# Current Bug: vLLM produces NaN from layer 0

**Status:** Active debugging — BF16 dequant fix in progress
**Date:** 2026-05-18

## Symptom
- vLLM server starts, loads model, but every inference produces NaN logits
- Diagnostic prints show **NaN from layer 0 onward** — no layer ever produces valid output
- Empty content in chat completions, NaN in logprobs

## Root Cause (in progress)

**The attention NVFP4 linear layers produce NaN immediately.**

The attention projections go through vLLM's `FlashInferCutlassNvFp4LinearKernel`, which uses the checkpoint `input_scale` as the activation global scale for `scaled_fp4_quant()`. The checkpoint `input_scale` values appear to be wrong for this use case: the resulting global scale is far too large for the observed activations, so quantization overflows and the output becomes NaN.

### What we've tried

1. ✅ **MoE kernel is NOT the problem**: `test_runner_vllm_style.py` with warmup gs gives cosine similarity 0.988 and no NaN
2. ❌ **Dequant ALL attn projections to BF16**: crashed because `wo_a.weight_scale_inv` was missing (`fp8_einsum` needs it)
3. ❌ **Dequant all except `wo_a` (keep `wo_a` as FP8)**: still NaN from layer 0. `wq_a` and `wkv` don't exist as separate attrs; they're **fused as `fused_wqa_wkv`**, so this attempt likely skipped them
4. ❌ **Changed to dequant `fused_wqa_wkv`**: still NaN from layer 0. Debug prints added to check whether the attrs are actually found.

### Current theory

The BF16 dequant code may not be finding `fused_wqa_wkv` on the attention module, so it silently skips the most important projection. Debug logging added in latest commit to verify.
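
A minimal sketch of the kind of check the debug logging performs, written so that a missing attribute is reported rather than silently skipped. The helper name `report_projection_attrs` and the stand-in module are hypothetical; the attribute names are the ones listed in these notes, and real code would pass the actual attention module instead of a `SimpleNamespace`.

```python
from types import SimpleNamespace

# Projection attribute names taken from the notes in this file.
EXPECTED_ATTRS = ("fused_wqa_wkv", "wq_b", "wo_a", "wo_b")

def report_projection_attrs(attn, expected=EXPECTED_ATTRS):
    """Return (found, missing) names so skipped projections are visible."""
    found = [name for name in expected if getattr(attn, name, None) is not None]
    missing = [name for name in expected if name not in found]
    return found, missing

# Stand-in for a module that only exposes the un-fused names (hypothetical).
fake_attn = SimpleNamespace(wq_a=object(), wkv=object(), wo_a=object())
found, missing = report_projection_attrs(fake_attn)
print("found:", found)
print("missing:", missing)
```

Logging the `missing` list once per layer makes it obvious whether the dequant path ever reached `fused_wqa_wkv`.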

### Attention architecture (DeepSeek V4 MLA)

- `fused_wqa_wkv` — MergedColumnParallelLinear (q_a + kv fused)
- `wq_b` — ColumnParallelLinear (second Q projection after RoPE)
- `wo_a` — ColumnParallelLinear (FP8 via fp8_einsum, weight-only, NO input_scale)
- `wo_b` — ColumnParallelLinear (final output projection)
- `compressor` — already handled (reconstructed to BF16 from checkpoint)

### Why `wo_a` is safe as FP8

`wo_a` uses `fp8_einsum`, which computes `output = fp8_act * fp8_weight * scale`. It's a **weight-only FP8** GEMM in the sense that no checkpoint `input_scale` is involved; the only checkpoint scale it needs is `weight_scale_inv`. The suspected NaN source is `scaled_fp4_quant(x, input_global_scale_inv)` in the other projections.

## Key evidence

- `q_a_proj.input_scale = 0.00025141` → `1/input_scale = 3977.6` → activations with amax of roughly 2 to 8 are scaled by 3977.6 before quantization (to roughly 8,000 to 32,000), a massive overflow
- `q_b_proj.input_scale = 0.00006140` → `1/input_scale = 16287.1` → even worse
- Embedding values: amax=1.27, std=0.09 — very small values that get multiplied by thousands during quantization
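
The numbers above can be sanity-checked with a short sketch. It assumes the usual NVFP4 layout (FP4 E2M1 values with maximum magnitude 6, plus one FP8 E4M3 block scale with maximum finite magnitude 448) and assumes the per-block scale is computed as `block_amax * global_scale / 6`. The exact formula inside `scaled_fp4_quant()` was not verified here, and `block_scale_headroom` is a hypothetical helper.

```python
FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 E2M1
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

def block_scale_headroom(input_scale: float, act_amax: float) -> dict:
    """Estimate whether the per-block FP8 scale fits in E4M3 (assumed formula)."""
    global_scale = 1.0 / input_scale
    # Assumption: the block scale maps the block amax onto the FP4 maximum.
    block_scale = act_amax * global_scale / FP4_E2M1_MAX
    return {
        "global_scale": global_scale,
        "block_scale": block_scale,
        "overflows_e4m3": block_scale > FP8_E4M3_MAX,
    }

for name, input_scale in [("q_a_proj", 0.00025141), ("q_b_proj", 0.00006140)]:
    for amax in (1.27, 2.0, 8.0):
        r = block_scale_headroom(input_scale, amax)
        print(f"{name} amax={amax}: block_scale={r['block_scale']:.1f} "
              f"overflow={r['overflows_e4m3']}")
```

Under these assumptions every amax listed above pushes the block scale past the E4M3 limit for both projections. Whether that saturates or turns into NaN depends on how the kernel converts out-of-range scales, which is not confirmed here.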

## Next steps

1. Check debug logs to see which projections were actually dequantized
2. If `fused_wqa_wkv` wasn't found, fix the attribute path
3. If it was found and dequantized, the NaN source is elsewhere (wo_b? wq_b? something else?)
4. Consider: maybe the NaN is from the **KV cache FP8 quantization** or the **RoPE** implementation

## Docker/Build Notes

- Build: `screen -dmS build bash -c './build_and_run.sh 2>&1 | tee build.log'`
- Currently using `--enforce-eager` + `CLAWMINE_DEBUG=1` for diagnostics
- Don't hit the API with enforce-eager — JIT spikes crash the container
- For real testing: use `--compilation-config '{"cudagraph_mode": "NONE", "custom_ops": ["all"]}'` instead of `--enforce-eager`
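
Hand-quoting JSON inside a shell script is easy to get wrong, so one option is to generate the argument. This is a small sketch that assumes the config is passed to vLLM as a JSON string via `--compilation-config`; how `build_and_run.sh` actually assembles its command line is not shown in these notes.

```python
import json
import shlex

# Settings copied from the note above.
compilation_config = {"cudagraph_mode": "NONE", "custom_ops": ["all"]}

# json.dumps produces valid JSON; shlex.quote makes it safe for a POSIX shell.
flag = "--compilation-config " + shlex.quote(json.dumps(compilation_config))
print(flag)
```

The printed string can be pasted into the server launch command in place of `--enforce-eager`.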