Update CURRENT_BUG.md and README with current state

Empty output still happening. Documented what's been tried, what works standalone, what we don't know, and the plan to bypass vLLM's kernel selection entirely by calling our runners directly.
2026-05-19 07:05:45 +00:00
parent 62abf41b03
commit dbaa3d6fe6
1 changed file with 73 additions and 58 deletions
--- a/CURRENT_BUG.md
+++ b/CURRENT_BUG.md
@@ -1,77 +1,92 @@
-# Current State: Building our own NVFP4 kernels
+# CURRENT_BUG.md — Empty Output from vLLM

-**Status:** WIP — shared expert CuTeDSL kernel in progress
-**Date:** 2026-05-18
+**Status:** BLOCKED — model loads, serves requests, returns empty content
+**Date:** 2026-05-19

-## What happened today
+## Symptom

-Spent the day debugging why vLLM produces empty/garbage output. The journey:
+```
+curl -X POST http://127.0.0.1:8000/v1/chat/completions -H 'Content-Type: application/json' \
+  -d '{"model":"/model","stream":true,"max_tokens":10,"temperature":0,
+        "messages":[{"role":"user","content":"The capital of France is"}]}'

-1. **NaN from layer 0** — diagnostic prints showed NaN from the very first layer
-2. **MoE kernel is fine** — standalone test: cosine 0.988, no NaN
-3. **Root cause: `FlashInferCutlassNvFp4LinearKernel` uses broken `input_scale`** — checkpoint values cause 3977x amplification during activation quantization → NaN
-4. **BF16 dequant fix** — dequantize NVFP4 weights to BF16, replace quant method
-5. **`process_weights_after_loading` timing** — our fix runs inside `load_weights()`, but vLLM's quant method runs AFTER. Fix gets overwritten.
-6. **Post-quant hook approach** — forward pre-hooks don't fire (torch.compile + model wrappers bypass them)
-7. **Patched `utils.py`** — added `_post_quant_fix()` call at end of `process_weights_after_loading`. This works — 305 projections dequantized to BF16.
-8. **Still garbage** — even with 183 attention + 122 shared expert projections in BF16, output is still empty.
-9. **Conclusion: vLLM's pipeline has deeper issues.** The `FlashInferCutlassNvFp4LinearKernel` is untrustworthy on B200 (same class of C++ CUTLASS FP4 bugs we hit with MoE). BF16 dequant doesn't fix it because something else is broken in vLLM's execution path.
+→ Returns: content="" finish_reason="length" (generates 10 tokens of nothing → EOS)
+```

-**Decision: Build our own NVFP4 kernels for shared experts and attention.** Same CuTeDSL approach that works for MoE. Stop fighting vLLM's broken kernels.
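+Model loads fine. CUDAGraph captures. Inference runs. But every generated token comes back as empty content (apparently EOS, or another token that decodes to nothing). `finish_reason="length"` means all 10 tokens were generated.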

-## Confirmed Working
+## What's Been Tried

-| Component | Kernel | Status |
-|-----------|--------|--------|
-| MoE experts (384/layer) | CuTeDSL ScaledGroupedGemm | ✅ cosine 0.988, cudagraph-safe |
-| All NVFP4 weights | Dequant to BF16 | ✅ Valid output in standalone test |
-| Full attention weight chain | BF16 matmul | ✅ No NaN, no zeros |
+| Attempt | Result | Why it failed |
+|---------|--------|---------------|
+| BF16 dequant all NVFP4 weights | Still garbage | Something else in vLLM pipeline is broken |
+| BF16 fix inside `load_weights()` | Overwritten by vLLM quant method | Timing issue: vLLM's `process_weights_after_loading` runs after our fix and overwrites it |
+| Forward pre-hooks | Never fire | torch.compile + model wrappers bypass them |
+| Patch `utils.py` with `_post_quant_fix()` | 305 projections dequantized, still garbage | vLLM's pipeline has deeper issues beyond the linear layers |
+| Replace `deepseek_v4_attention.py` with v0.21.0 version | Import errors (`breakable_cudagraph` missing) | Docker image uses v0.20.2rc1, not v0.21.0 stable |
+| Replace with Docker image's original | Nuked all our existing patches | The file already had our BF16 wo_a work — dumb mistake |

-## In Progress
+## Confirmed Working (standalone tests on B200)

-| Component | Kernel | Status |
-|-----------|--------|--------|
-| Shared experts | CuTeDSL GEMM (1 group) | 🔧 Runner WIP, scale assembly needs fixing |
-| Attention projections | CuTeDSL GEMM | 📋 Next after shared experts |
+| Component | Test | Result |
+|-----------|------|--------|
+| MoE CuTeDSL kernel | `layertest.py` | ✅ cosine 0.988 |
+| CUDAGraph | `cudagraph_test.py` | ✅ capture + replay |
+| wo_a BMM (BF16) | `test_o_projection_b200.py` | ✅ z amax=0.8, no NaN |
+| wo_b NVFP4 GEMM | `test_o_projection_b200.py` | ✅ cosine 0.996 |
+| inv RoPE roundtrip | `test_o_projection.py` | ✅ BF16 precision |
+| wo_a BMM vs einsum | `test_o_projection.py` | ✅ exact match |
+| All NVFP4 weight dequant | standalone | ✅ Valid BF16 output |

-## WIP: Shared Expert CuTeDSL Kernel
+## The Core Problem

-**Files:**
-- `cutedsl/shared_expert_pipeline.py` — dedicated runner (needs scale assembly fix)
-- `tests/test_shared_expert.py` — standalone test
+**We don't control the execution path.** vLLM intercepts our weights and routes them through kernels we can't debug:

-**Issue:** Tried reusing MoE runner with `num_experts=1` — fails because MoE runner's scatter assumes `hidden_size != HC_DIM`. The MoE runner does `output.scatter_add_` which expects expert output shape `[tokens, hidden_size]` but shared expert operates on HC_DIM (28672).
+1. **`FlashInferCutlassNvFp4LinearKernel`** — vLLM's built-in NVFP4 linear kernel, with the same class of C++ CUTLASS FP4 bugs we hit with MoE. Stock vLLM uses it for ALL attention projections (fused_wqa_wkv, wq_b, wo_b) and shared experts (gate_up_proj, down_proj). Our `CuTeDSLNvFp4LinearKernel` is registered and, per the logs, selected by `init_nvfp4_linear_kernel`, but calls still go through vLLM's quantization layer, which may interfere.

-**Fix needed:** Dedicated runner with correct scale assembly for `num_groups=1`. The MoE runner's `_assemble_scales_cudagraph_safe` is the template. For a single group:
-- No expert offsets needed
-- No scatter needed (all tokens go to the same expert)
-- Scale assembly is just: quantize activation → pad to 128-row alignment → Blackwell swizzle
-- Simpler than the MoE case
+2. **FP8 einsum for wo_a** — The attention forward (`deepseek_v4_attention.py`) does `fused_inv_rope_fp8_quant` + `deepseek_v4_fp8_einsum`. This expects `wo_a.weight_scale_inv` (FP8). Our checkpoint has wo_a as BF16. The `has_fp8_weights` check in our patched forward should handle this, but we need to verify it's actually running our path.

-## Plan
+3. **MHC (Hyper-connections)** — Pure PyTorch replacement. Should work. Not verified end-to-end.

-### Phase 1: Shared Expert Kernel (WIP)
-1. Fix `shared_expert_pipeline.py` — implement scale assembly for num_groups=1
-2. Test with `test_shared_expert.py` — target cosine ≥ 0.98 vs BF16 reference
-3. Add cudagraph test
-4. Wire into vLLM via `DeepseekV4MoE.forward()`
+4. **CSA/HCA/SWA** — vLLM's Triton kernels for sparse attention. JIT compiles on first request (the warnings Mike saw). These are vLLM internals, not our code. Should work. Not verified.

-### Phase 2: Attention NVFP4 Kernel
-- Each attention projection is a standard NVFP4 GEMM
-- `fused_wqa_wkv` has dual weight_scale_2 (same as MoE gate+up)
-- Handle `wo_a` — currently FP8, could stay FP8 or go native NVFP4
-- Test each projection individually, then integrate
+## What We Don't Know

-### Phase 3: Clean Up
-- Remove all BF16 dequant code
-- Remove `vllm/patches/utils.py` patch
-- Remove `_post_quant_fix()`
-- All NVFP4 through CuTeDSL, no vLLM FlashInfer kernels
+- **Is `CuTeDSLNvFp4LinearKernel` actually being used?** Logs say yes, but we haven't verified the output of a single attention projection inside the container.
+- **Is the BF16 wo_a path actually running?** The `has_fp8_weights` check should detect our BF16 wo_a, but we haven't confirmed inside the container.
+- **Is the activation global scale correct?** We fixed the double-inversion bug (`1.0 / input_global_scale_inv` → just `input_global_scale_inv`), but haven't tested in the container.
+- **Are the MHC hyper-connections producing valid output?** Not tested.
+- **Is the attention itself (FlashMLA sparse) working on B200?** Not tested outside vLLM.

-## Memory Layout
+## The Decision: Our Own Path

-| Component | NVFP4 Size | BF16 Size | Notes |
-|-----------|-----------|-----------|-------|
-| Shared expert (per layer) | 33MB | 66MB | Small, 2GB total |
-| Attention (per layer) | ~TBD | ~TBD | 5 projections |
-| MoE experts (per layer) | ~TBD | ~TBD | 48 experts, stays NVFP4 |
+**Stop fighting vLLM's kernels.** Every time we patch around one, another breaks. The CuTeDSL MoE kernel works standalone (0.988 cosine). The CuTeDSL linear kernel works standalone (0.996 cosine on wo_b). But vLLM's pipeline keeps routing things through broken paths we can't control.
+
+**The plan:** Bypass vLLM's kernel selection entirely. Replace the model's forward methods to call our runners directly, not through vLLM's quantization layer. This means:
+
+1. **Attention projections** — Call `CuTeDSLNvfp4Linear` runners directly from the model forward, not through `ColumnParallelLinear`/`RowParallelLinear` which go through vLLM's quant dispatch.
+2. **Shared experts** — Same. Call `CuTeDSLNvfp4Linear` directly from `DeepseekV4MoE.forward()`.
+3. **MoE** — Already doing this (`cutedsl_moe.py` calls our runner directly).
+4. **wo_a** — BF16 BMM, already in our patched attention forward.
+5. **MHC** — Already our pure PyTorch replacement.
+
+The key insight: vLLM's `ModelOptNvFp4LinearMethod.process_weights_after_loading` is the choke point. It converts the weights and routes them to the kernel. We need to intercept BEFORE it runs, grab the raw weights, and store them in our runner format. Then the model forward calls our runner, not vLLM's quant method.
+
+## Test Infrastructure
+
+- `tests/` — CPU-only tests (run locally)
+- `tests/venv/` — B200 venv with torch + CuTeDSL + safetensors
+- `tests/test_o_projection_b200.py` — Loads real weights, tests O-projection (✅ PASS)
+- `tests/test_o_projection.py` — CPU tests for inv RoPE + BMM (✅ PASS)
+
+**Rule: Test everything standalone on B200 BEFORE touching the container.**
+
+## Files That Have Our Patches (DO NOT NUKE)
+
+| File | What it has | DON'T |
+|------|-------------|-------|
+| `vllm/patches/deepseek_v4.py` | Model patch: BF16 wo_a, weight mapper, NVFP4 config | Don't replace with stock vLLM |
+| `vllm/patches/deepseek_v4_attention.py` | BF16 wo_a path with `has_fp8_weights` check, `_apply_inv_rope_bf16`, BMM + all-gather, FP8 fallback | Don't replace with stock vLLM |
+| `vllm/patches/layers/mhc.py` | Pure PyTorch MHC (replaces TileLang) | Don't replace |
+| `vllm/kernels/linear/nvfp4/cutedsl.py` | `CuTeDSLNvFp4LinearKernel` — our NVFP4 linear kernel | Don't replace |
+| `vllm/patches/fused_moe/experts/cutedsl_moe.py` | CuTeDSL MoE backend | Don't replace |
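
Note on the open questions under "What We Don't Know": several of them come down to comparing one projection's output inside the container against a BF16 reference, using the same cosine metric as the standalone tests. The helper below is a minimal sketch of that check, not code from this repo. The names `cosine_similarity` and `check_projection` are hypothetical, the 0.98 threshold is an assumption borrowed from the earlier shared-expert target, and inputs are plain flattened float lists (e.g. from `tensor.float().flatten().tolist()`) so it runs without torch.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two flattened outputs of equal length."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero output is itself a failure mode, so report 0.0.
        return 0.0
    return dot / (norm_a * norm_b)


def check_projection(
    name: str,
    actual: Sequence[float],
    reference: Sequence[float],
    min_cosine: float = 0.98,  # assumed threshold, not a value fixed by this repo
) -> Tuple[bool, str]:
    """Return (ok, message) for one projection output vs. its BF16 reference."""
    # Catch NaN/Inf first: cosine on non-finite data is meaningless.
    if any(math.isnan(x) or math.isinf(x) for x in actual):
        return False, f"{name}: non-finite values in output (FAIL)"
    cos = cosine_similarity(actual, reference)
    ok = cos >= min_cosine
    return ok, f"{name}: cosine={cos:.4f} ({'PASS' if ok else 'FAIL'})"


if __name__ == "__main__":
    # Toy values standing in for a dumped projection output and its reference.
    reference = [0.5, -1.0, 2.0, 0.25]
    actual = [0.49, -1.02, 1.97, 0.26]
    print(check_projection("wo_b", actual, reference)[1])
```

Checking for NaN/Inf before computing the cosine keeps the two failure modes seen so far (NaN from layer 0, all-empty output) distinguishable in the log line.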