Empty output still happening. Documented what's been tried, what works standalone, what we don't know, and the plan to bypass vLLM's kernel selection entirely by calling our runners directly.
CURRENT_BUG.md — Empty Output from vLLM
Status: BLOCKED. The model loads, serves requests, and returns empty content.
Date: 2026-05-19
Symptom
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/model","stream":true,"max_tokens":10,"temperature":0,
       "messages":[{"role":"user","content":"The capital of France is"}]}'
→ Returns: content="" finish_reason="length" (generates 10 tokens of nothing → EOS)
Model loads fine. CUDAGraph captures. Inference runs. But every token is EOS.
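To check the symptom from a script rather than by eye, the streamed response can be reduced to its text and finish reason. The sketch below assumes the server emits OpenAI-style server-sent events (`data: {...}` lines terminated by `data: [DONE]`). The helper name `collect_stream` is ours, not part of vLLM.

```python
import json


def collect_stream(lines):
    """Concatenate delta.content from OpenAI-style SSE lines.

    Returns (text, finish_reason). Assumes each event is a single
    'data: <json>' line, which is an assumption about the server output.
    """
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue  # e.g. a usage-only chunk
        choice = choices[0]
        parts.append((choice.get("delta") or {}).get("content") or "")
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason
```

Feeding it the lines of the curl output should reproduce the failure signature: an empty string together with `finish_reason == "length"`.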
What's Been Tried
| Attempt | Result | Why it failed |
|---|---|---|
| BF16 dequant all NVFP4 weights | Still garbage | Something else in vLLM pipeline is broken |
| Patch `process_weights_after_loading` | Overwritten by vLLM quant method | Timing issue: our fix runs, then vLLM's quant overwrites it |
| Forward pre-hooks | Never fire | torch.compile + model wrappers bypass them |
| Patch `utils.py` with `_post_quant_fix()` | 305 projections dequantized, still garbage | vLLM's pipeline has deeper issues beyond the linear layers |
| Replace `deepseek_v4_attention.py` with v0.21.0 version | Import errors (`breakable_cudagraph` missing) | Docker image uses v0.20.2rc1, not v0.21.0 stable |
| Replace with Docker image's original | Nuked all our existing patches | The file already had our BF16 wo_a work — dumb mistake |
Confirmed Working (standalone tests on B200)
| Component | Test | Result |
|---|---|---|
| MoE CuTeDSL kernel | `layertest.py` | ✅ cosine 0.988 |
| CUDAGraph | `cudagraph_test.py` | ✅ capture + replay |
| wo_a BMM (BF16) | `test_o_projection_b200.py` | ✅ z amax=0.8, no NaN |
| wo_b NVFP4 GEMM | `test_o_projection_b200.py` | ✅ cosine 0.996 |
| inv RoPE roundtrip | `test_o_projection.py` | ✅ BF16 precision |
| wo_a BMM vs einsum | `test_o_projection.py` | ✅ exact match |
| All NVFP4 weight dequant | standalone | ✅ Valid BF16 output |
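
The cosine figures in the table compare a kernel's output against a reference. The standalone test scripts are not reproduced here, so the following is only a minimal pure-Python sketch of that metric on flattened outputs, with hypothetical helper names. Cosine similarity ignores overall scale, so a wrong global scale can still score near 1.0, which is why an `amax` comparison is worth running alongside it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length flat sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero output cannot match any direction
    return dot / (norm_a * norm_b)


def amax(values):
    """Largest absolute value; catches scale errors that cosine misses."""
    return max(abs(v) for v in values)


reference = [0.1, -0.4, 0.8]
mis_scaled = [v * 4.0 for v in reference]  # same direction, wrong magnitude
print(cosine_similarity(reference, mis_scaled))  # close to 1 despite the error
print(amax(mis_scaled) / amax(reference))        # exposes the scale mismatch
```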
The Core Problem
We don't control the execution path. vLLM intercepts our weights and routes them through kernels we can't debug:
- `FlashInferCutlassNvFp4LinearKernel`: vLLM's built-in NVFP4 linear kernel. Same class of C++ CUTLASS FP4 bugs we hit with MoE. Used for ALL attention projections (`fused_wqa_wkv`, `wq_b`, `wo_b`) and shared experts (`gate_up_proj`, `down_proj`). Our `CuTeDSLNvFp4LinearKernel` is registered and selected by `init_nvfp4_linear_kernel`, but it still goes through vLLM's quantization layer, which may interfere.
- FP8 einsum for wo_a: The attention forward (`deepseek_v4_attention.py`) does `fused_inv_rope_fp8_quant` + `deepseek_v4_fp8_einsum`. This expects `wo_a.weight_scale_inv` (FP8). Our checkpoint has wo_a as BF16. The `has_fp8_weights` check in our patched forward should handle this, but we need to verify it's actually running our path.
- MHC (Hyper-connections): Pure PyTorch replacement. Should work. Not verified end-to-end.
- CSA/HCA/SWA: vLLM's Triton kernels for sparse attention. They JIT compile on first request (the warnings Mike saw). These are vLLM internals, not our code. Should work. Not verified.
What We Don't Know
- Is `CuTeDSLNvFp4LinearKernel` actually being used? Logs say yes, but we haven't verified the output of a single attention projection inside the container.
- Is the BF16 wo_a path actually running? The `has_fp8_weights` check should detect our BF16 wo_a, but we haven't confirmed inside the container.
- Is the activation global scale correct? We fixed the double-inversion bug (`1.0 / input_global_scale_inv` → just `input_global_scale_inv`), but haven't tested in the container.
- Are the MHC hyper-connections producing valid output? Not tested.
- Is the attention itself (FlashMLA sparse) working on B200? Not tested outside vLLM.
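
The double-inversion bug mentioned above changes the magnitude of the activations by a large factor. A minimal sketch of the arithmetic follows. It assumes the value is applied as a plain multiplier, and the function name is hypothetical.

```python
def double_inversion_error(input_global_scale_inv):
    """Ratio between the buggy multiplier and the intended one.

    Assumption: the kernel wants input_global_scale_inv as-is, and the
    bug passed its reciprocal instead.
    """
    intended = input_global_scale_inv
    buggy = 1.0 / input_global_scale_inv
    return buggy / intended  # equals 1 / input_global_scale_inv ** 2


# With a scale of 0.5 the buggy path is off by a factor of 4;
# the further the scale is from 1.0, the larger the error.
print(double_inversion_error(0.5))
```

An error of this kind grows quadratically as the scale moves away from 1.0, so comparing activation magnitudes inside the container against the standalone run is a cheap way to confirm that the fix took effect.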
The Decision: Our Own Path
Stop fighting vLLM's kernels. Every time we patch around one, another breaks. The CuTeDSL MoE kernel works (0.988 cosine). The CuTeDSL linear kernel works (0.996 cosine). But vLLM's pipeline keeps routing things through broken paths we can't control.
The plan: Bypass vLLM's kernel selection entirely. Replace the model's forward methods to call our runners directly, not through vLLM's quantization layer. This means:
- Attention projections: Call `CuTeDSLNvfp4Linear` runners directly from the model forward, not through `ColumnParallelLinear`/`RowParallelLinear`, which go through vLLM's quant dispatch.
- Shared experts: Same. Call `CuTeDSLNvfp4Linear` directly from `DeepseekV4MoE.forward()`.
- MoE: Already doing this (`cutedsl_moe.py` calls our runner directly).
- wo_a: BF16 BMM, already in our patched attention forward.
- MHC: Already our pure PyTorch replacement.
The key insight: vLLM's ModelOptNvFp4LinearMethod.process_weights_after_loading is the bottleneck. It's the one that converts weights and routes to the kernel. We need to intercept BEFORE it, grab the raw weights, and store them in our runner format. Then the model forward calls our runner, not vLLM's quant method.
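
One way to intercept before the conversion is to wrap the method so that it snapshots the raw attributes first and then delegates to the original. This is a minimal sketch under stated assumptions, not the planned implementation. It assumes the method is called as `process_weights_after_loading(self, layer)`, the attribute names `weight` and `weight_scale` are placeholders, and `FakeQuantMethod`/`FakeLayer` are stand-ins so the example runs without vLLM.

```python
import copy
import functools


def snapshot_before(cls, method_name, store, attr_names=("weight", "weight_scale")):
    """Wrap cls.<method_name>(self, layer) so the selected attributes of
    `layer` are deep-copied into `store` before the original method runs."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, layer, *args, **kwargs):
        store[id(layer)] = {
            name: copy.deepcopy(getattr(layer, name))
            for name in attr_names
            if hasattr(layer, name)
        }
        return original(self, layer, *args, **kwargs)

    setattr(cls, method_name, wrapper)
    return original  # keep a handle so the patch can be undone


# Stand-ins so the sketch runs without vLLM; these are not real vLLM classes.
class FakeQuantMethod:
    def process_weights_after_loading(self, layer):
        layer.weight = "converted-by-quant-method"


class FakeLayer:
    def __init__(self):
        self.weight = [1, 2, 3]
        self.weight_scale = [0.5]


raw_weights = {}
snapshot_before(FakeQuantMethod, "process_weights_after_loading", raw_weights)
layer = FakeLayer()
FakeQuantMethod().process_weights_after_loading(layer)
print(raw_weights[id(layer)]["weight"])  # the pre-conversion value
```

With real tensors, a `.detach().clone()` per attribute would likely be preferable to `copy.deepcopy`. The wrapper also has to be installed before the model is loaded, otherwise the conversion has already run.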
Test Infrastructure
- `tests/`: CPU-only tests (run locally)
- `tests/venv/`: B200 venv with torch + CuTeDSL + safetensors
- `tests/test_o_projection_b200.py`: Loads real weights, tests O-projection (✅ PASS)
- `tests/test_o_projection.py`: CPU tests for inv RoPE + BMM (✅ PASS)
Rule: Test everything standalone on B200 BEFORE touching the container.
Files That Have Our Patches (DO NOT NUKE)
| File | What it has | DON'T |
|---|---|---|
| `vllm/patches/deepseek_v4.py` | Model patch: BF16 wo_a, weight mapper, NVFP4 config | Don't replace with stock vLLM |
| `vllm/patches/deepseek_v4_attention.py` | BF16 wo_a path with `has_fp8_weights` check, `_apply_inv_rope_bf16`, BMM + all-gather, FP8 fallback | Don't replace with stock vLLM |
| `vllm/patches/layers/mhc.py` | Pure PyTorch MHC (replaces TileLang) | Don't replace |
| `vllm/kernels/linear/nvfp4/cutedsl.py` | `CuTeDSLNvFp4LinearKernel`, our NVFP4 linear kernel | Don't replace |
| `vllm/patches/fused_moe/experts/cutedsl_moe.py` | CuTeDSL MoE backend | Don't replace |