
biondizzle dbaa3d6fe6 Update CURRENT_BUG.md and README with current state

Empty output still happening. Documented what's been tried, what works
standalone, what we don't know, and the plan to bypass vLLM's kernel
selection entirely by calling our runners directly.

2026-05-19 07:05:45 +00:00

6.0 KiB


CURRENT_BUG.md — Empty Output from vLLM

Status: BLOCKED (model loads, serves requests, returns empty content)

Date: 2026-05-19

Symptom

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"/model","stream":true,"max_tokens":10,"temperature":0,
        "messages":[{"role":"user","content":"The capital of France is"}]}'

→ Returns: content="" with finish_reason="length" (10 tokens generated, none of which decode to visible text)

Model loads fine. CUDAGraph captures. Inference runs. But every generated token comes back empty (apparently EOS).
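
To inspect exactly what the server streams back, the chunks can be accumulated programmatically instead of read by eye. This is a minimal sketch that assumes the standard OpenAI-style streaming format (`data: {json}` lines terminated by `data: [DONE]`). `summarize_stream` is a hypothetical helper, not part of vLLM, and the sample lines are hand-written to mimic the symptom rather than captured from the server.

```python
import json


def summarize_stream(lines):
    """Accumulate content and the final finish_reason from SSE lines.

    `lines` is any iterable of decoded text lines from the streaming
    endpoint (for example, the lines printed by `curl -N`).
    """
    content = []
    finish_reason = None
    chunks = 0
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        chunks += 1
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            content.append(delta.get("content") or "")
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return {
        "content": "".join(content),
        "finish_reason": finish_reason,
        "chunks": chunks,
    }


# Hand-written sample mimicking the symptom (not real server output).
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":"length"}]}',
    "",
    "data: [DONE]",
]
summary = summarize_stream(sample)
print(summary)
```

Empty `content` together with `finish_reason="length"` means the server generated up to `max_tokens` without emitting any visible text.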

What's Been Tried

| Attempt | Result | Why it failed |
|---|---|---|
| BF16 dequant all NVFP4 weights | Still garbage | Something else in vLLM pipeline is broken |
| Patch `process_weights_after_loading` | Overwritten by vLLM quant method | Timing issue: our fix runs, then vLLM's quant overwrites it |
| Forward pre-hooks | Never fire | torch.compile + model wrappers bypass them |
| Patch `utils.py` with `_post_quant_fix()` | 305 projections dequantized, still garbage | vLLM's pipeline has deeper issues beyond the linear layers |
| Replace `deepseek_v4_attention.py` with v0.21.0 version | Import errors (`breakable_cudagraph` missing) | Docker image uses v0.20.2rc1, not v0.21.0 stable |
| Replace with Docker image's original | Nuked all our existing patches | The file already had our BF16 wo_a work; dumb mistake |

Confirmed Working (standalone tests on B200)

| Component | Test | Result |
|---|---|---|
| MoE CuTeDSL kernel | `layertest.py` | ✅ cosine 0.988 |
| CUDAGraph | `cudagraph_test.py` | ✅ capture + replay |
| wo_a BMM (BF16) | `test_o_projection_b200.py` | ✅ z amax=0.8, no NaN |
| wo_b NVFP4 GEMM | `test_o_projection_b200.py` | ✅ cosine 0.996 |
| inv RoPE roundtrip | `test_o_projection.py` | ✅ BF16 precision |
| wo_a BMM vs einsum | `test_o_projection.py` | ✅ exact match |
| All NVFP4 weight dequant | standalone | ✅ Valid BF16 output |
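
The cosine figures above compare a kernel's output against a reference. Below is a minimal pure-Python sketch of that metric, assuming both outputs are flattened to 1-D sequences of floats. The actual test scripts presumably use torch and may compute it differently; `cosine_similarity` here is illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length float sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero vector has no direction; report 0.0 rather than divide by zero.
        return 0.0
    return dot / (norm_a * norm_b)


reference = [1.0, 2.0, 3.0, 4.0]
scaled = [2.0, 4.0, 6.0, 8.0]   # same direction, different magnitude
zeros = [0.0, 0.0, 0.0, 0.0]    # e.g. a kernel that silently wrote nothing

print(cosine_similarity(reference, scaled))
print(cosine_similarity(reference, zeros))
```

Cosine similarity ignores magnitude, so an output that is off by a constant factor still scores close to 1.0. Pairing it with an amax or max-absolute-error check is what catches scale bugs.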

The Core Problem

We don't control the execution path. vLLM intercepts our weights and routes them through kernels we can't debug:

- FlashInferCutlassNvFp4LinearKernel: vLLM's built-in NVFP4 linear kernel, with the same class of C++ CUTLASS FP4 bugs we hit with MoE. In stock vLLM it handles ALL attention projections (fused_wqa_wkv, wq_b, wo_b) and shared experts (gate_up_proj, down_proj). Our CuTeDSLNvFp4LinearKernel is registered and, per the logs, selected by init_nvfp4_linear_kernel, but calls still go through vLLM's quantization layer, which may interfere.
- FP8 einsum for wo_a: The attention forward (deepseek_v4_attention.py) does fused_inv_rope_fp8_quant + deepseek_v4_fp8_einsum, which expects wo_a.weight_scale_inv (FP8). Our checkpoint has wo_a as BF16. The has_fp8_weights check in our patched forward should handle this, but we still need to verify that our path is the one actually running.
- MHC (Hyper-connections): Pure PyTorch replacement. Should work. Not verified end-to-end.
- CSA/HCA/SWA: vLLM's Triton kernels for sparse attention. They JIT-compile on first request (the warnings Mike saw). These are vLLM internals, not our code. Should work. Not verified.
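
The wo_a path selection can be made observable by logging which branch is taken. The sketch below assumes the `has_fp8_weights` check keys on the presence of a `weight_scale_inv` attribute; the text above only says the FP8 path expects that attribute, so the real check may differ. `select_wo_a_path` and the stand-in layer objects are hypothetical.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("wo_a_path")


def select_wo_a_path(wo_a):
    """Return "fp8" or "bf16" and log the choice.

    Assumption: an FP8 checkpoint attaches `weight_scale_inv` to wo_a,
    while a BF16 checkpoint leaves it absent or None.
    """
    has_fp8_weights = getattr(wo_a, "weight_scale_inv", None) is not None
    path = "fp8" if has_fp8_weights else "bf16"
    logger.info("wo_a path selected: %s", path)
    return path


# Stand-ins for the real layer objects (plain namespaces, not tensors).
bf16_layer = SimpleNamespace(weight="bf16-weight")
fp8_layer = SimpleNamespace(weight="fp8-weight", weight_scale_inv="scales")

print(select_wo_a_path(bf16_layer))
print(select_wo_a_path(fp8_layer))
```

Making the decision once at load time and logging it there avoids relying on log calls inside a forward that may be compiled or graph-captured.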

What We Don't Know

- Is CuTeDSLNvFp4LinearKernel actually being used? Logs say yes, but we haven't verified the output of a single attention projection inside the container.
- Is the BF16 wo_a path actually running? The has_fp8_weights check should detect our BF16 wo_a, but we haven't confirmed inside the container.
- Is the activation global scale correct? We fixed the double-inversion bug (1.0 / input_global_scale_inv → just input_global_scale_inv), but haven't tested in the container.
- Are the MHC hyper-connections producing valid output? Not tested.
- Is the attention itself (FlashMLA sparse) working on B200? Not tested outside vLLM.
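
Several of these questions reduce to "did this method run, and what did it return?". Since forward pre-hooks reportedly never fire, one option is to wrap the method itself. The sketch below uses a hypothetical `trace_calls` helper and a dummy class standing in for a kernel object. It assumes the wrapped code runs as ordinary Python; compiled or graph-captured regions may skip or inline such wrappers.

```python
import functools


def trace_calls(obj, method_name, records):
    """Wrap obj.method_name so each call appends (name, result) to records.

    Returns the original bound method so the caller can restore it later.
    """
    original = getattr(obj, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        result = original(*args, **kwargs)
        records.append((method_name, result))
        return result

    setattr(obj, method_name, wrapper)
    return original


class DummyKernel:
    """Stand-in for a kernel object; not a real vLLM class."""

    def apply(self, x):
        return [2 * v for v in x]


kernel = DummyKernel()
records = []
original_apply = trace_calls(kernel, "apply", records)

kernel.apply([1, 2, 3])
print(records)

# Restore the untraced method when done; later calls are no longer recorded.
kernel.apply = original_apply
kernel.apply([4])
print(len(records))
```

With real tensors, recording a summary such as amax and a NaN flag instead of the full result keeps the trace small.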

The Decision: Our Own Path

Stop fighting vLLM's kernels. Every time we patch around one, another breaks. The CuTeDSL MoE kernel works (0.988 cosine). The CuTeDSL linear kernel works (0.996 cosine). But vLLM's pipeline keeps routing things through broken paths we can't control.

The plan: Bypass vLLM's kernel selection entirely. Replace the model's forward methods to call our runners directly, not through vLLM's quantization layer. This means:

- Attention projections: Call CuTeDSLNvfp4Linear runners directly from the model forward, not through ColumnParallelLinear/RowParallelLinear, which go through vLLM's quant dispatch.
- Shared experts: Same. Call CuTeDSLNvfp4Linear directly from DeepseekV4MoE.forward().
- MoE: Already doing this (cutedsl_moe.py calls our runner directly).
- wo_a: BF16 BMM, already in our patched attention forward.
- MHC: Already our pure PyTorch replacement.

The key insight: vLLM's ModelOptNvFp4LinearMethod.process_weights_after_loading is the choke point. It converts the weights and routes them to the kernel. We need to intercept BEFORE it runs, grab the raw weights, and store them in our runner format. Then the model forward calls our runner, not vLLM's quant method.
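
The interception described above can be sketched as a wrapper that snapshots the raw weights before the quant method touches them. Everything here is a stand-in: `FakeQuantMethod`, `FakeLayer`, `RAW_WEIGHT_NAMES`, and `install_snapshot_hook` are hypothetical, and the only thing assumed about vLLM is that the quant method exposes `process_weights_after_loading(layer)`.

```python
import copy

# Hypothetical attribute names; the real checkpoint may use different ones.
RAW_WEIGHT_NAMES = ("weight", "weight_scale", "input_scale")


class FakeLayer:
    """Stand-in for a linear layer; lists play the role of tensors."""

    def __init__(self):
        self.weight = [1, 2, 3]
        self.weight_scale = [0.5]
        self.input_scale = [0.25]


class FakeQuantMethod:
    """Stand-in for a quant method that rewrites weights in place."""

    def process_weights_after_loading(self, layer):
        layer.weight = ["converted"]
        layer.weight_scale = ["converted"]


def install_snapshot_hook(quant_cls):
    """Patch quant_cls so raw weights are copied before conversion."""
    original = quant_cls.process_weights_after_loading

    def patched(self, layer):
        # Runs BEFORE the original, so the copy holds untouched values.
        # With real tensors this would be a detach-and-clone, not deepcopy.
        layer.raw_weights = {
            name: copy.deepcopy(getattr(layer, name))
            for name in RAW_WEIGHT_NAMES
            if hasattr(layer, name)
        }
        original(self, layer)

    quant_cls.process_weights_after_loading = patched
    return original


install_snapshot_hook(FakeQuantMethod)
layer = FakeLayer()
FakeQuantMethod().process_weights_after_loading(layer)

print(layer.weight)
print(layer.raw_weights["weight"])
```

A forward that reads only `layer.raw_weights` no longer depends on what the quant method did afterwards, which is the ordering problem the earlier `process_weights_after_loading` patch ran into. Keeping both copies costs extra memory, so a real version would likely need to release one of them.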

Test Infrastructure

- tests/: standalone tests (the CPU-only ones run locally)
- tests/venv/: B200 venv with torch + CuTeDSL + safetensors
- tests/test_o_projection_b200.py: Loads real weights, tests O-projection on B200 (✅ PASS)
- tests/test_o_projection.py: CPU tests for inv RoPE + BMM (✅ PASS)

Rule: Test everything standalone on B200 BEFORE touching the container.

Files That Have Our Patches (DO NOT NUKE)

| File | What it has | DON'T |
|---|---|---|
| `vllm/patches/deepseek_v4.py` | Model patch: BF16 wo_a, weight mapper, NVFP4 config | Don't replace with stock vLLM |
| `vllm/patches/deepseek_v4_attention.py` | BF16 wo_a path with `has_fp8_weights` check, `_apply_inv_rope_bf16`, BMM + all-gather, FP8 fallback | Don't replace with stock vLLM |
| `vllm/patches/layers/mhc.py` | Pure PyTorch MHC (replaces TileLang) | Don't replace |
| `vllm/kernels/linear/nvfp4/cutedsl.py` | `CuTeDSLNvFp4LinearKernel`, our NVFP4 linear kernel | Don't replace |
| `vllm/patches/fused_moe/experts/cutedsl_moe.py` | CuTeDSL MoE backend | Don't replace |
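
Since one of the failed attempts overwrote a patched file, a checksum manifest is one cheap guard against repeating that. This sketch uses only the standard library. `snapshot_hashes` and `changed_files` are hypothetical helpers, and the demo runs on a temporary file rather than the real patch paths.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot_hashes(paths):
    """Map each path (as str) to the SHA-256 hex digest of its contents."""
    return {
        str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest()
        for p in paths
    }


def changed_files(manifest):
    """Return paths that are missing or whose contents no longer match."""
    changed = []
    for path, digest in manifest.items():
        p = Path(path)
        if not p.exists() or hashlib.sha256(p.read_bytes()).hexdigest() != digest:
            changed.append(path)
    return changed


with tempfile.TemporaryDirectory() as tmp:
    patched = Path(tmp) / "deepseek_v4_attention.py"
    patched.write_text("# our BF16 wo_a patch\n")

    manifest = snapshot_hashes([patched])
    unchanged_result = changed_files(manifest)

    # Simulate replacing the file with a stock version.
    patched.write_text("# stock vLLM file\n")
    overwritten_result = changed_files(manifest)

print(unchanged_result)
print(overwritten_result == list(manifest))
```

Running the check before and after any file replacement inside the container would flag an overwritten patch immediately.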
