nvfp4-megamoe-kernel/CURRENT_BUG.md
biondizzle dbaa3d6fe6 Update CURRENT_BUG.md and README with current state
Empty output still happening. Documented what's been tried, what works
standalone, what we don't know, and the plan to bypass vLLM's kernel
selection entirely by calling our runners directly.
2026-05-19 07:05:45 +00:00


CURRENT_BUG.md — Empty Output from vLLM

Status: BLOCKED (model loads, serves requests, returns empty content)

Date: 2026-05-19

Symptom

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/model","stream":true,"max_tokens":10,"temperature":0,
        "messages":[{"role":"user","content":"The capital of France is"}]}'

→ Returns: content="" finish_reason="length" (generates 10 tokens of nothing → EOS)

Model loads fine. CUDAGraph captures. Inference runs. But every token is EOS.
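
To make the symptom check repeatable, the streamed response can be reduced to the two fields quoted above. This is a minimal sketch, assuming the server emits OpenAI-style `data: {...}` SSE lines terminated by `data: [DONE]`. The helper name `summarize_stream` and the `sample` lines are hypothetical and only illustrate the shape of the data.

```python
import json


def summarize_stream(lines):
    """Collect the generated text and the last finish_reason from SSE lines."""
    pieces = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data line
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            pieces.append(delta.get("content") or "")
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(pieces), finish_reason


# Hypothetical sample shaped like the failing response described above.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":"length"}]}',
    "data: [DONE]",
]

text, reason = summarize_stream(sample)
print(repr(text), reason)
```

Feeding it the saved output of the curl command (one line per list element) gives a quick pass/fail signal: a healthy run should return non-empty text.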

What's Been Tried

| Attempt | Result | Why it failed |
|---|---|---|
| BF16 dequant all NVFP4 weights | Still garbage | Something else in vLLM pipeline is broken |
| Patch process_weights_after_loading | Overwritten by vLLM quant method | Timing issue: our fix runs, then vLLM's quant overwrites it |
| Forward pre-hooks | Never fire | torch.compile + model wrappers bypass them |
| Patch utils.py with _post_quant_fix() | 305 projections dequantized, still garbage | vLLM's pipeline has deeper issues beyond the linear layers |
| Replace deepseek_v4_attention.py with v0.21.0 version | Import errors (breakable_cudagraph missing) | Docker image uses v0.20.2rc1, not v0.21.0 stable |
| Replace with Docker image's original | Nuked all our existing patches | The file already had our BF16 wo_a work; dumb mistake |

Confirmed Working (standalone tests on B200)

| Component | Test | Result |
|---|---|---|
| MoE CuTeDSL kernel | layertest.py | cosine 0.988 |
| CUDAGraph | cudagraph_test.py | capture + replay |
| wo_a BMM (BF16) | test_o_projection_b200.py | z amax=0.8, no NaN |
| wo_b NVFP4 GEMM | test_o_projection_b200.py | cosine 0.996 |
| inv RoPE roundtrip | test_o_projection.py | BF16 precision |
| wo_a BMM vs einsum | test_o_projection.py | exact match |
| All NVFP4 weight dequant | standalone | Valid BF16 output |
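
The results above report three kinds of numbers: cosine similarity against a reference, amax, and a NaN check. The document does not say how the tests compute them, so the following is only a dependency-free sketch over flat lists of floats. The function name `compare_outputs` is hypothetical, and the real tests presumably operate on torch tensors instead.

```python
import math


def compare_outputs(out, ref):
    """Return cosine similarity, max absolute value, and a NaN flag for `out`."""
    if len(out) != len(ref):
        raise ValueError("out and ref must have the same length")
    has_nan = any(math.isnan(x) for x in out)
    dot = sum(a * b for a, b in zip(out, ref))
    norm_out = math.sqrt(sum(a * a for a in out))
    norm_ref = math.sqrt(sum(b * b for b in ref))
    denom = norm_out * norm_ref
    # A zero or NaN norm makes the ratio meaningless; report 0.0 in that case.
    cosine = dot / denom if denom > 0 else 0.0
    amax = max((abs(x) for x in out), default=0.0)
    return {"cosine": cosine, "amax": amax, "has_nan": has_nan}


# Hypothetical values: a kernel output that is close to its reference.
print(compare_outputs([0.5, -0.8, 0.1], [0.5, -0.79, 0.12]))
```

Cosine similarity ignores overall magnitude, so a scale error can hide behind a high cosine. Reporting amax alongside it, as the wo_a row does, covers that gap.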

The Core Problem

We don't control the execution path. vLLM intercepts our weights and routes them through kernels we can't debug:

  1. FlashInferCutlassNvFp4LinearKernel — vLLM's built-in NVFP4 linear kernel. Same class of C++ CUTLASS FP4 bugs we hit with MoE. Used for ALL attention projections (fused_wqa_wkv, wq_b, wo_b) and shared experts (gate_up_proj, down_proj). Our CuTeDSLNvFp4LinearKernel is registered and selected by init_nvfp4_linear_kernel, but it's still going through vLLM's quantization layer which may interfere.

  2. FP8 einsum for wo_a — The attention forward (deepseek_v4_attention.py) does fused_inv_rope_fp8_quant + deepseek_v4_fp8_einsum. This expects wo_a.weight_scale_inv (FP8). Our checkpoint has wo_a as BF16. The has_fp8_weights check in our patched forward should handle this, but we need to verify it's actually running our path.

  3. MHC (Hyper-connections) — Pure PyTorch replacement. Should work. Not verified end-to-end.

  4. CSA/HCA/SWA — vLLM's Triton kernels for sparse attention. JIT compiles on first request (the warnings Mike saw). These are vLLM internals, not our code. Should work. Not verified.
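
For item 2, one cheap way to confirm which wo_a path is taken is to count the branch decisions. The sketch below is an assumption about what a `has_fp8_weights`-style check might look like (presence of `weight_scale_inv`); the actual check in the patched forward may differ. `select_wo_a_path`, `path_counts`, and the placeholder modules are hypothetical.

```python
from collections import Counter
from types import SimpleNamespace

# Records how often each branch was chosen, so the choice can be inspected later.
path_counts = Counter()


def has_fp8_weights(module):
    # Assumption: an FP8 checkpoint attaches weight_scale_inv; a BF16 one does not.
    return getattr(module, "weight_scale_inv", None) is not None


def select_wo_a_path(module):
    """Pick the wo_a path and record the decision."""
    path = "fp8_einsum" if has_fp8_weights(module) else "bf16_bmm"
    path_counts[path] += 1
    return path


# Placeholder modules standing in for the two checkpoint layouts.
bf16_wo_a = SimpleNamespace(weight="bf16 weight placeholder")
fp8_wo_a = SimpleNamespace(weight="fp8 weight placeholder", weight_scale_inv="scales")

print(select_wo_a_path(bf16_wo_a))
print(select_wo_a_path(fp8_wo_a))
print(dict(path_counts))
```

A Python-side counter only increments when Python actually executes the branch. During CUDA graph replay the captured kernels run without re-entering Python, so the counts reflect warm-up and capture, not every decode step.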

What We Don't Know

  • Is CuTeDSLNvFp4LinearKernel actually being used? Logs say yes, but we haven't verified the output of a single attention projection inside the container.
  • Is the BF16 wo_a path actually running? The has_fp8_weights check should detect our BF16 wo_a, but we haven't confirmed inside the container.
  • Is the activation global scale correct? We fixed the double-inversion bug (1.0 / input_global_scale_inv → just input_global_scale_inv), but haven't tested in the container.
  • Are the MHC hyper-connections producing valid output? Not tested.
  • Is the attention itself (FlashMLA sparse) working on B200? Not tested outside vLLM.
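
The double-inversion item can be made concrete with a little arithmetic. This sketch assumes, as the fix implies, that the kernel expects `input_global_scale_inv` exactly as stored. The function name and the example value 0.25 are hypothetical.

```python
def scale_error_factor(input_global_scale_inv):
    """Ratio between the buggy scale and the fixed scale."""
    buggy = 1.0 / input_global_scale_inv  # inverts a value that is already inverted
    fixed = input_global_scale_inv        # passes the stored value through unchanged
    return buggy / fixed


# With a stored inverse scale of 0.25, the buggy value is 4.0 instead of 0.25.
print(scale_error_factor(0.25))
# With a stored value of exactly 1.0 the bug is invisible.
print(scale_error_factor(1.0))
```

The error grows as the square of the underlying scale, so a test with a scale near 1.0 would pass while real weights fail. A container check should use the checkpoint's actual scale values.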

The Decision: Our Own Path

Stop fighting vLLM's kernels. Every time we patch around one, another breaks. The CuTeDSL MoE kernel works standalone (0.988 cosine). The CuTeDSL linear kernel works standalone (0.996 cosine). But vLLM's pipeline keeps routing things through broken paths we can't control.

The plan: Bypass vLLM's kernel selection entirely. Replace the model's forward methods to call our runners directly, not through vLLM's quantization layer. This means:

  1. Attention projections — Call CuTeDSLNvfp4Linear runners directly from the model forward, not through ColumnParallelLinear/RowParallelLinear which go through vLLM's quant dispatch.
  2. Shared experts — Same. Call CuTeDSLNvfp4Linear directly from DeepseekV4MoE.forward().
  3. MoE — Already doing this (cutedsl_moe.py calls our runner directly).
  4. wo_a — BF16 BMM, already in our patched attention forward.
  5. MHC — Already our pure PyTorch replacement.

The key insight: vLLM's ModelOptNvFp4LinearMethod.process_weights_after_loading is the bottleneck. It's the one that converts weights and routes to the kernel. We need to intercept BEFORE it, grab the raw weights, and store them in our runner format. Then the model forward calls our runner, not vLLM's quant method.
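
The interception idea can be sketched as a wrapper installed on the quant method class. Everything here is a stand-in: `FakeQuantMethod`, `FakeLayer`, `build_runner`, and the `our_runner` attribute are hypothetical, and the sketch assumes the real method takes a single `layer` argument and mutates it in place.

```python
import functools


class FakeQuantMethod:
    """Hypothetical stand-in for a quant method; not vLLM's real class."""

    def process_weights_after_loading(self, layer):
        # Simulates an in-place conversion that destroys the raw weight layout.
        layer.weight = f"converted({layer.weight})"


class FakeLayer:
    def __init__(self, weight):
        self.weight = weight


def build_runner(raw_weight):
    # Hypothetical: a real version would repack raw_weight into the runner format.
    return lambda x: f"runner[{raw_weight}]({x})"


def capture_raw_weights_before(cls, make_runner):
    """Wrap cls.process_weights_after_loading to snapshot raw weights first."""
    original = cls.process_weights_after_loading

    @functools.wraps(original)
    def wrapped(self, layer):
        # Build our runner from the raw weights before the original touches them.
        layer.our_runner = make_runner(layer.weight)
        return original(self, layer)

    cls.process_weights_after_loading = wrapped
    return original  # returned so the patch can be undone


capture_raw_weights_before(FakeQuantMethod, build_runner)
layer = FakeLayer("raw_nvfp4")
FakeQuantMethod().process_weights_after_loading(layer)
print(layer.weight)
print(layer.our_runner("x"))
```

Running the snapshot before the original avoids the ordering problem seen in the earlier attempt, where our fix ran first and was then overwritten. Whether the original conversion should still run afterwards depends on whether anything else reads the converted weights.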

Test Infrastructure

  • tests/: CPU-only tests (run locally)
  • tests/venv/: B200 venv with torch + CuTeDSL + safetensors
  • tests/test_o_projection_b200.py: Loads real weights, tests O-projection (PASS)
  • tests/test_o_projection.py: CPU tests for inv RoPE + BMM (PASS)

Rule: Test everything standalone on B200 BEFORE touching the container.

Files That Have Our Patches (DO NOT NUKE)

| File | What it has | DON'T |
|---|---|---|
| vllm/patches/deepseek_v4.py | Model patch: BF16 wo_a, weight mapper, NVFP4 config | Don't replace with stock vLLM |
| vllm/patches/deepseek_v4_attention.py | BF16 wo_a path with has_fp8_weights check, _apply_inv_rope_bf16, BMM + all-gather, FP8 fallback | Don't replace with stock vLLM |
| vllm/patches/layers/mhc.py | Pure PyTorch MHC (replaces TileLang) | Don't replace |
| vllm/kernels/linear/nvfp4/cutedsl.py | CuTeDSLNvFp4LinearKernel, our NVFP4 linear kernel | Don't replace |
| vllm/patches/fused_moe/experts/cutedsl_moe.py | CuTeDSL MoE backend | Don't replace |