Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py). Tried reusing MoE runner with 1 expert — fails because MoE runner assumes hidden_size != HC_DIM for scatter. Need dedicated runner with correct scale assembly. Will continue tomorrow.
5.2 KiB
Current Bug: vLLM produces empty/garbage output
Status: Debugging, plan revised — building our own kernels Date: 2026-05-18
Symptom
- vLLM server starts, loads model, processes requests (200 OK)
- Chat completions return
content: ""withfinish_reason: "length" - 20 completion tokens generated but all produce empty/NaN logits
What we know
✅ Confirmed working
- MoE expert CuTeDSL kernel — cosine 0.988, cudagraph-safe, production-ready
- All NVFP4 weights dequantize correctly to BF16 — standalone test proves it
- Full attention weight chain produces valid output (embed → q_a → norm → q_b → o_a → o_b)
- Post-quant fix runs at the right time — patched
utils.pycalls_post_quant_fix()afterprocess_weights_after_loading - 183 attention projections dequantized to BF16 (61 layers × 3 projs)
❌ Still broken
- Even with BF16 attention, model produces empty output
- Shared experts also use
FlashInferCutlassNvFp4LinearKernelwith brokeninput_scale - Added shared experts to BF16 dequant fix (122 more projections) — testing in progress
🔥 The real problem: vLLM's NVFP4 kernels are untrustworthy on B200
We spent the entire day fighting vLLM's FlashInferCutlassNvFp4LinearKernel:
- Broken
input_scale→ NaN process_weights_after_loadingtiming issues- Forward hooks not firing due to torch.compile/model wrappers
- Dequant-to-BF16 workaround is a bandaid that loses NVFP4 benefits
We could have built our own kernel in the time we spent debugging theirs.
Revised Plan: Our Own NVFP4 Kernels
Goal: Replace ALL vLLM NVFP4 kernel paths with our own CuTeDSL implementations. No more FlashInferCutlassNvFp4LinearKernel. No more BF16 dequant workarounds.
Phase 0: Get the BF16 fix working (current)
- Post-quant BF16 dequant for attention + shared experts
- Verify the model produces actual text output
- This is the "make it work" step
Phase 1: CuTeDSL Shared Expert Kernel
Priority: High — shared experts are the last NVFP4 component using vLLM's broken kernel
Files to create:
cutedsl/shared_expert_pipeline.py— L1 GEMM → SiLU → re-quant → L2 GEMM- Same pattern as MoE but simpler: no routing, no topk, no scatter
gate_up_projalready stacked (same as MoE L1)down_projsame as MoE L2
vllm/nvfp4_shared_expert.py— runner class- Cudagraph-safe (pre-allocated buffers)
- Warmup-based gs computation (same as MoE)
- Called from
DeepseekV4MoE.forward()for shared expert path
tests/test_shared_expert.py— standalone test- Load shared expert weights from checkpoint
- CuTeDSL vs BF16 reference (cosine)
- Cudagraph test
Why it's easy: Shared experts are literally MoE with 1 expert and no routing. The CuTeDSL ScaledGroupedGemmKernel with num_groups=1 is just a regular GEMM.
Phase 2: CuTeDSL Attention Kernel
Priority: High — attention is the biggest remaining NVFP4 component
Components to handle:
fused_wqa_wkv— MergedColumnParallelLinear (q_a + kv fused)wq_b— ColumnParallelLinear (second Q projection)wo_a— currently FP8 via fp8_einsumwo_b— ColumnParallelLinear (output projection)
Design options:
- Separate GEMMs — one CuTeDSL GEMM per projection, simplest
- Fused attention GEMM — batch all projections together (more complex, more speed)
Recommended: Start with option 1. Each projection is just a standard NVFP4 GEMM. No need to fuse. We can optimize later.
Files to create:
cutedsl/attention_pipeline.py— NVFP4 GEMMs for each attention projectionvllm/nvfp4_attention.py— runner class- Handles q_a_proj, kv_proj, q_b_proj, o_a_proj, o_b_proj
- Cudagraph-safe
- Warmup gs for each projection
tests/test_attention_nvfp4.py— standalone test
Challenge: fused_wqa_wkv has TWO weight_scale_2 values (one for q_a, one for kv). Need to handle dual global scales (same pattern as MoE gate+up with different gs).
Phase 3: Clean up
- Remove all BF16 dequant code
- Remove
vllm/patches/utils.pypatch - Remove
_post_quant_fix()method - All NVFP4 goes through our CuTeDSL kernels
- BF16 only where it must be (SiLU activation, final scatter, embeddings)
NVFP4 Kernel Coverage (Target)
| Component | Kernel | Status |
|---|---|---|
| MoE experts (L1+L2) | CuTeDSL ScaledGroupedGemm | ✅ Working |
| Shared experts (L1+L2) | CuTeDSL standard GEMM | 🔧 Phase 1 |
| Attention projections | CuTeDSL standard GEMM | 🔧 Phase 2 |
| wo_a | CuTeDSL or keep FP8 | 🔧 Phase 2 |
| Compressor | BF16 (small, not worth it) | ✅ Done |
| KV cache | FP8 (vLLM, not our concern) | ✅ Works |
Config values
| Parameter | Value |
|---|---|
| head_dim | 512 |
| num_attention_heads | 128 |
| num_key_value_heads | 1 |
| q_lora_rank | 1536 |
| qk_rope_head_dim | 64 |
| o_lora_rank | 1024 |
| hc_mult | 4 |
| n_routed_experts | 384 (48 per EP rank) |
| shared expert gate_proj | [3072, 3584] = 11MB NVFP4 / 22MB BF16 |
| shared expert up_proj | [3072, 3584] = 11MB NVFP4 / 22MB BF16 |
| shared expert down_proj | [7168, 1536] = 11MB NVFP4 / 22MB BF16 |
| shared expert total | 33MB NVFP4 / 66MB BF16 per layer, ~2GB / ~4GB total |