Files

biondizzle e8b289e30d WIP: CuTeDSL shared expert kernel

Dedicated runner (shared_expert_pipeline.py) and test (test_shared_expert.py).
Tried reusing MoE runner with 1 expert — fails because MoE runner assumes
hidden_size != HC_DIM for scatter. Need dedicated runner with correct
scale assembly. Will continue tomorrow.

2026-05-18 20:02:19 +00:00

5.2 KiB

Raw Blame History

Current Bug: vLLM produces empty/garbage output

Status: Debugging, plan revised — building our own kernels Date: 2026-05-18

Symptom

vLLM server starts, loads model, processes requests (200 OK)
Chat completions return content: "" with finish_reason: "length"
20 completion tokens generated but all produce empty/NaN logits

What we know

✅ Confirmed working

MoE expert CuTeDSL kernel — cosine 0.988, cudagraph-safe, production-ready
All NVFP4 weights dequantize correctly to BF16 — standalone test proves it
Full attention weight chain produces valid output (embed → q_a → norm → q_b → o_a → o_b)
Post-quant fix runs at the right time — patched utils.py calls _post_quant_fix() after process_weights_after_loading
183 attention projections dequantized to BF16 (61 layers × 3 projs)

❌ Still broken

Even with BF16 attention, model produces empty output
Shared experts also use FlashInferCutlassNvFp4LinearKernel with broken input_scale
Added shared experts to BF16 dequant fix (122 more projections) — testing in progress

🔥 The real problem: vLLM's NVFP4 kernels are untrustworthy on B200

We spent the entire day fighting vLLM's FlashInferCutlassNvFp4LinearKernel:

Broken input_scale → NaN
process_weights_after_loading timing issues
Forward hooks not firing due to torch.compile/model wrappers
Dequant-to-BF16 workaround is a bandaid that loses NVFP4 benefits

We could have built our own kernel in the time we spent debugging theirs.

Revised Plan: Our Own NVFP4 Kernels

Goal: Replace ALL vLLM NVFP4 kernel paths with our own CuTeDSL implementations. No more FlashInferCutlassNvFp4LinearKernel. No more BF16 dequant workarounds.

Phase 0: Get the BF16 fix working (current)

Post-quant BF16 dequant for attention + shared experts
Verify the model produces actual text output
This is the "make it work" step

Phase 1: CuTeDSL Shared Expert Kernel

Priority: High — shared experts are the last NVFP4 component using vLLM's broken kernel

Files to create:

cutedsl/shared_expert_pipeline.py — L1 GEMM → SiLU → re-quant → L2 GEMM
- Same pattern as MoE but simpler: no routing, no topk, no scatter
- gate_up_proj already stacked (same as MoE L1)
- down_proj same as MoE L2
vllm/nvfp4_shared_expert.py — runner class
- Cudagraph-safe (pre-allocated buffers)
- Warmup-based gs computation (same as MoE)
- Called from DeepseekV4MoE.forward() for shared expert path
tests/test_shared_expert.py — standalone test
- Load shared expert weights from checkpoint
- CuTeDSL vs BF16 reference (cosine)
- Cudagraph test

Why it's easy: Shared experts are literally MoE with 1 expert and no routing. The CuTeDSL ScaledGroupedGemmKernel with num_groups=1 is just a regular GEMM.

Phase 2: CuTeDSL Attention Kernel

Priority: High — attention is the biggest remaining NVFP4 component

Components to handle:

fused_wqa_wkv — MergedColumnParallelLinear (q_a + kv fused)
wq_b — ColumnParallelLinear (second Q projection)
wo_a — currently FP8 via fp8_einsum
wo_b — ColumnParallelLinear (output projection)

Design options:

Separate GEMMs — one CuTeDSL GEMM per projection, simplest
Fused attention GEMM — batch all projections together (more complex, more speed)

Recommended: Start with option 1. Each projection is just a standard NVFP4 GEMM. No need to fuse. We can optimize later.