- README: updated NVFP4 coverage table, status, and plan
- CURRENT_BUG.md: full debugging journey, what works, what's next
- Both reflect decision to build our own CuTeDSL kernels
Current State: Building our own NVFP4 kernels
Status: WIP, shared expert CuTeDSL kernel in progress

Date: 2026-05-18
What happened today
Spent the day debugging why vLLM produces empty/garbage output. The journey:
- NaN from layer 0: diagnostic prints showed NaN from the very first layer.
- MoE kernel is fine: the standalone test gives cosine 0.988 with no NaN.
- Root cause: `FlashInferCutlassNvFp4LinearKernel` uses a broken `input_scale`. The checkpoint values cause 3977x amplification during activation quantization, which produces NaN.
- BF16 dequant fix: dequantize NVFP4 weights to BF16 and replace the quant method.
- `process_weights_after_loading` timing: our fix runs inside `load_weights()`, but vLLM's quant method runs after it, so the fix gets overwritten.
- Post-quant hook approach: forward pre-hooks don't fire (torch.compile and model wrappers bypass them).
- Patched `utils.py`: added a `_post_quant_fix()` call at the end of `process_weights_after_loading`. This works: 305 projections are dequantized to BF16.
- Still garbage: even with 183 attention + 122 shared expert projections in BF16, output is still empty.
- Conclusion: vLLM's pipeline has deeper issues. `FlashInferCutlassNvFp4LinearKernel` is untrustworthy on B200 (same class of C++ CUTLASS FP4 bugs we hit with MoE). BF16 dequant doesn't fix it because something else is broken in vLLM's execution path.
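The ordering problem in the list above (a fix applied inside `load_weights()` being overwritten by a later `process_weights_after_loading`) can be illustrated with a small sketch. This is not the actual patch, which edits `utils.py` directly; it uses a monkey-patch wrapper instead, and `DummyQuantMethod`, `run_after`, and the dict-based `layer` are hypothetical stand-ins rather than vLLM APIs.

```python
import functools


def run_after(method, post_step):
    """Wrap `method` so that `post_step` always runs after it returns."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        post_step(self, *args, **kwargs)
        return result
    return wrapper


class DummyQuantMethod:
    """Hypothetical stand-in for a quant method; not a vLLM class."""

    def process_weights_after_loading(self, layer):
        # Simulates the quant method overwriting whatever was set earlier.
        layer["weight"] = "nvfp4-repacked"


def _post_quant_fix(self, layer):
    # Simulates the BF16 dequant fix being re-applied last.
    layer["weight"] = "bf16-dequantized"


DummyQuantMethod.process_weights_after_loading = run_after(
    DummyQuantMethod.process_weights_after_loading, _post_quant_fix
)

# The early fix (as if done inside load_weights()) would be lost without the wrapper.
layer = {"weight": "bf16-dequantized"}
DummyQuantMethod().process_weights_after_loading(layer)
print(layer["weight"])
```

Because the fix is attached to the end of the method that would otherwise overwrite it, it no longer depends on hooks firing during the forward pass.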
Decision: Build our own NVFP4 kernels for shared experts and attention. Same CuTeDSL approach that works for MoE. Stop fighting vLLM's broken kernels.
Confirmed Working
| Component | Kernel | Status |
|---|---|---|
| MoE experts (384/layer) | CuTeDSL ScaledGroupedGemm | ✅ cosine 0.988, cudagraph-safe |
| All NVFP4 weights | Dequant to BF16 | ✅ Valid output in standalone test |
| Full attention weight chain | BF16 matmul | ✅ No NaN, no zeros |
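For reference, the "Dequant to BF16" row can be sketched in plain Python. This is a minimal illustration, not the code used in the standalone test. It assumes the common NVFP4 checkpoint layout: two E2M1 values packed per byte with the low nibble first, one scale per block of 16 elements, and a single global scale (`weight_scale_2`) per tensor. The function names are hypothetical, and the nibble order should be checked against the actual checkpoint.

```python
# Magnitudes representable by FP4 E2M1 (3 magnitude bits, 1 sign bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_VALUES[nibble & 0x7]


def dequant_row(packed, block_scales, global_scale, block_size=16):
    """Dequantize one packed row: value = e2m1 * block_scale * global_scale."""
    out = []
    for byte in packed:
        # Assumption: the even-indexed element lives in the low nibble.
        for nibble in (byte & 0xF, byte >> 4):
            block = len(out) // block_size
            out.append(decode_e2m1(nibble) * block_scales[block] * global_scale)
    return out


row = dequant_row(bytes([0x21] * 8), block_scales=[2.0], global_scale=0.5)
print(row[:4])
```

In a real pipeline the result would be cast to BF16 and the block scales would first be decoded from FP8; both steps are left out here to keep the arithmetic visible.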
In Progress
| Component | Kernel | Status |
|---|---|---|
| Shared experts | CuTeDSL GEMM (1 group) | 🔧 Runner WIP, scale assembly needs fixing |
| Attention projections | CuTeDSL GEMM | 📋 Next after shared experts |
WIP: Shared Expert CuTeDSL Kernel
Files:
- `cutedsl/shared_expert_pipeline.py`: dedicated runner (needs scale assembly fix)
- `tests/test_shared_expert.py`: standalone test
Issue: Reusing the MoE runner with `num_experts=1` fails because the runner's scatter step assumes `hidden_size != HC_DIM`. The MoE runner calls `output.scatter_add_`, which expects expert output of shape `[tokens, hidden_size]`, but the shared expert operates on `HC_DIM` (28672).
Fix needed: A dedicated runner with correct scale assembly for `num_groups=1`. The MoE runner's `_assemble_scales_cudagraph_safe` is the template. For a single group:
- No expert offsets needed
- No scatter needed (all tokens go to the same expert)
- Scale assembly is just: quantize activation → pad to 128-row alignment → Blackwell swizzle
- Simpler than the MoE case
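The pad and swizzle steps of the single-group case above could be sketched roughly as follows. This assumes that "Blackwell swizzle" refers to the 128x4 tiled scale-factor layout described in the cuBLAS documentation for block-scaled matmuls; the layout expected by the CuTeDSL kernel may differ. The activation quantization step is omitted, and all function names are hypothetical rather than taken from the runner.

```python
def ceil_div(a, b):
    return -(-a // b)


def pad_rows(num_tokens, align=128):
    """Round the token count up to the next multiple of `align`."""
    return ceil_div(num_tokens, align) * align


def swizzled_offset(row, col, n_col_blocks):
    """Flat offset of scale (row, col) in a layout built from 128x4 tiles.

    Assumed layout: tiles are stored row-major, 512 entries each; inside a
    tile the offset is (r % 32) * 16 + (r // 32) * 4 + c.
    """
    row_block, r = divmod(row, 128)
    col_block, c = divmod(col, 4)
    tile = row_block * n_col_blocks + col_block
    return tile * 512 + (r % 32) * 16 + (r // 32) * 4 + c


def assemble_scales(scales):
    """Pad a [tokens, k_blocks] scale matrix with zeros and swizzle it.

    With one group there are no expert offsets and no scatter: every
    token row maps straight into the padded buffer.
    """
    rows, cols = len(scales), len(scales[0])
    n_col_blocks = ceil_div(cols, 4)
    out = [0.0] * (pad_rows(rows) * n_col_blocks * 4)
    for i, row in enumerate(scales):
        for j, s in enumerate(row):
            out[swizzled_offset(i, j, n_col_blocks)] = s
    return out


# 40 tokens, 4 scale columns; each entry encodes its own (row, col).
demo = assemble_scales([[float(i * 10 + j) for j in range(4)] for i in range(40)])
print(len(demo), demo[22])
```

A cudagraph-safe version would presumably preallocate the padded buffer at a fixed maximum token count instead of sizing it per call.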
Plan
Phase 1: Shared Expert Kernel (WIP)
- Fix `shared_expert_pipeline.py`: implement scale assembly for num_groups=1
- Test with `test_shared_expert.py`: target cosine ≥ 0.98 vs BF16 reference
- Add cudagraph test
- Wire into vLLM via `DeepseekV4MoE.forward()`
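The acceptance check for Phase 1 (cosine ≥ 0.98 against a BF16 reference, plus the NaN and all-zero failure modes seen earlier) could look something like the sketch below. It works on flat Python lists to stay self-contained; the real test would operate on tensors, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def check_against_reference(out, ref, threshold=0.98):
    """Reject NaN/Inf and all-zero outputs, then apply the cosine threshold."""
    if any(math.isnan(x) or math.isinf(x) for x in out):
        return False
    if not any(out):  # all zeros: the "empty output" failure mode
        return False
    return cosine_similarity(out, ref) >= threshold


print(check_against_reference([1.0, 2.0, 3.0], [1.0, 2.0, 3.1]))
```

Checking for NaN and zeros before the cosine keeps a degenerate output from slipping through as a misleading similarity value.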
Phase 2: Attention NVFP4 Kernel
- Each attention projection is a standard NVFP4 GEMM
- `fused_wqa_wkv` has dual weight_scale_2 (same as MoE gate+up)
- Handle `wo_a`: currently FP8, could stay FP8 or go native NVFP4
- Test each projection individually, then integrate
Phase 3: Clean Up
- Remove all BF16 dequant code
- Remove `vllm/patches/utils.py` patch
- Remove `_post_quant_fix()`
- All NVFP4 through CuTeDSL, no vLLM FlashInfer kernels
Memory Layout
| Component | NVFP4 Size | BF16 Size | Notes |
|---|---|---|---|
| Shared expert (per layer) | 33MB | 66MB | Small, 2GB total |
| Attention (per layer) | ~TBD | ~TBD | 5 projections |
| MoE experts (per layer) | ~TBD | ~TBD | 48 experts, stays NVFP4 |
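To fill in the TBD rows, a rough size estimate can be computed from parameter counts. The sketch below assumes NVFP4 stores 4 bits per weight plus one 1-byte scale per block of 16 weights, and ignores the per-tensor global scale and any padding added by the swizzled scale layout. The 4096x4096 shape is a hypothetical example, not one of the model's projections.

```python
def nvfp4_bytes(num_params, block_size=16):
    """Packed 4-bit weights plus one 1-byte block scale per `block_size` weights."""
    return num_params // 2 + num_params // block_size


def bf16_bytes(num_params):
    """BF16 stores 2 bytes per weight."""
    return 2 * num_params


params = 4096 * 4096  # hypothetical projection shape
print(nvfp4_bytes(params) / 2**20, "MiB NVFP4")
print(bf16_bytes(params) / 2**20, "MiB BF16")
```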