biondizzle b3451c74f8 Update README and CURRENT_BUG.md with current state

- README: updated NVFP4 coverage table, status, and plan
- CURRENT_BUG.md: full debugging journey, what works, what's next
- Both reflect decision to build our own CuTeDSL kernels

2026-05-18 20:05:03 +00:00

Current State: Building our own NVFP4 kernels

Status: WIP (shared expert CuTeDSL kernel in progress)
Date: 2026-05-18

What happened today

Spent the day debugging why vLLM produces empty/garbage output. The journey:

NaN from layer 0: diagnostic prints showed NaN in the output of the very first layer.
MoE kernel is fine: the standalone test gives cosine 0.988 with no NaN.
Root cause of the NaN: FlashInferCutlassNvFp4LinearKernel uses a broken input_scale. The checkpoint values cause a 3977x amplification during activation quantization, which produces NaN.
BF16 dequant fix: dequantize the NVFP4 weights to BF16 and replace the quant method.
process_weights_after_loading timing: our fix runs inside load_weights(), but vLLM's quant method runs after it, so the fix gets overwritten.
Post-quant hook approach: forward pre-hooks don't fire (torch.compile and the model wrappers bypass them).
Patched utils.py: added a _post_quant_fix() call at the end of process_weights_after_loading. This works: 305 projections are dequantized to BF16.
Still garbage: even with 183 attention + 122 shared expert projections in BF16, the output is still empty.
Conclusion: vLLM's pipeline has deeper issues. FlashInferCutlassNvFp4LinearKernel is untrustworthy on B200 (the same class of C++ CUTLASS FP4 bugs we hit with MoE), and BF16 dequant alone doesn't fix the output because something else is broken in vLLM's execution path.
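
The ordering problem in the timing and patch steps above can be reproduced in miniature. The sketch below is only an illustration: `Layer` and the string-valued `weight` are hypothetical stand-ins, and the function bodies do not reflect vLLM's real code. It shows why a fix applied at load time is lost, and why calling the fix at the end of the post-load step sticks.

```python
class Layer:
    """Hypothetical stand-in for a linear layer; weight is just a label here."""
    def __init__(self):
        self.weight = "nvfp4"


def load_weights(layer):
    # Early fix: applied at load time, before the quant method has run.
    layer.weight = "bf16"


def process_weights_after_loading(layer):
    # Stand-in for the quant method's post-load step, which rewrites the weight.
    layer.weight = "nvfp4-repacked"


def _post_quant_fix(layer):
    # The same fix, meant to run after the quant method is done.
    layer.weight = "bf16"


def patched_process_weights_after_loading(layer):
    # Patched variant: run the original step, then apply the fix last.
    process_weights_after_loading(layer)
    _post_quant_fix(layer)


early = Layer()
load_weights(early)
process_weights_after_loading(early)          # overwrites the early fix

late = Layer()
load_weights(late)
patched_process_weights_after_loading(late)   # fix runs last and survives

print(early.weight, late.weight)
```

Whichever step writes the weight last wins, so the fix has to be called from inside the post-load step rather than before it.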

Decision: Build our own NVFP4 kernels for shared experts and attention. Same CuTeDSL approach that works for MoE. Stop fighting vLLM's broken kernels.

Confirmed Working

Component	Kernel	Status
MoE experts (384/layer)	CuTeDSL ScaledGroupedGemm	✅ cosine 0.988, cudagraph-safe
All NVFP4 weights	Dequant to BF16	✅ Valid output in standalone test
Full attention weight chain	BF16 matmul	✅ No NaN, no zeros
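
For reference, the "Dequant to BF16" row can be sketched as plain arithmetic. This is a minimal sketch under assumptions not stated in these notes: 4-bit E2M1 codes packed two per byte with the low nibble first, one block scale per 16 values, and one global scale per tensor. It uses Python floats, so the final rounding to BF16 is not modeled, and all function names are hypothetical.

```python
# Magnitudes representable by a 3-bit E2M1 code (the 4th bit is the sign).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # values sharing one block scale (assumed NVFP4 block size)


def decode_e2m1(code):
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def unpack_nibbles(packed):
    """Split each byte into two 4-bit codes, low nibble first (assumed order)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize_row(packed, block_scales, global_scale):
    """value = e2m1(code) * block_scale * global_scale, one block scale per 16 values."""
    codes = unpack_nibbles(packed)
    if len(codes) != BLOCK * len(block_scales):
        raise ValueError("expected one block scale per 16 values")
    return [
        decode_e2m1(code) * block_scales[i // BLOCK] * global_scale
        for i, code in enumerate(codes)
    ]


# One block of 16 values packed into 8 bytes.
packed = bytes([0x21, 0xF7] + [0x00] * 6)
row = dequantize_row(packed, block_scales=[2.0], global_scale=0.5)
print(row[:4])
```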

In Progress

Component	Kernel	Status
Shared experts	CuTeDSL GEMM (1 group)	🔧 Runner WIP, scale assembly needs fixing
Attention projections	CuTeDSL GEMM	📋 Next after shared experts

WIP: Shared Expert CuTeDSL Kernel

Files:

cutedsl/shared_expert_pipeline.py — dedicated runner (needs scale assembly fix)
tests/test_shared_expert.py — standalone test

Issue: Tried reusing the MoE runner with num_experts=1. It fails because the runner's scatter step (output.scatter_add_) expects expert output of shape [tokens, hidden_size], whereas the shared expert operates on HC_DIM (28672), which differs from hidden_size.

Fix needed: Dedicated runner with correct scale assembly for num_groups=1. The MoE runner's _assemble_scales_cudagraph_safe is the template. For a single group:

No expert offsets needed
No scatter needed (all tokens go to the same expert)
Scale assembly is just: quantize activation → pad to 128-row alignment → Blackwell swizzle
Simpler than the MoE case
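
The pad-and-swizzle part of that chain can be sketched in pure Python. It assumes the 128x4 tiled scale-factor layout that cuBLAS documents for block-scaled formats; whether the CuTeDSL kernel expects exactly this layout is not confirmed here. Activation quantization is left out, and `assemble_scales` and its helpers are hypothetical names, not the runner's API.

```python
ROW_ALIGN = 128              # rows per scale tile
COL_ALIGN = 4                # scale columns per tile
TILE = ROW_ALIGN * COL_ALIGN  # 512 scale factors per tile


def ceil_div(a, b):
    return (a + b - 1) // b


def padded_shape(rows, cols):
    """Round rows up to a multiple of 128 and scale columns up to a multiple of 4."""
    return ceil_div(rows, ROW_ALIGN) * ROW_ALIGN, ceil_div(cols, COL_ALIGN) * COL_ALIGN


def swizzled_offset(r, c, n_col_tiles):
    """Flat offset of scale (r, c) in the assumed 128x4 tiled layout."""
    tile = (r // ROW_ALIGN) * n_col_tiles + (c // COL_ALIGN)
    within = (r % 32) * 16 + ((r % ROW_ALIGN) // 32) * 4 + (c % COL_ALIGN)
    return tile * TILE + within


def assemble_scales(scales, pad_value=0.0):
    """Single-group assembly: pad to tile alignment, then write in swizzled order."""
    rows, cols = len(scales), len(scales[0])
    p_rows, p_cols = padded_shape(rows, cols)
    n_col_tiles = p_cols // COL_ALIGN
    out = [pad_value] * (p_rows * p_cols)
    for r in range(rows):
        for c in range(cols):
            out[swizzled_offset(r, c, n_col_tiles)] = scales[r][c]
    return out


# 5 tokens, 3 scale columns (K = 48 with 16-element blocks).
scales = [[float(r * 10 + c) for c in range(3)] for r in range(5)]
flat = assemble_scales(scales)
print(len(flat), flat[16])
```

With a single group there are no per-expert offsets to track, so the whole buffer is one padded region and the index math above is all that is left of the assembly.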

Plan

Phase 1: Shared Expert Kernel (WIP)

Fix shared_expert_pipeline.py — implement scale assembly for num_groups=1
Test with test_shared_expert.py — target cosine ≥ 0.98 vs BF16 reference
Add cudagraph test
Wire into vLLM via DeepseekV4MoE.forward()
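
The cosine target in the test step could be checked with a helper along these lines. This is a sketch only: it works on flat Python lists rather than tensors, `check_against_reference` is a hypothetical name, and the 0.98 default is taken from the target above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero output never matches a reference
    return dot / (norm_a * norm_b)


def check_against_reference(out, ref, threshold=0.98):
    """Pass only if the output has no NaN and is close in direction to ref."""
    if any(math.isnan(x) for x in out):
        return False
    return cosine_similarity(out, ref) >= threshold


ref = [1.0, 2.0, 3.0, 4.0]
good = [1.0, 2.5, 3.0, 4.0]
bad = [1.0, float("nan"), 3.0, 4.0]
print(check_against_reference(good, ref), check_against_reference(bad, ref))
```

Cosine similarity ignores overall magnitude, so an output that is uniformly off by a constant factor still scores 1.0. A separate magnitude check may be worth adding next to it.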

Phase 2: Attention NVFP4 Kernel

Each attention projection is a standard NVFP4 GEMM
fused_wqa_wkv has dual weight_scale_2 (same as MoE gate+up)
Handle wo_a — currently FP8, could stay FP8 or go native NVFP4
Test each projection individually, then integrate
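
The dual weight_scale_2 point can be illustrated on a single output row. The sketch assumes that the fused weight concatenates two projections along the output dimension and that each half was quantized with its own per-tensor weight_scale_2. The split sizes and the function name are hypothetical.

```python
def apply_fused_global_scales(raw_out, split_sizes, global_scales):
    """Scale each segment of a fused GEMM output row by its own global scale.

    raw_out:       one output row, before any per-tensor scale is applied
    split_sizes:   output width of each fused projection, in order
    global_scales: one per-tensor scale per fused projection
    """
    if len(split_sizes) != len(global_scales):
        raise ValueError("need one global scale per fused projection")
    if sum(split_sizes) != len(raw_out):
        raise ValueError("split sizes must cover the whole output row")
    scaled = []
    start = 0
    for size, scale in zip(split_sizes, global_scales):
        scaled.extend(v * scale for v in raw_out[start:start + size])
        start += size
    return scaled


raw = [1.0, 2.0, 3.0, 4.0, 5.0]
out = apply_fused_global_scales(raw, split_sizes=[2, 3], global_scales=[0.5, 2.0])
print(out)
```

Applying one scalar across the whole fused output would mis-scale one of the two halves, which is why the two scales have to be kept apart through the epilogue.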

Phase 3: Clean Up

Remove all BF16 dequant code
Remove vllm/patches/utils.py patch
Remove _post_quant_fix()
All NVFP4 through CuTeDSL, no vLLM FlashInfer kernels

Memory Layout

Component	NVFP4 Size	BF16 Size	Notes
Shared expert (per layer)	33MB	66MB	Small, 2GB total
Attention (per layer)	~TBD	~TBD	5 projections
MoE experts (per layer)	~TBD	~TBD	48 experts, stays NVFP4
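
Once the projection shapes are known, the TBD rows could be estimated with a helper like the one below. It assumes NVFP4 stores two 4-bit values per byte plus one 1-byte FP8 scale per 16-element block, and it ignores the per-tensor global scale. The parameter count in the example is illustrative, not a real layer size.

```python
def nvfp4_bytes(num_params, block_size=16):
    """Approximate NVFP4 storage: packed 4-bit values plus per-block FP8 scales."""
    if num_params % block_size != 0:
        raise ValueError("num_params must be a multiple of the block size")
    packed = num_params // 2           # two 4-bit values per byte
    scales = num_params // block_size  # one 1-byte scale per block
    return packed + scales


def bf16_bytes(num_params):
    """BF16 storage: two bytes per value."""
    return 2 * num_params


n = 1 << 20  # illustrative parameter count
print(nvfp4_bytes(n), bf16_bytes(n))
```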
