# Current State: Building our own NVFP4 kernels

**Status:** WIP (shared expert CuTeDSL kernel in progress)
**Date:** 2026-05-18

## What happened today

Spent the day debugging why vLLM produces empty or garbage output. The journey:

1. **NaN from layer 0**: diagnostic prints showed NaN from the very first layer.
2. **MoE kernel is fine**: the standalone test gives cosine 0.988 and no NaN.
3. **Root cause: `FlashInferCutlassNvFp4LinearKernel` uses a broken `input_scale`**: the checkpoint values cause 3977x amplification during activation quantization, which produces NaN.
4. **BF16 dequant fix**: dequantize the NVFP4 weights to BF16 and replace the quant method.
5. **`process_weights_after_loading` timing**: our fix runs inside `load_weights()`, but vLLM's quant method runs after it, so the fix gets overwritten.
6. **Post-quant hook approach**: forward pre-hooks don't fire (torch.compile and the model wrappers bypass them).
7. **Patched `utils.py`**: added a `_post_quant_fix()` call at the end of `process_weights_after_loading`. This works: 305 projections are dequantized to BF16.
8. **Still garbage**: even with 183 attention + 122 shared expert projections in BF16, the output is still empty.
9. **Conclusion: vLLM's pipeline has deeper issues.** The `FlashInferCutlassNvFp4LinearKernel` is untrustworthy on B200 (same class of C++ CUTLASS FP4 bugs we hit with MoE). Replacing it with BF16 dequant does not fix the output, so something else in vLLM's execution path is also broken.

**Decision: Build our own NVFP4 kernels for shared experts and attention.** Same CuTeDSL approach that works for MoE. Stop fighting vLLM's broken kernels.
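
To make item 3 concrete, here is a minimal pure-Python sketch of how a mis-scaled global activation scale can push NVFP4 per-block scales out of FP8 E4M3 range. It assumes the usual NVFP4 layout (16-element blocks, E2M1 values with maximum magnitude 6, E4M3 block scales with maximum finite value 448) and assumes the block scale is computed as `block_amax * global_scale / 6`. The names `block_scales` and `overflows_e4m3` and the sample activations are hypothetical; only the 3977x factor comes from the notes above. This illustrates one plausible failure mode and is not a trace of what `FlashInferCutlassNvFp4LinearKernel` actually does.

```python
FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 E2M1
FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3
BLOCK = 16            # NVFP4 scales 16 elements per block


def block_scales(activations, global_scale):
    """Per-block scale each 16-element block would need, before FP8 rounding."""
    scales = []
    for start in range(0, len(activations), BLOCK):
        block = activations[start:start + BLOCK]
        amax = max(abs(x) for x in block)
        scales.append(amax * global_scale / FP4_E2M1_MAX)
    return scales


def overflows_e4m3(scales):
    """Flag block scales that no longer fit in FP8 E4M3."""
    return [s > FP8_E4M3_MAX for s in scales]


# Two blocks of made-up activations: amax 3.0 and amax 1.5.
acts = [3.0] + [0.5] * 15 + [-1.5] + [0.25] * 15

# A global scale sized so the largest block scale lands exactly on 448.
sane_scale = FP8_E4M3_MAX * FP4_E2M1_MAX / 3.0
# The same scale amplified by the 3977x factor seen with the checkpoint values.
broken_scale = sane_scale * 3977

print(overflows_e4m3(block_scales(acts, sane_scale)))    # [False, False]
print(overflows_e4m3(block_scales(acts, broken_scale)))  # [True, True]
```

Whether an out-of-range scale saturates or turns into NaN depends on the conversion the kernel uses, which this sketch does not model.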

## Confirmed Working

| Component | Kernel | Status |
|-----------|--------|--------|
| MoE experts (384/layer) | CuTeDSL ScaledGroupedGemm | ✅ cosine 0.988, cudagraph-safe |
| All NVFP4 weights | Dequant to BF16 | ✅ Valid output in standalone test |
| Full attention weight chain | BF16 matmul | ✅ No NaN, no zeros |

## In Progress

| Component | Kernel | Status |
|-----------|--------|--------|
| Shared experts | CuTeDSL GEMM (1 group) | 🔧 Runner WIP, scale assembly needs fixing |
| Attention projections | CuTeDSL GEMM | 📋 Next after shared experts |

## WIP: Shared Expert CuTeDSL Kernel

**Files:**
- `cutedsl/shared_expert_pipeline.py`: dedicated runner (needs scale assembly fix)
- `tests/test_shared_expert.py`: standalone test

**Issue:** Reusing the MoE runner with `num_experts=1` fails because its scatter step assumes `hidden_size != HC_DIM`. The MoE runner calls `output.scatter_add_`, which expects expert output of shape `[tokens, hidden_size]`, but the shared expert operates on `HC_DIM` (28672).

**Fix needed:** A dedicated runner with correct scale assembly for `num_groups=1`, using the MoE runner's `_assemble_scales_cudagraph_safe` as the template. For a single group:
- No expert offsets needed
- No scatter needed (all tokens go to the same expert)
- Scale assembly is just: quantize activation → pad to 128-row alignment → Blackwell swizzle
- Simpler than the MoE case
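
The padding step of that single-group path can be sketched in plain Python as follows. This is a minimal sketch, not the contents of `shared_expert_pipeline.py`: `padded_rows` and `assemble_scales_single_group` are hypothetical names, the per-block scales are assumed to arrive as one row per token, and the padding rows are assumed to be zero-filled. The Blackwell swizzle is deliberately left out because its layout is not described in these notes.

```python
ROW_ALIGN = 128  # row alignment named in the notes above


def padded_rows(num_tokens, align=ROW_ALIGN):
    """Round num_tokens up to the next multiple of align."""
    return (num_tokens + align - 1) // align * align


def assemble_scales_single_group(token_scales):
    """Pad per-token block scales to a 128-row boundary.

    token_scales: one row per token, each row holding that token's
    per-block scales. With a single group there are no expert offsets
    and no scatter, so the rows are used in order.
    """
    num_tokens = len(token_scales)
    cols = len(token_scales[0]) if token_scales else 0
    total = padded_rows(num_tokens)
    padding = [[0.0] * cols for _ in range(total - num_tokens)]
    # The swizzle into the kernel's scale layout would follow here.
    return [list(row) for row in token_scales] + padding


scales = assemble_scales_single_group([[1.0, 2.0], [0.5, 0.25], [4.0, 8.0]])
print(len(scales), scales[2], scales[3])  # 128 [4.0, 8.0] [0.0, 0.0]
```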

## Plan

### Phase 1: Shared Expert Kernel (WIP)
1. Fix `shared_expert_pipeline.py`: implement scale assembly for `num_groups=1`
2. Test with `test_shared_expert.py`: target cosine ≥ 0.98 vs BF16 reference
3. Add cudagraph test
4. Wire into vLLM via `DeepseekV4MoE.forward()`
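
The acceptance check in step 2 can be sketched as below. This is a pure-Python stand-in so it runs without a GPU; the real `test_shared_expert.py` presumably compares tensors, and `cosine_similarity` and `passes_reference_check` are hypothetical names. The 0.98 threshold is the target stated above, and the explicit NaN/inf check mirrors the "no NaN" criterion used for the MoE test.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero output never matches a reference
    return dot / (norm_a * norm_b)


def passes_reference_check(candidate, reference, threshold=0.98):
    """Return (ok, cosine); ok is False on NaN/inf or low similarity."""
    if len(candidate) != len(reference):
        raise ValueError("candidate and reference must have the same length")
    if any(math.isnan(x) or math.isinf(x) for x in candidate):
        return False, float("nan")
    cos = cosine_similarity(candidate, reference)
    return cos >= threshold, cos


reference = [1.0, 2.0, 3.0]
print(passes_reference_check([1.0, 2.0, 3.1], reference)[0])           # True
print(passes_reference_check([1.0, float("nan"), 3.0], reference)[0])  # False
```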

### Phase 2: Attention NVFP4 Kernel
- Each attention projection is a standard NVFP4 GEMM
- `fused_wqa_wkv` has dual `weight_scale_2` (same as MoE gate+up)
- Handle `wo_a`: currently FP8, could stay FP8 or go native NVFP4
- Test each projection individually, then integrate
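
The dual-scale point for `fused_wqa_wkv` can be illustrated with a small sketch. It assumes `weight_scale_2` is a per-tensor global scale and that the fused weight is two projections concatenated along the output dimension, so each half of the output columns needs its own factor. The function `apply_dual_global_scale`, the `split` parameter, and the sample values are hypothetical. Applying the scales as a post-GEMM column multiply is one option among several, not necessarily how the CuTeDSL kernel will handle it.

```python
def apply_dual_global_scale(raw_out, split, scale_a, scale_b):
    """Scale output columns [0, split) by scale_a and [split, N) by scale_b.

    raw_out: GEMM result before the per-tensor global scales are applied,
    given as a list of rows.
    """
    return [
        [value * (scale_a if col < split else scale_b)
         for col, value in enumerate(row)]
        for row in raw_out
    ]


raw = [[1.0, 2.0, 3.0, 4.0]]
print(apply_dual_global_scale(raw, split=2, scale_a=0.5, scale_b=2.0))
# [[0.5, 1.0, 6.0, 8.0]]
```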

### Phase 3: Clean Up
- Remove all BF16 dequant code
- Remove the `vllm/patches/utils.py` patch
- Remove `_post_quant_fix()`
- Route all NVFP4 through CuTeDSL, with no vLLM FlashInfer kernels

## Memory Layout

| Component | NVFP4 Size | BF16 Size | Notes |
|-----------|-----------|-----------|-------|
| Shared expert (per layer) | 33 MB | 66 MB | Small, 2 GB total |
| Attention (per layer) | TBD | TBD | 5 projections |
| MoE experts (per layer) | TBD | TBD | 48 experts, stays NVFP4 |