CURRENT_BUG.md

# Current State: Building our own NVFP4 kernels

**Status:** WIP — shared expert CuTeDSL kernel in progress
**Date:** 2026-05-18

## What happened today

Spent the day debugging why vLLM produces empty/garbage output. The journey:

1. **NaN from layer 0** — diagnostic prints showed NaN from the very first layer
2. **MoE kernel is fine** — standalone test: cosine 0.988, no NaN
3. **Root cause: `FlashInferCutlassNvFp4LinearKernel` uses broken `input_scale`** — checkpoint values cause 3977x amplification during activation quantization → NaN
4. **BF16 dequant fix** — dequantize NVFP4 weights to BF16, replace quant method
5. **`process_weights_after_loading` timing** — our fix runs inside `load_weights()`, but vLLM's quant method runs AFTER. Fix gets overwritten.
6. **Post-quant hook approach** — forward pre-hooks don't fire (torch.compile + model wrappers bypass them)
7. **Patched `utils.py`** — added `_post_quant_fix()` call at end of `process_weights_after_loading`. This works — 305 projections dequantized to BF16.
8. **Still garbage** — even with 183 attention + 122 shared expert projections in BF16, output is still empty.
9. **Conclusion: vLLM's pipeline has deeper issues.** The `FlashInferCutlassNvFp4LinearKernel` is untrustworthy on B200 (same class of C++ CUTLASS FP4 bugs we hit with MoE). BF16 dequant doesn't fix it because something else is broken in vLLM's execution path.
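
The ordering problem in steps 5 to 7 can be reproduced in a few lines. This is a minimal sketch in plain Python, not vLLM code: `FakeQuantMethod`, the dict-based `layer`, and `patch_post_quant` are hypothetical stand-ins, and only the names `process_weights_after_loading` and `_post_quant_fix` come from the notes above. It shows why a fix applied during loading gets overwritten, and why calling the fix at the end of the post-load step survives.

```python
import functools


class FakeQuantMethod:
    """Hypothetical stand-in for a quant method; not the real vLLM class."""

    def process_weights_after_loading(self, layer):
        # Repack step that runs AFTER load_weights() and clobbers earlier fixes.
        layer["weight"] = "nvfp4-repacked"


def _post_quant_fix(layer):
    # Stand-in for the BF16 dequant fix.
    layer["weight"] = "bf16-dequantized"


def patch_post_quant(cls, fix):
    """Wrap the post-load step so `fix` always runs last."""
    original = cls.process_weights_after_loading

    @functools.wraps(original)
    def wrapped(self, layer):
        result = original(self, layer)
        fix(layer)  # runs after the quant method has finished
        return result

    cls.process_weights_after_loading = wrapped


# Fix applied too early (inside load_weights): the quant method overwrites it.
layer = {"weight": "bf16-dequantized"}
FakeQuantMethod().process_weights_after_loading(layer)
early = layer["weight"]

# Fix applied at the end of the post-load step: it survives.
patch_post_quant(FakeQuantMethod, _post_quant_fix)
FakeQuantMethod().process_weights_after_loading(layer)
late = layer["weight"]
```

Wrapping the method keeps the original behavior intact and makes the patch easy to remove later, which matters because Phase 3 deletes it.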

**Decision: Build our own NVFP4 kernels for shared experts and attention.** Same CuTeDSL approach that works for MoE. Stop fighting vLLM's broken kernels.

## Confirmed Working

| Component | Kernel | Status |
|-----------|--------|--------|
| MoE experts (384/layer) | CuTeDSL ScaledGroupedGemm | ✅ cosine 0.988, cudagraph-safe |
| All NVFP4 weights | Dequant to BF16 | ✅ Valid output in standalone test |
| Full attention weight chain | BF16 matmul | ✅ No NaN, no zeros |
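
For reference, the dequant path in the table can be sketched in plain Python. This is an illustration under stated assumptions, not the code used in the repo: it assumes two E2M1 values packed per byte with the low nibble first, one block scale per 16 elements, and a single per-tensor global scale (the role `weight_scale_2` plays in the checkpoint). The function names are hypothetical, and the real path works on tensors and casts the result to BF16.

```python
# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 code to a float."""
    magnitude = E2M1_VALUES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def dequant_nvfp4_row(packed, block_scales, global_scale, block_size=16):
    """Dequantize one row of packed FP4 codes.

    packed:       bytes, two FP4 codes per byte (assumed: low nibble first)
    block_scales: one float per `block_size` elements
    global_scale: per-tensor float scale
    """
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            scale = block_scales[len(out) // block_size] * global_scale
            out.append(decode_e2m1(nibble) * scale)
    return out


# 8 bytes -> 16 values -> exactly one block.
row = dequant_nvfp4_row(bytes([0x21, 0xF9] + [0] * 6), [2.0], 0.5)
```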

## In Progress

| Component | Kernel | Status |
|-----------|--------|--------|
| Shared experts | CuTeDSL GEMM (1 group) | 🔧 Runner WIP, scale assembly needs fixing |
| Attention projections | CuTeDSL GEMM | 📋 Next after shared experts |

## WIP: Shared Expert CuTeDSL Kernel

**Files:**
- `cutedsl/shared_expert_pipeline.py` — dedicated runner (needs scale assembly fix)
- `tests/test_shared_expert.py` — standalone test

**Issue:** Reusing the MoE runner with `num_experts=1` fails because the MoE runner's scatter assumes `hidden_size != HC_DIM`. The runner calls `output.scatter_add_`, which expects expert output of shape `[tokens, hidden_size]`, but the shared expert operates on HC_DIM (28672).

**Fix needed:** Dedicated runner with correct scale assembly for `num_groups=1`. The MoE runner's `_assemble_scales_cudagraph_safe` is the template. For a single group:
- No expert offsets needed
- No scatter needed (all tokens go to the same expert)
- Scale assembly is just: quantize activation → pad to 128-row alignment → Blackwell swizzle
- Simpler than the MoE case
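
The first two steps of the single-group path (quantize, then pad) can be sketched as follows. This is a simplified illustration, not the contents of `shared_expert_pipeline.py`: the function names are hypothetical, the scales stay in float rather than being rounded to FP8, no global scale is applied, and the Blackwell swizzle is deliberately left out. It assumes at least one token and a K dimension divisible by 16.

```python
FP4_MAX = 6.0    # largest E2M1 magnitude
BLOCK = 16       # NVFP4 block size along K
ROW_ALIGN = 128  # row alignment named in the list above


def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return (n + multiple - 1) // multiple * multiple


def block_scales(row):
    """One scale per 16-element block, chosen so the block amax maps to FP4 max."""
    return [
        max(abs(v) for v in row[i:i + BLOCK]) / FP4_MAX
        for i in range(0, len(row), BLOCK)
    ]


def assemble_scales_single_group(x):
    """x: [tokens][K] activations. Returns [padded_tokens][K // 16] scales.

    With one group there are no expert offsets: every token row belongs to the
    same expert, so the only bookkeeping is zero-padding the row count.
    """
    scales = [block_scales(row) for row in x]
    n_blocks = len(x[0]) // BLOCK
    padded_rows = round_up(len(x), ROW_ALIGN)
    scales += [[0.0] * n_blocks for _ in range(padded_rows - len(scales))]
    return scales


x = [[float(j) for j in range(32)] for _ in range(3)]
s = assemble_scales_single_group(x)
```

Here 3 token rows pad up to 128 scale rows, and each row carries 2 block scales for K = 32.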

## Plan

### Phase 1: Shared Expert Kernel (WIP)
1. Fix `shared_expert_pipeline.py` — implement scale assembly for num_groups=1
2. Test with `test_shared_expert.py` — target cosine ≥ 0.98 vs BF16 reference
3. Add cudagraph test
4. Wire into vLLM via `DeepseekV4MoE.forward()`
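
The acceptance check in step 2 could look roughly like this. It is a plain-Python sketch with hypothetical function names; the actual test presumably operates on flattened tensors, and only the 0.98 threshold comes from the plan above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero output as a failure, not a match
    return dot / (norm_a * norm_b)


def passes_reference_check(out, ref, threshold=0.98):
    """Reject NaN/Inf outright, then compare against the BF16 reference."""
    if any(math.isnan(v) or math.isinf(v) for v in out):
        return False
    return cosine_similarity(out, ref) >= threshold
```

Checking for NaN and Inf before computing the cosine avoids a comparison against NaN silently evaluating to `False` for the wrong reason, and an all-zero output is rejected explicitly since empty output is one of the failure modes seen today.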

### Phase 2: Attention NVFP4 Kernel
- Each attention projection is a standard NVFP4 GEMM
- `fused_wqa_wkv` has dual weight_scale_2 (same as MoE gate+up)
- Handle `wo_a` — currently FP8, could stay FP8 or go native NVFP4
- Test each projection individually, then integrate
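
One way to think about the dual `weight_scale_2` case: if the fused weight stacks two projections that were each quantized with their own global scale, the output columns of each segment need their own multiplier. The sketch below illustrates that arithmetic only. The function name, the `split` parameter, and the choice to apply the scale after accumulation are assumptions for illustration, not how the MoE kernel or any vLLM code does it.

```python
def fused_gemm_dual_scale(x, w_fused, split, scale_a, scale_b):
    """Row-vector GEMM against a fused weight with two global scales.

    x:        [K] activations
    w_fused:  [N][K] weight rows, block-dequantized but without global scale
    split:    first output index belonging to the second projection
    scale_a:  global scale for output rows [0, split)
    scale_b:  global scale for output rows [split, N)
    """
    out = []
    for n, w_row in enumerate(w_fused):
        acc = sum(xi * wi for xi, wi in zip(x, w_row))
        out.append(acc * (scale_a if n < split else scale_b))
    return out


y = fused_gemm_dual_scale(
    [1.0, 2.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    split=1,
    scale_a=2.0,
    scale_b=0.5,
)
```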

### Phase 3: Clean Up
- Remove all BF16 dequant code
- Remove `vllm/patches/utils.py` patch
- Remove `_post_quant_fix()` 
- All NVFP4 through CuTeDSL, no vLLM FlashInfer kernels

## Memory Layout

| Component | NVFP4 Size | BF16 Size | Notes |
|-----------|-----------|-----------|-------|
| Shared expert (per layer) | 33 MB | 66 MB | Small, 2 GB total |
| Attention (per layer) | TBD | TBD | 5 projections |
| MoE experts (per layer) | TBD | TBD | 48 experts, stays NVFP4 |