README: comprehensive update with current kernel status

2026-05-20 04:42:57 +00:00
parent a30d9eb523
commit 06bf4f482d

153
README.md

@@ -22,60 +22,100 @@ vLLM's internal kernels (FlashMLA, fp8_ds_mla, fused compressor, Triton indexer)
---
## What We Have
## Kernel Status
### ✅ CuTeDSL NVFP4 Grouped GEMM (the building block)
`ScaledGroupedGemmKernel` in `cutedsl/kernel/moe/torch_scaled_grouped_mm.py` — a production-grade NVFP4 grouped GEMM kernel:
`ScaledGroupedGemmKernel` in `cutedsl/kernel/moe/torch_scaled_grouped_mm.py`:
- 2D×3D scenario: A(M,K) × B(E,K,N) → C(M,N)
- Block-scaled: per-16-element FP8 scales on both A and B sides
- Global scales (per-expert) for full dynamic range
- Persistent scheduler, TMA pipelining, SMEM swizzle
- CUDAGraph-safe (workspace pre-allocated, no runtime allocations)
### ✅ Bridge Layer (`cutedsl/bridge.py`)
### ✅ Fused SwiGLU GEMM (L1 gate+up with SwiGLU in registers)
- `quantize_to_nvfp4()` — BF16 → NVFP4 with global scale
- `quantize_activation_nvfp4()` — cudagraph-safe quantize (pre-computed gs)
- `quantize_weight_to_nvfp4()` — weight quantization (along K dim)
- `interleave_l1_weights()` / `deinterleave_l1_weights()` — gate/up interleave at granularity 8 BF16
- `make_b_k_major()` — B tensor stride conversion
- `assemble_scales_2d_side()` / `assemble_scales_3d_side()` — scale assembly + swizzle
- `warmup_compilation()` / `warmup_fused_swiglu_compilation()` — eager JIT compilation
- `run_nvfp4_grouped_gemm()` / `run_fused_swiglu_grouped_gemm()` — kernel entry points
`FusedSwiGLUScaledGroupedGemmKernel` in `cutedsl/kernel/moe/fused_swiglu_grouped_mm.py`:
- Extends the base GEMM with an in-epilogue SwiGLU
- **Weight interleave**: `interleave_l1_weights()` interleaves gate/up at granularity 8 BF16
- **epi_tile=(128, 8)**: each 8-wide subtile is pure gate or pure up
- **Subtile-level pairing**: even subtiles = gate (SiLU in FP32, save to register buffer), odd subtiles = up (load silu(gate) from buffer, compute silu(gate)*up)
- Output: BF16 with interleaved [silu(gate), silu(gate)*up] at granularity 8
- **Cosine 0.988** vs BF16 reference (full MoE pipeline)
### ✅ MoE Runner (`cutedsl/runner.py`)
### ✅ Custom CUDA De-interleave + NVFP4 Quantize
`CuTeDSLMoERunner` — runs the MoE forward pass:
1. Quantize input BF16 → NVFP4 (using pre-computed gs)
2. L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up interleaved, de-interleave then split)
3. SiLU(gate) * up → BF16 (PyTorch — being replaced by fused kernel)
4. Re-quantize BF16 → NVFP4
5. L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
6. Scatter with routing weights
`cutedsl/kernels/deinterleave_quantize.cu`:
- Single GPU kernel: reads fused L1 BF16 output, extracts SwiGLU from odd 8-col groups, quantizes to NVFP4
- Replaces the Python `deinterleave_l1_weights()` + `quantize_activation_nvfp4()` path
- **4.3x faster** (0.043ms vs 0.184ms for 128 tokens)
- **99.97% cosine match** with Python reference, 99.7% FP4 byte match
- Saves ~8.5ms over 60 MoE layers
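
The timing figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the quoted numbers, assuming the de-interleave + quantize step runs once per MoE layer per forward pass; the variable names are illustrative and not part of the repository.

```python
# Figures quoted in the list above (128 tokens).
python_path_ms = 0.184   # Python de-interleave + quantize path
cuda_kernel_ms = 0.043   # single custom CUDA kernel
moe_layers = 60          # assumption: one call per MoE layer

speedup = python_path_ms / cuda_kernel_ms
saved_per_layer_ms = python_path_ms - cuda_kernel_ms
saved_total_ms = saved_per_layer_ms * moe_layers

print(f"speedup: {speedup:.1f}x")
print(f"saved per layer: {saved_per_layer_ms:.3f} ms")
print(f"saved over {moe_layers} layers: {saved_total_ms:.2f} ms")
```

The per-layer saving of roughly 0.141 ms times 60 layers gives about 8.46 ms, consistent with the "~8.5ms" figure.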
### ✅ NVFP4 Linear (`cutedsl/nvfp4_linear.py`)
`CuTeDSLNvfp4Linear` — single-expert NVFP4 GEMM for shared experts and attention projections.
### ✅ Fused SwiGLU Kernel (Stage 1: BF16 output)
### ✅ Blackwell Attention (standalone, not yet in vLLM)
`fused_swiglu_grouped_mm.py` — extends `ScaledGroupedGemmKernel` with a fused SwiGLU epilogue:
- **Weight interleave**: L1 gate/up weights interleaved at granularity 8 BF16
- **epi_tile=(128, 8)**: each 8-wide subtile is pure gate or pure up
- **Subtile-level pairing**: even subtiles = gate (compute SiLU, save to register buffer), odd subtiles = up (load SiLU(gate) from buffer, compute silu(gate)*up)
- **Stage 1 DONE**: BF16 output with SwiGLU, cosine 0.977 vs BF16 reference
- **Stage 2 NEXT**: NVFP4 quantize in epilogue, direct FP4 TMA store for L2
- `cutedsl/blackwell_attention.py` — KV cache write/read, full attention pipeline
- `cutedsl/csa_attention.py` — CSA (cr=4) and HCA (cr=128) sparse attention
- All standalone tests pass: KV cache (0.9997), CSA/HCA, prefill+decode (0.9998)
---
## Bridge Layer (`cutedsl/bridge.py`)
Quantization, layout, and kernel launch utilities:
| Function | Purpose |
|----------|---------|
| `quantize_to_nvfp4()` | BF16 → NVFP4 with global scale |
| `quantize_activation_nvfp4()` | CUDAGraph-safe quantize (pre-computed gs) |
| `quantize_weight_to_nvfp4()` | Weight quantization along K dim |
| `interleave_l1_weights()` | Gate/up interleave at granularity 8 BF16 |
| `deinterleave_l1_weights()` | Reverse the interleave |
| `deinterleave_quantize_nvfp4_cuda()` | Custom CUDA: de-interleave + quantize in one kernel |
| `make_b_k_major()` | B tensor stride conversion |
| `assemble_scales_2d_side()` / `assemble_scales_3d_side()` | Scale assembly + swizzle |
| `warmup_compilation()` | Eager JIT compilation (base GEMM) |
| `warmup_fused_swiglu_compilation()` | Eager JIT compilation (fused SwiGLU) |
| `run_nvfp4_grouped_gemm()` | Base GEMM entry point |
| `run_fused_swiglu_grouped_gemm()` | Fused SwiGLU GEMM entry point |
---
## MoE Pipeline
### Non-Fused Path
`CuTeDSLMoERunner` / `run_nvfp4_moe()`:
1. Quantize input BF16 → NVFP4 (pre-computed gs)
2. L1 GEMM: NVFP4 × NVFP4 → BF16 (gate+up interleaved)
3. De-interleave, split gate/up
4. SiLU(gate) * up → BF16 (PyTorch)
5. Re-quantize BF16 → NVFP4
6. L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
7. Scatter with routing weights
### Fused Path
`run_nvfp4_moe_fused()` / `CuTeDSLMoERunner(fused_swiglu=True)`:
1. Quantize input BF16 → NVFP4 (pre-computed gs)
2. **Fused L1 GEMM + SwiGLU** in kernel registers → BF16 TMA store
3. **Custom CUDA kernel**: de-interleave + NVFP4 quantize (0.043ms)
4. L2 GEMM: NVFP4 × NVFP4 → BF16 (down_proj)
5. Scatter with routing weights
**Both paths: cosine 0.988 vs BF16 reference.** Fused path is marginally more accurate (FP32 SiLU in registers vs PyTorch BF16 SiLU).
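
For readers unfamiliar with the metric, here is a minimal sketch of a cosine comparison between a kernel output and a reference. It assumes both outputs are flattened to plain lists of floats; the actual test harness is not shown in this README, so `cosine_similarity` is a hypothetical helper and not the repository's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    assert len(a) == len(b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up data: a reference output and a slightly perturbed copy.
reference = [0.5, -1.25, 2.0, 0.75]
quantized = [0.5, -1.0, 2.0, 1.0]
print(cosine_similarity(reference, quantized))
```

A value of 1.0 means the two outputs point in exactly the same direction; quantization error pulls the value below 1.0.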
---
## Correctness Bugs Fixed (May 20, 2026)
All 5 bugs fixed, committed, pushed:
| Bug | Issue | Fix |
|-----|-------|-----|
| 1 | `_needs_token_refill` myth — cute.compile doesn't corrupt GPU memory | Removed hack, added `warmup_compilation()`, pre-allocated workspace per cache entry |
| 1 | `_needs_token_refill` myth — cute.compile doesn't corrupt GPU memory | Removed hack, pre-allocated workspace per cache entry |
| 2 | Dequantize→requantize supposedly lossy | Verified 100% byte-identical round-trip. Deprecated `prepare_weights_from_dequantized` |
| 3 | `clamp(min=1e-8)` on zero blocks gives nonzero FP8 scale | Detect zero blocks, force FP8 scale to exact 0 |
| 4 | Underflow blocks (amax < 6×2⁻⁹) get nonzero FP4 from div-by-tiny-number | Detect underflow blocks, zero x_norm before division |
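
The zero-block and underflow-block fixes (bugs 3 and 4) can be sketched in plain Python. This is a simplified illustration, not the bridge code: it works on one 16-element block, ignores the global scale, and assumes FP4 E2M1 has a maximum magnitude of 6 and FP8 E4M3 has a smallest positive value of 2⁻⁹, which is where the 6×2⁻⁹ threshold in the table comes from. Reporting a zero scale for underflow blocks as well is an assumption of this sketch.

```python
FP4_MAX = 6.0                       # largest FP4 E2M1 magnitude
FP8_MIN_POSITIVE = 2.0 ** -9        # smallest positive FP8 E4M3 value
UNDERFLOW_AMAX = FP4_MAX * FP8_MIN_POSITIVE   # 6 x 2^-9, as in the table

def naive_block_scale(block):
    """Old behaviour: the clamp keeps the scale nonzero even for zero blocks."""
    amax = max(abs(v) for v in block)
    return max(amax / FP4_MAX, 1e-8)

def block_scale_and_normalized(block):
    """Return (scale, values to be rounded to FP4) for one block."""
    amax = max(abs(v) for v in block)
    if amax < UNDERFLOW_AMAX:
        # Covers exact-zero blocks and blocks whose scale would underflow FP8:
        # emit an exact-zero scale and an all-zero payload, with no division.
        return 0.0, [0.0] * len(block)
    scale = amax / FP4_MAX
    return scale, [v / scale for v in block]

zero_block = [0.0] * 16
print(naive_block_scale(zero_block))           # nonzero scale from the clamp
print(block_scale_and_normalized(zero_block))  # exact-zero scale and payload
```

Checking the threshold before dividing is what avoids the division by a tiny number described for bug 4.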
@@ -102,45 +142,46 @@ With `epi_tile_n=8`, each epilogue subtile covers exactly 8 BF16 N-columns. So e
```
for subtile_idx in range(subtile_cnt):
acc_vec = load_accumulator(subtile_idx)
acc_vec_bf16 = acc_vec.to(bf16) # init before dynamic if
if even (gate):
silu_result = silu(acc_vec)
silu_gate_buf = silu_result # save to register buffer
silu_result = silu(acc_vec) # FP32 math
silu_gate_buf = silu_result # save to register buffer
acc_vec_bf16 = silu_result
if odd (up):
gate_vals = silu_gate_buf # from previous iteration
gate_vals = silu_gate_buf # from previous iteration
acc_vec_bf16 = gate_vals * acc_vec # SwiGLU
store_to_smem(acc_vec_bf16)
tma_store_to_gmem()
```
No runtime conditional affects tensor structure. The `silu_gate_buf` is a register buffer initialized before the loop. Both branches produce `acc_vec_bf16` of the same type.
Both branches produce `acc_vec_bf16` of the same BF16 type. No runtime conditional affects tensor structure. The `silu_gate_buf` is a register buffer initialized before the loop.
**The output** has interleaved [silu(gate), silu(gate)*up] at granularity 8. De-interleave recovers the standard [silu(gate) | silu(gate)*up] layout. The up columns contain the SwiGLU result.
**The output** has interleaved [silu(gate), silu(gate)*up] at granularity 8. The custom CUDA kernel extracts odd 8-col groups (the SwiGLU result) and quantizes to NVFP4 for the L2 GEMM.
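
The subtile pairing and the odd-group extraction can be emulated on the host as a reference. This is a plain-Python sketch for a single accumulator row, assuming the interleaved layout described above; `fused_epilogue_row` and `extract_swiglu` are hypothetical names, and the sketch does not model BF16 rounding, SMEM staging, or NVFP4 quantization.

```python
import math

G = 8  # interleave granularity in columns, matching epi_tile_n = 8

def silu(x):
    return x / (1.0 + math.exp(-x))

def fused_epilogue_row(acc_row):
    """Emulate the subtile loop for one row laid out as
    [gate0..7, up0..7, gate8..15, up8..15, ...]."""
    assert len(acc_row) % (2 * G) == 0
    out = []
    silu_gate_buf = [0.0] * G                  # initialised before the loop
    for subtile_idx in range(len(acc_row) // G):
        acc_vec = acc_row[subtile_idx * G:(subtile_idx + 1) * G]
        if subtile_idx % 2 == 0:               # even subtile: gate
            silu_gate_buf = [silu(v) for v in acc_vec]
            out.extend(silu_gate_buf)
        else:                                  # odd subtile: up
            out.extend(g * u for g, u in zip(silu_gate_buf, acc_vec))
    return out

def extract_swiglu(out_row):
    """Keep only the odd 8-column groups, i.e. silu(gate) * up."""
    result = []
    for start in range(G, len(out_row), 2 * G):
        result.extend(out_row[start:start + G])
    return result

gate = [0.1 * i for i in range(16)]
up = [1.0 - 0.05 * i for i in range(16)]
acc_row = gate[0:8] + up[0:8] + gate[8:16] + up[8:16]
swiglu = extract_swiglu(fused_epilogue_row(acc_row))
expected = [silu(g) * u for g, u in zip(gate, up)]
print(all(abs(a - b) < 1e-12 for a, b in zip(swiglu, expected)))
```

Because each gate subtile is immediately followed by its matching up subtile, a single 8-element buffer is enough to carry `silu(gate)` from one iteration to the next.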
### The `//2` Bug in `interleave_l1_weights`
### The `//2` Bug
The original function had `g = granularity_bf16 // 2`, which is correct for K-axis interleave (where FP4 byte-packing gives 2 BF16 per element along K). But we interleave along N, where each N-column = 1 BF16 column. The `//2` was a leftover that silently gave g=4 instead of g=8, producing granularity 4 instead of 8. **Fixed**: `g = granularity_bf16` (no `//2`).
`interleave_l1_weights` had `g = granularity_bf16 // 2`, correct for K-axis interleave (FP4 packing along K). But we interleave along N, where each N-column = 1 BF16 column. The `//2` was a K-axis leftover that silently gave g=4 instead of g=8. **Fixed**: `g = granularity_bf16` (no `//2`).
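
The effect of the stray `//2` is easy to see on column labels. The following is an illustrative sketch only: `interleave_columns` is a hypothetical stand-in that operates on Python lists, not the real `interleave_l1_weights`, whose signature is not shown here.

```python
def interleave_columns(gate_cols, up_cols, granularity):
    """Interleave two equal-length column lists in groups of `granularity`."""
    assert len(gate_cols) == len(up_cols)
    assert len(gate_cols) % granularity == 0
    out = []
    for start in range(0, len(gate_cols), granularity):
        out.extend(gate_cols[start:start + granularity])
        out.extend(up_cols[start:start + granularity])
    return out

gate = [f"g{i}" for i in range(16)]
up = [f"u{i}" for i in range(16)]

granularity_bf16 = 8
fixed = interleave_columns(gate, up, granularity_bf16)        # g = 8
buggy = interleave_columns(gate, up, granularity_bf16 // 2)   # g = 4

# With epi_tile_n = 8, each epilogue subtile covers 8 consecutive columns.
first_subtile_fixed = fixed[:8]
first_subtile_buggy = buggy[:8]
print(first_subtile_fixed)
print(first_subtile_buggy)
```

With g=8 the first subtile holds only gate columns; with g=4 it mixes four gate and four up columns, so it is no longer "pure gate or pure up" as the subtile-level pairing requires.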
### CuTeDSL Runtime Conditionals
CuTeDSL **does** support runtime conditionals on register tensors — the rule is that both branches must produce the same tensor type (shape, layout, dtype). The earlier "blocked by type system" framing was wrong. The real issue was that the old code applied SiLU to ALL positions (just SiLU, not SwiGLU) and used `is_gate_subtile < num_gate_subtiles` which doesn't work with interleaved weights. With epi_tile_n=8 and subtile-level pairing, the conditional is clean: both branches produce `acc_vec_bf16` of the same BF16 type.
CuTeDSL **does** support runtime conditionals on register tensors, provided both branches produce the same tensor type (shape, layout, dtype). The earlier "blocked by type system" framing was wrong. The real issue was twofold: the old code applied SiLU to ALL positions (plain SiLU, not SwiGLU), and the mask-blending approach (`silu(both)*0.5`) was mathematically wrong. With epi_tile_n=8 and subtile-level pairing, the conditional is clean.
### The Global Scale Gotcha
The custom CUDA quantize kernel needs the **L2 activation global scale** (from the SwiGLU output), NOT the L1 input global scale. The L1 gs is based on the input magnitude (~0.1), while the SwiGLU output can be orders of magnitude larger. Passing the wrong gs causes the FP8 block scale to overflow, producing NaN. The runner pre-computes the L2 gs in `compute_activation_global_scales()` before CUDAGraph capture.
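
A small numeric sketch shows why the wrong global scale overflows. It assumes a common NVFP4 two-level convention in which `gs = (448 × 6) / tensor_amax` and the FP8 block scale is `(block_amax / 6) × gs`; the bridge code may define gs differently (for example as the reciprocal), so treat the formulas, the helper names, and the SwiGLU magnitude of 20.0 as illustrative assumptions.

```python
FP8_E4M3_MAX = 448.0   # largest finite FP8 E4M3 value
FP4_E2M1_MAX = 6.0     # largest FP4 E2M1 magnitude

def global_scale(tensor_amax):
    """Assumed convention: map the tensor's amax onto the top of the FP8 x FP4 range."""
    return (FP8_E4M3_MAX * FP4_E2M1_MAX) / tensor_amax

def block_scale_before_fp8_cast(block_amax, gs):
    """Block scale value that would be cast to FP8 E4M3."""
    return (block_amax / FP4_E2M1_MAX) * gs

def fits_in_fp8(scale):
    # Small tolerance for floating-point rounding in this host-side check.
    return scale <= FP8_E4M3_MAX * (1.0 + 1e-6)

l1_input_amax = 0.1      # input magnitude quoted in the text
swiglu_out_amax = 20.0   # hypothetical SwiGLU output magnitude

right = block_scale_before_fp8_cast(swiglu_out_amax, global_scale(swiglu_out_amax))
wrong = block_scale_before_fp8_cast(swiglu_out_amax, global_scale(l1_input_amax))

print(f"L2 gs: block scale {right:.1f}, fits in FP8: {fits_in_fp8(right)}")
print(f"L1 gs: block scale {wrong:.1f}, fits in FP8: {fits_in_fp8(wrong)}")
```

Under these assumptions the L1 gs pushes the block scale about 200 times past the FP8 E4M3 maximum, which matches the overflow described above.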
---
## Fused SwiGLU — Remaining Steps
## Remaining Work
| Step | What | Status |
|------|------|--------|
| 1 | Wire fused kernel into pipeline | ✅ Done |
| 2 | Custom CUDA de-interleave + quantize kernel (4.3x faster) | ✅ Done |
| 3 | In-epilogue NVFP4 quantize (replace BF16 TMA with FP4 TMA) | 🔨 Future optimization |
**Current pipeline:** Fused SwiGLU kernel → BF16 TMA store → Custom CUDA quantize → L2 GEMM
**Cosine:** 0.988 (non-fused) / 0.988 (fused) vs BF16 reference
**Quantize speedup:** 4.3x (0.043ms vs 0.184ms), saves ~8.5ms over 60 MoE layers
| What | Status | Notes |
|------|--------|-------|
| In-epilogue NVFP4 quantize (replace BF16 TMA with FP4 TMA) | 🔨 Future | Saves ~0.14ms/layer; requires register→GMEM mapping for FP4 output |
| GPU-native KV cache + attention for vLLM | 🔨 Next | All standalone kernels work; need vLLM backend wiring |
| vLLM model integration | 🔨 Next | Model definition, weight loading, attention backend |
---
@@ -163,17 +204,19 @@ Compress ratios by layer: alternating 128/4, layer 60 = 0 (SWA).
cutedsl/
├── bridge.py # Quantization, layout, kernel launch
├── nvfp4_linear.py # Single-expert NVFP4 GEMM runner
├── runner.py # MoE grouped GEMM runner
├── runner.py # MoE grouped GEMM runner (fused + non-fused)
├── blackwell_attention.py # KV cache + attention (standalone)
├── csa_attention.py # CSA/HCA attention
├── custom_ops.py # torch.autograd wrappers
├── moe_pipeline.py # Standalone test pipeline
├── moe_pipeline.py # Standalone test pipeline (fused + non-fused)
├── kernels/
│ └── deinterleave_quantize.cu # Custom CUDA: de-interleave + NVFP4 quantize
└── kernel/moe/
├── torch_scaled_grouped_mm.py # ScaledGroupedGemmKernel (the GEMM)
└── fused_swiglu_grouped_mm.py # FusedSwiGLUScaledGroupedGemmKernel
tests/
├── layertest.py # MoE layer test (PASS, 0.988 cosine)
├── layertest.py # MoE layer test — fused + non-fused (PASS, 0.988)
├── cudagraph_test.py # CUDAGraph test (PASS)
├── test_full_layer_b200.py # All NVFP4 projections (PASS, 0.994+)
├── test_v4_attention_b200.py # All 3 attention types (PASS)
@@ -185,11 +228,11 @@ tests/
---
## Key Lessons (Things We Fucked Up)
## Key Lessons
1. **⛔ NEVER assume CuTeDSL GPU tensors survive JIT compilation.** `cute.compile` zeroes GPU memory. Keep index/mapping tensors on CPU.
2. **⛔ NEVER nuke working code without understanding why it exists.** The cudagraph-safe functions exist because vLLM REQUIRES cudagraph.
2. **⛔ NEVER nuke working code without understanding why it exists.** CUDAGraph-safe functions exist because vLLM requires CUDAGraph.
3. **⛔ NEVER fabricate facts from MEMORY.md.** Verify what "works" means before citing it.
@@ -203,6 +246,8 @@ tests/
8. **⛔ CuTeDSL `if` branches must produce the same tensor type.** Both branches must yield identical (shape, layout, dtype). Initialize variables before the `if` — using values defined only inside a branch is not supported.
9. **⛔ The `//2` in interleave was a K-axis leftover.** FP4 packing is along K, not N. When interleaving along N, `g = granularity_bf16` (no `//2`). The bug silently gave granularity 4 instead of 8, which would have produced wrong register-level pairing.
9. **⛔ The `//2` in interleave was a K-axis leftover.** FP4 packing is along K, not N. When interleaving along N, `g = granularity_bf16` (no `//2`). The bug silently gave granularity 4 instead of 8.
10. **⛔ "SiLU on all positions" is NOT SwiGLU.** SwiGLU pairs silu(gate)*up. Applying SiLU to the full (M, 2×intermediate) output is just SiLU, producing wrong results. The pairing must be explicit.
10. **⛔ "SiLU on all positions" is NOT SwiGLU.** SwiGLU pairs silu(gate)*up. Applying SiLU to the full (M, 2×intermediate) output is just SiLU. The pairing must be explicit.
11. **⛔ The global scale must match the data being quantized.** Passing the L1 input gs to the SwiGLU quantize causes FP8 overflow → NaN. The gs must come from the SwiGLU output's magnitude.