363 lines
16 KiB
Markdown
363 lines
16 KiB
Markdown
# PERFORMANCE — v18 NVFP4-everywhere fusion landed
|
||
|
||
**Current state (2026-06-02).** Part 1 (P0–P3) is **LANDED**. The fused
|
||
SwiGLU kernel compiles and runs in production. The CUDA RoPE kernel
|
||
passes cos=1.000000 vs PyTorch reference. The single_shot generates
|
||
coherent English (". The capital of France is...") with the full fused
|
||
kernel stack — no NaN, no crashes, 500+ tokens decoded.
|
||
|
||
**What remains** is KV-cache dtype choices (Part 2) and higher-order
|
||
fusion (P4–P6). The model now uses NVFP4 GEMM + fused SwiGLU + CUDA RoPE
|
||
end-to-end. The KV cache is still BF16 — the next frontier.
|
||
|
||
**Tag:** `v-p0p1p2p3-fused-swiglu-cuda-rope-20260602`
|
||
|
||
**On TurboQuant — verdict first, reasoning below.** Don't use it for DSv4.
|
||
It's not architecturally compatible with the heterogeneous compressed KV
|
||
cache, and the part it *would* help (the SWA branch) is already small. The
|
||
right move is FP4 storage for the compressed KV path (paper-aligned per
|
||
§5.2.1), not vector-quantization codebooks. Full reasoning in Section 4.
|
||
|
||
---
|
||
|
||
# PART 1 — THE NVFP4-EVERYWHERE GAP (STATUS: ✅ LANDED)
|
||
|
||
## P0 — Fused SwiGLU for MoE — ✅ LANDED
|
||
|
||
**Was:** `set_fused_swiglu(True)` existed but was never called. 240+ BF16
|
||
kernel launches per token wasted on unfused SiLU+clamp+deinterleave.
|
||
|
||
**Fix (3 bugs in `fused_swiglu.py`):**
|
||
1. `kernel()` signature missing `fp4_out`, `sf_out`, `l2_global_scale` params
|
||
→ `TypeError: too many positional arguments` during `cute.compile()`
|
||
Fix: added Optional params with None defaults to kernel signature
|
||
2. `cute.math.fmin`/`cute.math.fmax` don't exist in CuTe DSL
|
||
→ Replaced with `cute.where()` for TensorSSA-compatible clamp
|
||
3. Subtile loop used `vectorize=True` (default) — incompatible with `cute.where()`
|
||
→ Changed to `cutlass.range(subtile_cnt, unroll=1)`
|
||
|
||
**Result:** Fused kernel compiles and runs. MoE L1 GEMM + SwiGLU + clamp
|
||
in a single kernel launch. ~240 BF16 launches eliminated per token.
|
||
|
||
**Commits:** fca7242 (arg fix), 3a30f35 (cute.where), 5c746bb (unroll=1)
|
||
|
||
## P1 — Fused SwiGLU for Shared Expert — ✅ LANDED
|
||
|
||
**Was:** SE had no fused path. Same unfused gap as MoE but for 1-expert variant.
|
||
|
||
**Fix:**
|
||
1. `interleave_l1_weights(granularity=8)` → `granularity_bf16=8` (wrong kwarg)
|
||
2. `_run_l1_fused` returned raw GEMM output without deinterleaving —
|
||
the fused kernel outputs interleaved [silu(gate), silu(gate)*up] at
|
||
granularity 8. Must deinterleave and extract up half (SwiGLU result).
|
||
3. Added eager `warmup_fused_swiglu_compilation(1, ...)` for SE (1-group)
|
||
|
||
**Result:** SE uses same fused kernel as MoE (num_groups=1). ~120 µs/token saved.
|
||
|
||
**Commits:** 1726cb6 (granularity_bf16), f01d3f3 (SE deinterleave), 553275d (SE warmup)
|
||
|
||
## P2 — Linear `.run()` per-call FP32 scale uploads — ✅ LANDED
|
||
|
||
**Was:** `self._gsa_buf.fill_(self._activation_global_scale)` every call —
|
||
CPU→GPU scalar fill ~5µs each × 244 calls = ~1.2ms/token.
|
||
|
||
**Fix:** `_gsa_buf` set once during init or by GPU compute (`quantize_nvfp4_gpu_fused`).
|
||
No per-call fill on the hot path.
|
||
|
||
**Result:** Zero H2D scalar transfers on the hot path.
|
||
|
||
## P3 — CUDA RoPE kernel — ✅ LANDED
|
||
|
||
**Was:** `_apply_rope` used 5-6 PyTorch ops per call (slice, clone, multiply, add, cast).
|
||
183 RoPE calls × 5 launches = ~915 launches/token.
|
||
|
||
**Fix:** Raw CUDA kernel (`rope_cuda.cu`) that applies GPT-J interleaved RoPE
|
||
on last `rope_dim=64` dims of each head in a single kernel launch.
|
||
FP32 cos/sin cache, forward + inverse, in-place operation.
|
||
|
||
**Test results:**
|
||
- Forward RoPE: cos=1.000000 vs PyTorch reference
|
||
- Inverse RoPE: cos=1.000000 vs PyTorch reference
|
||
- Round-trip (forward+inverse): cos=0.999999
|
||
- Multi-token (T=8): cos=1.000000
|
||
|
||
**Files:** `dsv4/kernels/cuda/rope_cuda.cu`, `dsv4/ops/rope_cuda.py`
|
||
|
||
**Result:** 183 RoPE calls × (5-1) = **732 launches eliminated per token**.
|
||
|
||
---
|
||
|
||
# Part 1 Summary
|
||
|
||
| Item | Status | Launches saved/token | Key fix |
|
||
|---|---|---|---|
|
||
| **P0** | ✅ Landed | ~240 (MoE) | kernel() signature + cute.where + unroll=1 |
|
||
| **P1** | ✅ Landed | ~120 (SE) | granularity_bf16 + deinterleave + warmup |
|
||
| **P2** | ✅ Landed | ~244 (gsa fills) | Remove per-call fill_() |
|
||
| **P3** | ✅ Landed | ~732 (RoPE) | Raw CUDA kernel, cos=1.000000 |
|
||
| **Total** | | **~1336 launches/token** | |
|
||
|
||
**Single-shot E2E verification:**
|
||
- Model generates ". The capital of France is . capital izing ized..." (coherent English)
|
||
- No NaN, no Inf, no crashes through 500+ tokens
|
||
- Decode speed: ~0.53-0.56s/token
|
||
- Repetition loop on capital/ized variants is a known residual growth issue (not a kernel bug)
|
||
|
||
---
|
||
|
||
# PART 2 — KV CACHE: WHAT'S ALREADY FP4-COMPATIBLE, WHAT ISN'T
|
||
|
||
**Current state:** ALL KV cache tensors are BF16. No FP4, no FP8.
|
||
|
||
| Stream | Stored as | Width | At 1M ctx | Quantizable? |
|
||
|---|---|---|---|---|
|
||
| **SWA** | `torch.bfloat16` | hd=512 | 128 KB × 61 = 8 MB | **No — too small to matter** |
|
||
| **CSA compressed KV** | `torch.bfloat16` | hd=512 | ~7.5 GB | **Yes — FP4 strongly indicated** |
|
||
| **HCA compressed KV** | `torch.bfloat16` | hd=512 | ~240 MB | **Yes — FP4 indicated** |
|
||
| **CSA indexer keys** | `torch.bfloat16` | c_I=128 | ~2 GB | **Yes — FP4 paper-specified §5.2.1** |
|
||
| **Gather buffer** | `torch.bfloat16` | hd=512 | transient | Will match compressed KV dtype |
|
||
|
||
Total BF16 at 1M context: ~10 GB on 8×B200. Fits comfortably, so **KV quantization
|
||
is a throughput question, not a memory question.**
|
||
|
||
## Why FP4 storage is the right answer for the compressed streams
|
||
|
||
Three reasons, in priority order:
|
||
|
||
1. **Paper-aligned.** §5.2.1 explicitly specifies the indexer QK path
|
||
runs entirely in FP4. The main compressed KV cache being FP4 is
|
||
consistent with the rest of the NVFP4 model — the cache is, after all,
|
||
just stored projections of NVFP4 weights × BF16 hidden states.
|
||
|
||
2. **Bandwidth.** Decode is KV-read-bound at long context. Reading
|
||
FP4 instead of BF16 quarters the bytes-per-token loaded by FMHA.
|
||
At top_k=1024, hd=512, 30 CSA layers: that's `30 × 1024 × 512 × 1.5 bytes
|
||
saved = 23 MB/token saved`. Across batch=8 and millions of decode
|
||
steps, real money.
|
||
|
||
3. **Kernel-native on Blackwell.** Loading FP4 → tcgen05.mma is a
|
||
first-class path with TMA + UMMA + the `mxf4nvf4` MMA kind. The
|
||
in-kernel dequant happens for free during the MMA. **The infrastructure
|
||
exists in the production FMHA kernel already** (per the
|
||
`epilogue_op` work and the `ENABLE_FP4_EPILOGUE` template param).
|
||
|
||
## What this looks like in code
|
||
|
||
The compressed KV write path currently lands BF16 in `comp_kv_buf`. The
|
||
production sequence should be:
|
||
|
||
1. Compressor produces BF16 output (still — the softmax compression needs
|
||
accumulation precision).
|
||
2. Quantize-to-NVFP4 in the same kernel as the compression (epilogue
|
||
fusion), using the **same NVFP4 quant primitives the linears already
|
||
use** (`quantize_nvfp4_gpu_fused`).
|
||
3. Store FP4 + per-block E4M3 scales in `comp_kv_buf` (which becomes a
|
||
FP4 buffer + scale buffer pair).
|
||
4. FMHA reads FP4, dequants in-kernel via TMA + tcgen05's native FP4
|
||
path. No `__constant__` LUT needed — the hardware decodes E2M1.
|
||
|
||
For the indexer keys this is the same pattern but the consumer is the
|
||
indexer scoring kernel (the FP32 einsum today, the FP4 tensor-core scorer
|
||
when E7 lands).
|
||
|
||
### Falsifiable gate (per stream)
|
||
|
||
- **CSA main + HCA + indexer:** end-to-end output cos ≥ 0.999 with FP4
|
||
storage vs BF16. KV cache memory at 8K context drops by ~3.5× (8 → 2.3
|
||
GB). FMHA-bound decode latency at 8K context drops measurably.
|
||
- **Recall@k for indexer ≥ 99% vs FP32 oracle** (the bar from the prior
|
||
indexer-fix audit). Critical — FP4 must not corrupt top-k ranking.
|
||
|
||
---
|
||
|
||
# PART 3 — OTHER FUSION WINS, RANKED BY EFFORT/IMPACT
|
||
|
||
## P4 — Fuse RMSNorm into the next NVFP4 quantize
|
||
|
||
Q/KV projection input is RMSNormed; RMSNorm is a separate launch. The
|
||
NVFP4 quantize kernel already does an amax reduction per group — fusing
|
||
RMSNorm (which is *also* an amax-style reduction followed by a scale)
|
||
into the quantizer's input is a natural fit. Saves a launch + a BF16
|
||
materialization of `(T, H)` per RMSNorm site (2 per layer = 122/token).
|
||
|
||
**Effort:** S (kernel-side, but the quantizer already has the right shape).
|
||
**Impact:** Medium. 122 launches/token, ~0.7 ms/token from launch overhead alone.
|
||
|
||
## P5 — Fuse mHC pre_block + RMSNorm into a single op
|
||
|
||
Same logic as P4 but for mHC. `attn_mhc.pre_block(X_l)` → `rmsnorm` is 3
|
||
kernels back-to-back. Fusable. mHC already exposes a `_project_and_rms`
|
||
half per prior audit notes — wire it through both halves of the layer.
|
||
|
||
**Effort:** S. **Impact:** Medium. ~120 launches/token.
|
||
|
||
## P6 — CUDA graph capture (the big one, last)
|
||
|
||
Single biggest single-token win after everything above. Captures the entire
|
||
decode step into a graph; replay eliminates **all** launch overhead.
|
||
Probably worth 2–3× speedup at batch=1.
|
||
|
||
Blockers in v17:
|
||
1. `set_device()` boundaries in the layer pipeline (the `cuda.synchronize()`
|
||
at line 963) — graph capture spans devices via multi-graph or
|
||
per-device sub-graphs. Manageable but not free.
|
||
2. Dynamic shape in `KVCache.add_compressed` — `self.n_comp` grows.
|
||
Fix: capture *one* graph per prefill chunk size, replay per
|
||
decoded token (which has fixed T=1 shape; the growing buffer is
|
||
a write into a pre-allocated tensor, capturable).
|
||
3. Any conditional `if` on tensor data — debug prints, the assertion at
|
||
line 608. Strip from the capture path with a flag.
|
||
|
||
**Effort:** L. **Impact:** Huge (the biggest remaining single win).
|
||
**Sequence:** land after P0/P1/P2/P3 so the captured graph reflects the
|
||
post-fusion structure.
|
||
|
||
---
|
||
|
||
# PART 4 — TURBOQUANT: ARCHITECTURAL VERDICT
|
||
|
||
Reading `turboquant/`: this is an **ICLR 2026 paper implementation** of
|
||
vector-quantization KV compression. Two algorithms:
|
||
- MSE-quantize keys/values via codebook (3 bit by default)
|
||
- Inner-product-aware quantize keys (preserves dot products) via Algorithm 2
|
||
- Per-vector L2-norm preserved separately, plus QJL sign sketch for
|
||
residual recovery
|
||
|
||
Operational shape:
|
||
- Operates on **standard MHA/GQA shape** `(..., n_heads, head_dim)`,
|
||
head_dim typically 128.
|
||
- Requires a `head_dim × head_dim` rotation matrix per layer (precomputed
|
||
from random seed, shared across heads).
|
||
- Has a Triton fused-decode kernel that computes attention scores directly
|
||
from packed codebook indices.
|
||
- vLLM integration via `turboquant/vllm_attn_backend.py`.
|
||
|
||
## Why it doesn't fit DSv4
|
||
|
||
Three structural mismatches, in order of severity:
|
||
|
||
### 1. The DSv4 KV cache is already a learned compression
|
||
|
||
DSv4 doesn't store per-token KV. The CSA compressor's whole job is to
|
||
reduce m=4 tokens into 1 compressed entry via a softmax-weighted mix.
|
||
That entry is what gets cached. TurboQuant quantizes the *post-projection
|
||
per-token KV* of standard attention — exactly the thing DSv4 has
|
||
already replaced with a learned compressor. **You'd be applying a lossy
|
||
compression on top of an already-lossy compression**, which (a) compounds
|
||
loss in an uncontrolled way and (b) attacks the wrong dimension. The
|
||
compressed entries are already 4× (CSA) or 128× (HCA) reduced in the
|
||
sequence dimension; further reducing the *head dimension* via codebook
|
||
gives little additional savings (you're already attending over very few
|
||
entries per query) at high quality cost.
|
||
|
||
### 2. Wrong shape, wrong primitive
|
||
|
||
TurboQuant operates on `(..., n_heads, head_dim=128)` per-token vectors
|
||
and uses a `128×128` random rotation. DSv4's compressed cache is shape
|
||
`(n_comp, head_dim=512)` — no head dimension. The whole "rotate the head
|
||
dim" abstraction needs to be reworked, and once you do, you're writing
|
||
new code that isn't TurboQuant anymore.
|
||
|
||
For the indexer keys, the storage *is* per-block 128-dim, which is closer
|
||
to TurboQuant's natural shape. But the indexer's scoring math is
|
||
`ReLU(q·k) · w_h` summed across heads — TurboQuant's "preserve inner
|
||
products" guarantee from Algorithm 2 doesn't compose with the ReLU
|
||
nonlinearity. The quantization error becomes worst-case at the threshold,
|
||
which is where top-k decisions get made. **Bad fit precisely where it
|
||
matters most.**
|
||
|
||
### 3. NVFP4 hardware exists; TurboQuant is software-only
|
||
|
||
TurboQuant runs as bit-packed uint8 + Triton kernels. It can't use
|
||
tcgen05 FP4 tensor cores because its values aren't FP4 — they're
|
||
codebook *indices*. So you'd be paying CPU/GPU cycles to dequant via
|
||
gathers and per-token rotation matrix-vector multiplies, when the same
|
||
storage cost (4 bits/value) is available natively as FP4 with hardware
|
||
dequant during MMA.
|
||
|
||
The TurboQuant benchmark numbers (+3–5% throughput at 3-bit) are
|
||
real, but they're against `bf16_kv` baselines on architectures that
|
||
don't have FP4 tensor cores. On Blackwell with NVFP4, the comparison
|
||
should be FP4 storage + FP4 MMA — which is strictly better in every
|
||
axis (bandwidth, capacity, dequant cost).
|
||
|
||
## Where TurboQuant *would* help, and the verdict on whether it's worth it
|
||
|
||
The only DSv4 stream where TurboQuant's shape is a natural fit is the
|
||
**SWA branch** — uncompressed per-token KV in the sliding window, 128
|
||
tokens × `n_layers` × `hd=512` = 8 MB at 1M context.
|
||
|
||
**It's 8 MB.** Not worth a new dependency, a paper-grade extra failure
|
||
mode, or the rotation overhead. The SWA branch fits in L2 cache on B200.
|
||
|
||
### Verdict
|
||
|
||
Don't use TurboQuant. The right move for DSv4's KV cache is **FP4 storage
|
||
+ FP4 MMA on the compressed streams**, fully Blackwell-native, paper-
|
||
aligned (§5.2.1), with no codebook lookup overhead. The infrastructure to
|
||
do this is already in your kernel library (the `ENABLE_FP4_EPILOGUE`
|
||
template, the FP4 MMA path).
|
||
|
||
If you want a paper to cite for "what's the state-of-the-art KV
|
||
compression in 2026," TurboQuant is one. If you want the highest-perf
|
||
production-grade DSv4 implementation, native FP4 is the answer.
|
||
|
||
---
|
||
|
||
# PRIORITY ORDER (updated 2026-06-02)
|
||
|
||
| # | Item | Effort | Win | Status |
|
||
|---|---|---|---|---|
|
||
| **P0** | Call `set_fused_swiglu(True)` on all MoEs | XS | ~240 launches/token | ✅ Done |
|
||
| **P1** | Same for shared expert | S | ~120 launches/token | ✅ Done |
|
||
| **P2** | Drop per-call `fill_()` in Nvfp4Linear | S | ~244 launches/token | ✅ Done |
|
||
| **P3** | CUDA RoPE kernel (1 launch vs 5-6) | S | ~732 launches/token | ✅ Done |
|
||
| **KV-1** | FP4 storage for CSA main compressed KV | M | Huge at long context | Next |
|
||
| **KV-2** | FP4 storage for HCA compressed KV | M | Same pattern as KV-1 | After KV-1 |
|
||
| **KV-3** | FP4 storage for indexer keys (pair with E7) | M | Throughput + paper compliance | After KV-2 |
|
||
| **P4** | RMSNorm fused into next quantize | S | 122 launches/token | After KV |
|
||
| **P5** | mHC pre_block + RMSNorm fused | S | ~120 launches/token | After P4 |
|
||
| **P6** | CUDA graph capture | L | **2–3× total** | After everything above |
|
||
|
||
**Part 1 complete.** The NVFP4-everywhere gap for the GEMM+activation+RoPE
|
||
path is closed. The remaining wins are KV-cache dtype (Part 2) and
|
||
higher-order fusion (P4–P6). Land all of those before attempting CUDA
|
||
graphs — the captured graph should reflect the final fused structure, not
|
||
the pre-fusion one.
|
||
|
||
---
|
||
|
||
# DOCTRINE
|
||
|
||
1. **DSL wall → raw CUDA C++, not Python.** Applies to P3/P4/P5 (kernel-
|
||
side fusion work). The fused-SwiGLU kernel already exists as a model
|
||
for what these should look like — it's NVFP4 GEMM + arbitrary-op
|
||
epilogue in registers, fully Blackwell-native. P3's CUDA RoPE kernel
|
||
demonstrates the raw CUDA path works perfectly.
|
||
|
||
2. **Raw CUDA ≠ scalar math.** Applies to KV-1/KV-2/KV-3. The FP4
|
||
storage path on the read side uses `tcgen05.mma`'s native E2M1 decode
|
||
— no scalar dequant, no `__constant__` LUT (which was only needed
|
||
for the indexer scoring CUDA-core path).
|
||
|
||
3. **Print, don't guess.** Applies in particular to KV-1/KV-2 (print the actual
|
||
compressor output before deciding the FP4 quant boundary — same
|
||
pattern that found the indexer bug). Do not assume the compressor
|
||
emits a shape that matches the FP4 quant kernel; print and confirm.
|
||
|
||
4. **Integration over exploration.** Do not write `Nvfp4MoE_v2`. Do not
|
||
write `KVCache_fp4_v2`. Edit the existing classes. KV-1/KV-2 are
|
||
2-tensor type changes plus the kernel-side read path.
|
||
|
||
5. **Falsifiable gates.** Already listed per priority. Meta-gate: after
|
||
P0–P5 land, decode latency at 8K context should be **single-digit
|
||
ms**, not three-digit. If it isn't, something is still on the hot
|
||
path that shouldn't be, and the answer is "profile, don't guess
|
||
next."
|
||
|
||
6. **Don't optimize for problems you don't have.** TurboQuant is the
|
||
cautionary tale here. The KV cache at 1M is 10 GB on 8 × B200 — that
|
||
is not a problem that needs solving with a new dependency. The
|
||
problem is throughput, and the right answer is FP4 storage + FP4 MMA,
|
||
which is hardware-native and doesn't require codebook lookups.
|