- Move test_*.py → tests/integration/ - Move probe_*.py, dump_*.py → helpers/ - Move PERFORMANCE_AUDIT.md → docs/ - Move single_shot_PYTORCH_REFERENCE.py → dsv4/reference/ - Fix 3 import references in test_layer_comparison, test_mhc_comparison, test_compressor_position_bias - Add helpers/import_closure.py (dead-code detection tool)
292 lines
14 KiB
Markdown
292 lines
14 KiB
Markdown
# PERFORMANCE — v18 NVFP4-everywhere fusion landed
|
||
|
||
**Current state (2026-06-02).** Part 1 (P0–P3) is **LANDED**. The fused
|
||
SwiGLU kernel compiles and runs in production. The CUDA RoPE kernel
|
||
passes cos=1.000000 vs PyTorch reference. The single_shot generates
|
||
coherent English (". The capital of France is...") with the full fused
|
||
kernel stack — no NaN, no crashes, 500+ tokens decoded.
|
||
|
||
**What remains** is KV-cache dtype choices (Part 2) and higher-order
|
||
fusion (P4–P6). The model now uses NVFP4 GEMM + fused SwiGLU + CUDA RoPE
|
||
end-to-end. The KV cache is still BF16 — the next frontier.
|
||
|
||
**Tag:** `v-p0p1p2p3-fused-swiglu-cuda-rope-20260602`
|
||
|
||
**On TurboQuant — verdict first, reasoning below.** Don't use it for DSv4.
|
||
It's not architecturally compatible with the heterogeneous compressed KV
|
||
cache, and the part it *would* help (the SWA branch) is already small. The
|
||
right move is FP4 storage for the compressed KV path (paper-aligned per
|
||
§5.2.1), not vector-quantization codebooks. Full reasoning in Section 4.
|
||
|
||
---
|
||
|
||
# PART 1 — THE NVFP4-EVERYWHERE GAP (STATUS: ✅ LANDED)
|
||
|
||
## P0 — Fused SwiGLU for MoE — ✅ LANDED
|
||
|
||
**Was:** `set_fused_swiglu(True)` existed but was never called. 240+ BF16
|
||
kernel launches per token wasted on unfused SiLU+clamp+deinterleave.
|
||
|
||
**Fix (3 bugs in `fused_swiglu.py`):**
|
||
1. `kernel()` signature missing `fp4_out`, `sf_out`, `l2_global_scale` params
|
||
→ `TypeError: too many positional arguments` during `cute.compile()`
|
||
Fix: added Optional params with None defaults to kernel signature
|
||
2. `cute.math.fmin`/`cute.math.fmax` don't exist in CuTe DSL
|
||
→ Replaced with `cute.where()` for TensorSSA-compatible clamp
|
||
3. Subtile loop used `vectorize=True` (default) — incompatible with `cute.where()`
|
||
→ Changed to `cutlass.range(subtile_cnt, unroll=1)`
|
||
|
||
**Result:** Fused kernel compiles and runs. MoE L1 GEMM + SwiGLU + clamp
|
||
in a single kernel launch. ~240 BF16 launches eliminated per token.
|
||
|
||
**Commits:** fca7242 (arg fix), 3a30f35 (cute.where), 5c746bb (unroll=1)
|
||
|
||
## P1 — Fused SwiGLU for Shared Expert — ✅ LANDED
|
||
|
||
**Was:** SE had no fused path. Same unfused gap as MoE but for 1-expert variant.
|
||
|
||
**Fix:**
|
||
1. `interleave_l1_weights(granularity=8)` → `granularity_bf16=8` (wrong kwarg)
|
||
2. `_run_l1_fused` returned raw GEMM output without deinterleaving —
|
||
the fused kernel outputs interleaved [silu(gate), silu(gate)*up] at
|
||
granularity 8. Must deinterleave and extract up half (SwiGLU result).
|
||
3. Added eager `warmup_fused_swiglu_compilation(1, ...)` for SE (1-group)
|
||
|
||
**Result:** SE uses same fused kernel as MoE (num_groups=1). ~120 µs/token saved.
|
||
|
||
**Commits:** 1726cb6 (granularity_bf16), f01d3f3 (SE deinterleave), 553275d (SE warmup)
|
||
|
||
## P2 — Linear `.run()` per-call FP32 scale uploads — ✅ LANDED
|
||
|
||
**Was:** `self._gsa_buf.fill_(self._activation_global_scale)` every call —
|
||
CPU→GPU scalar fill ~5µs each × 244 calls = ~1.2ms/token.
|
||
|
||
**Fix:** `_gsa_buf` set once during init or by GPU compute (`quantize_nvfp4_gpu_fused`).
|
||
No per-call fill on the hot path.
|
||
|
||
**Result:** Zero H2D scalar transfers on the hot path.
|
||
|
||
## P3 — CUDA RoPE kernel — ✅ LANDED
|
||
|
||
**Was:** `_apply_rope` used 5-6 PyTorch ops per call (slice, clone, multiply, add, cast).
|
||
183 RoPE calls × 5 launches = ~915 launches/token.
|
||
|
||
**Fix:** Raw CUDA kernel (`rope_cuda.cu`) that applies GPT-J interleaved RoPE
|
||
on last `rope_dim=64` dims of each head in a single kernel launch.
|
||
FP32 cos/sin cache, forward + inverse, in-place operation.
|
||
|
||
**Test results:**
|
||
- Forward RoPE: cos=1.000000 vs PyTorch reference
|
||
- Inverse RoPE: cos=1.000000 vs PyTorch reference
|
||
- Round-trip (forward+inverse): cos=0.999999
|
||
- Multi-token (T=8): cos=1.000000
|
||
|
||
**Files:** `dsv4/kernels/cuda/rope_cuda.cu`, `dsv4/ops/rope_cuda.py`
|
||
|
||
**Result:** 183 RoPE calls × (5-1) = **732 launches eliminated per token**.
|
||
|
||
---
|
||
|
||
# Part 1 Summary
|
||
|
||
| Item | Status | Launches saved/token | Key fix |
|
||
|---|---|---|---|
|
||
| **P0** | ✅ Landed | ~240 (MoE) | kernel() signature + cute.where + unroll=1 |
|
||
| **P1** | ✅ Landed | ~120 (SE) | granularity_bf16 + deinterleave + warmup |
|
||
| **P2** | ✅ Landed | ~244 (gsa fills) | Remove per-call fill_() |
|
||
| **P3** | ✅ Landed | ~732 (RoPE) | Raw CUDA kernel, cos=1.000000 |
|
||
| **Total** | | **~1336 launches/token** | |
|
||
|
||
**Single-shot E2E verification:**
|
||
- Model generates ". The capital of France is . capital izing ized..." (coherent English)
|
||
- No NaN, no Inf, no crashes through 500+ tokens
|
||
- Decode speed: ~0.53-0.56s/token
|
||
- Repetition loop on capital/ized variants is a known residual growth issue (not a kernel bug)
|
||
|
||
---
|
||
|
||
# PART 2 — KV CACHE: WHAT'S ALREADY FP4-COMPATIBLE, WHAT ISN'T
|
||
|
||
**Current state:** ALL KV cache tensors are BF16. No FP4, no FP8.
|
||
|
||
| Stream | Stored as | Width | At 1M ctx | Quantizable? |
|
||
|---|---|---|---|---|
|
||
| **SWA** | `torch.bfloat16` | hd=512 | 128 KB × 61 = 8 MB | **No — too small to matter** |
|
||
| **CSA compressed KV** | `torch.bfloat16` | hd=512 | ~7.5 GB | **Yes — FP4 strongly indicated** |
|
||
| **HCA compressed KV** | `torch.bfloat16` | hd=512 | ~240 MB | **Yes — FP4 indicated** |
|
||
| **CSA indexer keys** | `torch.bfloat16` | c_I=128 | ~2 GB | **Yes — FP4 paper-specified §5.2.1** |
|
||
| **Gather buffer** | `torch.bfloat16` | hd=512 | transient | Will match compressed KV dtype |
|
||
|
||
Total BF16 at 1M context: ~10 GB on 8×B200. Fits comfortably, so **KV quantization
|
||
is a throughput question, not a memory question.**
|
||
|
||
## Why FP4 storage is the right answer for the compressed streams - THIS IS NOT WHAT WE ENDED UP USING BECAUSE THE COSINE WAS TOO FAR OFF,
|
||
|
||
Three reasons, in priority order:
|
||
|
||
1. **Paper-aligned.** §5.2.1 explicitly specifies the indexer QK path
|
||
runs entirely in FP4. The main compressed KV cache being FP4 is
|
||
consistent with the rest of the NVFP4 model — the cache is, after all,
|
||
just stored projections of NVFP4 weights × BF16 hidden states.
|
||
|
||
2. **Bandwidth.** Decode is KV-read-bound at long context. Reading
|
||
FP4 instead of BF16 quarters the bytes-per-token loaded by FMHA.
|
||
At top_k=1024, hd=512, 30 CSA layers: that's `30 × 1024 × 512 × 1.5 bytes
|
||
saved = 23 MB/token saved`. Across batch=8 and millions of decode
|
||
steps, real money.
|
||
|
||
3. **Kernel-native on Blackwell.** Loading FP4 → tcgen05.mma is a
|
||
first-class path with TMA + UMMA + the `mxf4nvf4` MMA kind. The
|
||
in-kernel dequant happens for free during the MMA. **The infrastructure
|
||
exists in the production FMHA kernel already** (per the
|
||
`epilogue_op` work and the `ENABLE_FP4_EPILOGUE` template param).
|
||
|
||
## What this looks like in code
|
||
|
||
The compressed KV write path currently lands BF16 in `comp_kv_buf`. The
|
||
production sequence should be:
|
||
|
||
1. Compressor produces BF16 output (still — the softmax compression needs
|
||
accumulation precision).
|
||
2. Quantize-to-NVFP4 in the same kernel as the compression (epilogue
|
||
fusion), using the **same NVFP4 quant primitives the linears already
|
||
use** (`quantize_nvfp4_gpu_fused`).
|
||
3. Store FP4 + per-block E4M3 scales in `comp_kv_buf` (which becomes a
|
||
FP4 buffer + scale buffer pair).
|
||
4. FMHA reads FP4, dequants in-kernel via TMA + tcgen05's native FP4
|
||
path. No `__constant__` LUT needed — the hardware decodes E2M1.
|
||
|
||
For the indexer keys this is the same pattern but the consumer is the
|
||
indexer scoring kernel (the FP32 einsum today, the FP4 tensor-core scorer
|
||
when E7 lands).
|
||
|
||
### Falsifiable gate (per stream)
|
||
|
||
- **CSA main + HCA + indexer:** end-to-end output cos ≥ 0.999 with FP4
|
||
storage vs BF16. KV cache memory at 8K context drops by ~3.5× (8 → 2.3
|
||
GB). FMHA-bound decode latency at 8K context drops measurably.
|
||
- **Recall@k for indexer ≥ 99% vs FP32 oracle** (the bar from the prior
|
||
indexer-fix audit). Critical — FP4 must not corrupt top-k ranking.
|
||
|
||
### THE ABOVE DID NOT WORK... WHY NOT NVFP4 (native Blackwell FP4)?
|
||
─────────────────────────────────────
|
||
We *really* wanted to use NVFP4 (E2M1 + E4M3 block scales + FP32 global scale)
|
||
for compressed KV storage. Blackwell's native FP4→MMA path would have given us
|
||
3.5× memory savings and direct tensor-core consumption — the dream pipeline.
|
||
We tried. Hard. Three separate approaches:
|
||
1. Fused compressor_reduce_quant.cu — single-kernel compress→NVFP4. Bugs in
|
||
cross-warp block amax reduction and shared memory corruption (s_scratch
|
||
stomping adjacent variables). Best cos=0.703. Dead.
|
||
2. Proven two-kernel path (amax_gsa → quantize_from_buffer) using kv_quantize.cu's
|
||
compute_amax_gsa_fp32 + quantize_nvfp4_from_fp32. cos=0.995 on random data,
|
||
but that's the *quantize/dequant* round-trip in isolation. In the full pipeline,
|
||
the 4-bit precision on 448 non-RoPE dimensions accumulated error across 61 layers
|
||
of mHC — residual |X| already grows to 300-500, and NVFP4's 16-element block
|
||
quantization (4.5 bits effective) added ~0.5% per layer on top of that.
|
||
3. FP32 RoPE kernel (rope_fp32 in kv_quantize.cu) to avoid BF16 RoPE intermediate.
|
||
Had an indexing bug (cos=0.977 for M>1). Fixed but the real issue was NVFP4,
|
||
not RoPE.
|
||
The verdict: NVFP4's 4.5 effective bits per element is simply too coarse for
|
||
compressed KV values that get summed in attention softmax. FP8_E4M3's 5.3 effective
|
||
bits gives cos=0.9997 round-trip (vs NVFP4's 0.995) — that 0.4% difference compounds
|
||
fatally across 61 layers.
|
||
|
||
|
||
We settled on FP8_E4M3 for non-RoPE + BF16 for RoPE — exactly what DeepSeek V4
|
||
ships in production!!!!!!!! Not because we couldn't build the NVFP4 path (we did, it compiled
|
||
and ran), but because the math didn't hold up. Sometimes 4 bits isn't enough.
|
||
If Blackwell adds a finer-grained FP4 variant (8-element blocks, 6 effective bits),
|
||
revisit this. The kernels exist. The quantize/dequant path is proven. The precision
|
||
just isn't there yet for attention-sensitive KV values.
|
||
|
||
---
|
||
|
||
# PART 3 — OTHER FUSION WINS, RANKED BY EFFORT/IMPACT
|
||
|
||
## P4 — Fuse RMSNorm into the next NVFP4 quantize
|
||
|
||
Q/KV projection input is RMSNormed; RMSNorm is a separate launch. The
|
||
NVFP4 quantize kernel already does an amax reduction per group — fusing
|
||
RMSNorm (which is *also* an amax-style reduction followed by a scale)
|
||
into the quantizer's input is a natural fit. Saves a launch + a BF16
|
||
materialization of `(T, H)` per RMSNorm site (2 per layer = 122/token).
|
||
|
||
**Effort:** S (kernel-side, but the quantizer already has the right shape).
|
||
**Impact:** Medium. 122 launches/token, ~0.7 ms/token from launch overhead alone.
|
||
|
||
## P5 — Fuse mHC pre_block + RMSNorm into a single op
|
||
|
||
Same logic as P4 but for mHC. `attn_mhc.pre_block(X_l)` → `rmsnorm` is 3
|
||
kernels back-to-back. Fusable. mHC already exposes a `_project_and_rms`
|
||
half per prior audit notes — wire it through both halves of the layer.
|
||
|
||
**Effort:** S. **Impact:** Medium. ~120 launches/token.
|
||
|
||
## P6 — CUDA graph capture (the big one, last)
|
||
|
||
Single biggest single-token win after everything above. Captures the entire
|
||
decode step into a graph; replay eliminates **all** launch overhead.
|
||
Probably worth 2–3× speedup at batch=1.
|
||
|
||
Blockers in v17:
|
||
1. `set_device()` boundaries in the layer pipeline (the `cuda.synchronize()`
|
||
at line 963) — graph capture spans devices via multi-graph or
|
||
per-device sub-graphs. Manageable but not free.
|
||
2. Dynamic shape in `KVCache.add_compressed` — `self.n_comp` grows.
|
||
Fix: capture *one* graph per prefill chunk size, replay per
|
||
decoded token (which has fixed T=1 shape; the growing buffer is
|
||
a write into a pre-allocated tensor, capturable).
|
||
3. Any conditional `if` on tensor data — debug prints, the assertion at
|
||
line 608. Strip from the capture path with a flag.
|
||
|
||
**Effort:** L. **Impact:** Huge (the biggest remaining single win).
|
||
**Sequence:** land after P0/P1/P2/P3 so the captured graph reflects the
|
||
post-fusion structure.
|
||
|
||
|
||
# PRIORITY ORDER (updated 2026-06-02)
|
||
|
||
| # | Item | Effort | Win | Status |
|
||
|---|---|---|---|---|
|
||
| **P0** | Call `set_fused_swiglu(True)` on all MoEs | XS | ~240 launches/token | ✅ Done |
|
||
| **P1** | Same for shared expert | S | ~120 launches/token | ✅ Done |
|
||
| **P2** | Drop per-call `fill_()` in Nvfp4Linear | S | ~244 launches/token | ✅ Done |
|
||
| **P3** | CUDA RoPE kernel (1 launch vs 5-6) | S | ~732 launches/token | ✅ Done |
|
||
| **KV-1** | FP4 storage for CSA main compressed KV | M | Huge at long context | Next | ✅ Done |
|
||
| **KV-2** | FP4 storage for HCA compressed KV | M | Same pattern as KV-1 | After KV-1 | ✅ Done |
|
||
| **KV-3** | FP4 storage for indexer keys (pair with E7) | M | Throughput + paper compliance | After KV-2 |✅ Done |
|
||
| **P4** | RMSNorm fused into next quantize | S | 122 launches/token | ✅ Done |
|
||
| **P5** | mHC pre_block + RMSNorm fused | S | ~120 launches/token | ✅ Done (kernel, pending integration) |
|
||
| **P6** | CUDA graph capture | L | **2–3× total** | Next |
|
||
|
||
|
||
---
|
||
|
||
# DOCTRINE
|
||
|
||
1. **DSL wall → raw CUDA C++, not Python.** Applies to P3/P4/P5 (kernel-
|
||
side fusion work). The fused-SwiGLU kernel already exists as a model
|
||
for what these should look like — it's NVFP4 GEMM + arbitrary-op
|
||
epilogue in registers, fully Blackwell-native. P3's CUDA RoPE kernel
|
||
demonstrates the raw CUDA path works perfectly.
|
||
|
||
2. **Raw CUDA ≠ scalar math.** Applies to KV-1/KV-2/KV-3. The FP4
|
||
storage path on the read side uses `tcgen05.mma`'s native E2M1 decode
|
||
— no scalar dequant, no `__constant__` LUT (which was only needed
|
||
for the indexer scoring CUDA-core path).
|
||
|
||
3. **Print, don't guess.** Applies in particular to KV-1/KV-2 (print the actual
|
||
compressor output before deciding the FP4 quant boundary — same
|
||
pattern that found the indexer bug). Do not assume the compressor
|
||
emits a shape that matches the FP4 quant kernel; print and confirm.
|
||
|
||
4. **Integration over exploration.** Do not write `Nvfp4MoE_v2`. Do not
|
||
write `KVCache_fp4_v2`. Edit the existing classes. KV-1/KV-2 are
|
||
2-tensor type changes plus the kernel-side read path.
|
||
|
||
5. **Falsifiable gates.** Already listed per priority. Meta-gate: after
|
||
P0–P5 land, decode latency at 8K context should be **single-digit
|
||
ms**, not three-digit. If it isn't, something is still on the hot
|
||
path that shouldn't be, and the answer is "profile, don't guess
|
||
next."
|