Files
nvfp4-megamoe-kernel/PERFORMANCE_AUDIT.md
biondizzle 040b2eb6e7 perf: P0/P1/P2 — fused SwiGLU for MoE+SE, eliminate per-call gsa fill
P0: Enable fused SwiGLU for all MoE instances (moe._fused_swiglu = True).
    Eliminates ~8 BF16 kernel launches per MoE per token (gate/up split,
    SiLU, clamp, elementwise multiply → single fused kernel launch).

P1: Enable fused SwiGLU for shared expert (SE):
    - Added set_fused_swiglu() method to Nvfp4SharedExpert
    - Added _run_l1_fused() using run_fused_swiglu_grouped_gemm (1-group)
    - Interleave L1 weights at finalize time for fused kernel compatibility
    - Fused kernel handles SwiGLU + clamp in registers, outputs BF16

P2: Eliminate per-call _gsa_buf.fill_() in Nvfp4Linear:
    - _activation_global_scale is set once at warmup, never changes after
    - Skip redundant fill_() via _gsa_buf_initialized flag
    - Saves 244 CPU→GPU scalar fills per token (4 linears × 61 layers)

P3: Deferred (in-kernel RoPE fusion — kernel-side change, not single_shot)
2026-06-02 06:59:25 +00:00

19 KiB
Raw Blame History

PERFORMANCE — v17 roadmap toward end-to-end NVFP4 hot path

Verified state. v17 has the Tier-1 indexer fixes landed (weight path, buffer width, MQA einsum). Hot-path syncs and allocator churn from earlier perf rounds are gone. The single_shot now genuinely runs through the production NVFP4 kernel stack. What remains is fusion gaps and KV-cache dtype choices — the difference between "uses NVFP4 kernels" and "is NVFP4 end-to-end."

On TurboQuant — verdict first, reasoning below. Don't use it for DSv4. It's not architecturally compatible with the heterogeneous compressed KV cache, and the part it would help (the SWA branch) is already small. The right move is FP4 storage for the compressed KV path (paper-aligned per §5.2.1), not vector-quantization codebooks. Full reasoning in Section 4.


PART 1 — THE NVFP4-EVERYWHERE GAP

P0 — Fused SwiGLU exists in the library and is NEVER ENABLED

This is the biggest single-line perf bug in v17.

dsv4/layers/moe.py:61:

self._fused_swiglu = False  # Set via set_fused_swiglu()

set_fused_swiglu() exists (moe.py:103), warmup_fused_swiglu_compilation exists and is wired into the warmup path, the fused kernel run_fused_swiglu_grouped_gemm is implemented and tested. But searching single_shot_inference.py for set_fused_swiglu returns zero hits.

What this costs every layer, every token:

moe.py:640660 (the unfused branch that runs by default):

l1_out = run_nvfp4_grouped_gemm(...)            # NVFP4 → BF16 GEMM
l1_deil = deinterleave_l1_weights(l1_out...)    # BF16 → BF16 deinterleave (extra launch)
gate = l1_deil[:, :self.intermediate_size]       # BF16 slice
up = l1_deil[:, self.intermediate_size:]         # BF16 slice
gate_silu = F.silu(gate)                         # BF16 SiLU launch
if swiglu_limit:                                  #
    gate_silu = gate_silu.clamp(...)              # BF16 clamp launch
    up = up.clamp(...)                            # BF16 clamp launch
activated = gate_silu * up                        # BF16 elementwise
slot_l2_x_fp4, slot_l2_x_sf, _ = quantize_nvfp4_gpu_fused(activated)  # back to FP4

That's 8 BF16-tensor-resident kernel launches per layer per token, moving 2× intermediate_size × n_active_experts BF16 elements through HBM, between two NVFP4 GEMMs that could have been fused.

What the fused path does (moe.py:617625):

  • Single launch: NVFP4 GEMM + SwiGLU + clamp in kernel registers
  • Output goes directly to FP4 in deinterleave_amax_quantize_nvfp4_fused

For Pro (n_active=6, intermediate=3072), per token, all 30 MoE layers:

  • 30 × 6 × (3072 BF16 = 6 KB) × 2 (R+W) × 8 launches ≈ 3 MB of pointless BF16 HBM traffic per token, plus 240 unfused launches.

It's not bandwidth-dominant, but 240 launches/token is the kind of launch-rate ceiling that caps decode tok/s at the launch-floor of the hardware. B200 launch rate ~12 µs in practice. That's 240480 µs/token of pure launch overhead from this one missing call.

The fix

One line in main(), in the MoE/SE setup loop:

for li in range(n_layers):
    if li in moes:
        moes[li].set_fused_swiglu(True)
        moes[li].set_swiglu_limit(cfg.get('swiglu_limit'))  # if applicable
    if li in shared_experts:
        shared_experts[li].set_fused_swiglu(True)
        shared_experts[li].set_swiglu_limit(cfg.get('swiglu_limit'))

Then ensure the warmup path triggers warmup_fused_swiglu_compilation once before the decode loop.

Falsifiable gate

After enabling: per-MoE-layer launch count drops from ~9 to ~2 (the GEMM

  • the L2 path). Verifiable with Nsight or cudaLaunchKernel counter. Numerical parity: cos ≥ 0.9995 vs unfused, captured before the switch.

P1 — Shared expert has the same fused-path gap

The shared expert (shared_expert.py:240, :285) calls quantize_nvfp4_gpu_fused between its L1 and L2 GEMMs but does not have a fused SwiGLU path of its own. Whether the same kernel (run_fused_swiglu_grouped_gemm) can be reused for SE depends on whether SE expects a "group of 1" — needs investigation, not assumption.

Action (read, don't guess)

Print the shapes and dtypes of SE's L1 GEMM input/output and compare to what run_fused_swiglu_grouped_gemm expects. If they match (modulo groups=1), wire it. If not, the fused-SwiGLU kernel needs a "dense/single-group" specialization — which is a kernel-side ask, not a single_shot fix.

Falsifiable gate

Either SE uses the same fused kernel as MoE (same launch-count savings), or there's a documented .md paper trail explaining why it can't and what the production path is.

P2 — Linear .run() per-call FP32 scale uploads still exist

dsv4/layers/linear.py:188:

gsa = self._gsa_buf.fill_(self._activation_global_scale)

After the earlier P0 fix (_use_runtime_gsa = False), this no longer syncs via .item(). But it still does a CPU→GPU scalar fill per call. For Pro, 4 Nvfp4Linears in attention × 61 layers = 244 fill_() calls per token. At ~5 µs each that's ~1.2 ms/token of CPU→GPU dispatch.

The fix

Make _activation_global_scale a 1-element torch.Tensor on device, set once at warmup. The fill becomes redundant — pass self._gsa_buf directly to the kernel, no per-call fill needed.

# In Nvfp4Linear.__init__:
self._gsa_buf = torch.full((1,), 1.0 / (6.0 * 448.0), dtype=torch.float32, device=device)

# After compute_activation_global_scale (runs once at warmup):
self._gsa_buf.fill_(gs)   # ONE TIME, not per call

# In run():
self.kernel(..., global_scale_a=self._gsa_buf)  # no fill

Falsifiable gate

Zero CPU→GPU scalar fills on the hot path. Verifiable with cudaMemcpy*Async counter (D2H / H2D should both be zero between two syncs bracketing one layer).

P3 — In-kernel RoPE fusion (still on the table, deferred from prior audit)

P5 from the v15 audit: in-place RoPE eliminated the clone problem, but RoPE is still 3 separate launches per attention block × 61 layers ≈ 183 launches per token. Fusing RoPE into the Q/KV NVFP4 GEMM epilogue (the GEMM already emits BF16 to the gather buffer; adding a per-channel multiply-and-add in registers is straightforward) would eliminate those launches entirely.

This is a kernel-side change, not a single_shot fix. Production target, not single_shot target. Track it but don't gate the perf rollup on it.

Falsifiable gate (when kernel work lands)

RoPE launch count: 183/token → 0/token. End-to-end cos ≥ 0.999998 vs unfused.


PART 2 — KV CACHE: WHAT'S ALREADY FP4-COMPATIBLE, WHAT ISN'T

DSv4's three KV streams have very different characteristics. Treating them uniformly is the trap.

Stream Stored width At 1M ctx Per-access pattern Quantizable?
CSA main compressed hd=512 BF16 256 MB × 30 = ~7.5 GB Random access via top-k (~1024 entries / query) Yes — FP4 strongly indicated
CSA indexer keys c_I=128 BF16 64 MB × 30 = ~2 GB Streamed full-cache for top-k scoring Yes — FP4 paper-specified §5.2.1
HCA compressed hd=512 BF16 8 MB × 30 = 240 MB Full sequential read every layer Yes — FP4 indicated
SWA hd=512 BF16 128 KB × 61 = 8 MB Sequential ring buffer, recent 128 tokens No — too small to matter

Total BF16: ~10 GB at 1M context. Per the prior audit rewrite, this fits comfortably on 8×B200. So KV quantization is a throughput question, not a memory question.

Why FP4 storage is the right answer for the compressed streams

Three reasons, in priority order:

  1. Paper-aligned. §5.2.1 explicitly specifies the indexer QK path runs entirely in FP4. The main compressed KV cache being FP4 is consistent with the rest of the NVFP4 model — the cache is, after all, just stored projections of NVFP4 weights × BF16 hidden states.

  2. Bandwidth. Decode is KV-read-bound at long context. Reading FP4 instead of BF16 quarters the bytes-per-token loaded by FMHA. At top_k=1024, hd=512, 30 CSA layers: that's 30 × 1024 × 512 × 1.5 bytes saved = 23 MB/token saved. Across batch=8 and millions of decode steps, real money.

  3. Kernel-native on Blackwell. Loading FP4 → tcgen05.mma is a first-class path with TMA + UMMA + the mxf4nvf4 MMA kind. The in-kernel dequant happens for free during the MMA. The infrastructure exists in the production FMHA kernel already (per the prior epilogue_op work and the ENABLE_FP4_EPILOGUE template param).

What this looks like in code

The compressed KV write path currently lands BF16 in comp_kv_buf. The production sequence should be:

  1. Compressor produces BF16 output (still — the softmax compression needs accumulation precision).
  2. Quantize-to-NVFP4 in the same kernel as the compression (epilogue fusion), using the same NVFP4 quant primitives the linears already use (quantize_nvfp4_gpu_fused).
  3. Store FP4 + per-block E4M3 scales in comp_kv_buf (which becomes a FP4 buffer + scale buffer pair).
  4. FMHA reads FP4, dequants in-kernel via TMA + tcgen05's native FP4 path. No __constant__ LUT needed — the hardware decodes E2M1.

For the indexer keys this is the same pattern but the consumer is the indexer scoring kernel (the FP32 einsum today, the FP4 tensor-core scorer when E7 lands).

Falsifiable gate (per stream)

  • CSA main + HCA + indexer: end-to-end output cos ≥ 0.999 with FP4 storage vs BF16. KV cache memory at 8K context drops by ~3.5× (8 → 2.3 GB). FMHA-bound decode latency at 8K context drops measurably.
  • Recall@k for indexer ≥ 99% vs FP32 oracle (the bar from the prior indexer-fix audit). Critical — FP4 must not corrupt top-k ranking.

PART 3 — OTHER FUSION WINS, RANKED BY EFFORT/IMPACT

P4 — Fuse RMSNorm into the next NVFP4 quantize

Q/KV projection input is RMSNormed; RMSNorm is a separate launch. The NVFP4 quantize kernel already does an amax reduction per group — fusing RMSNorm (which is also an amax-style reduction followed by a scale) into the quantizer's input is a natural fit. Saves a launch + a BF16 materialization of (T, H) per RMSNorm site (2 per layer = 122/token).

Effort: S (kernel-side, but the quantizer already has the right shape). Impact: Medium. 122 launches/token, ~0.7 ms/token from launch overhead alone.

P5 — Fuse mHC pre_block + RMSNorm into a single op

Same logic as P4 but for mHC. attn_mhc.pre_block(X_l)rmsnorm is 3 kernels back-to-back. Fusable. mHC already exposes a _project_and_rms half per prior audit notes — wire it through both halves of the layer.

Effort: S. Impact: Medium. ~120 launches/token.

P6 — CUDA graph capture (the big one, last)

Single biggest single-token win after everything above. Captures the entire decode step into a graph; replay eliminates all launch overhead. Probably worth 23× speedup at batch=1.

Blockers in v17:

  1. set_device() boundaries in the layer pipeline (the cuda.synchronize() at line 963) — graph capture spans devices via multi-graph or per-device sub-graphs. Manageable but not free.
  2. Dynamic shape in KVCache.add_compressedself.n_comp grows. Fix: capture one graph per prefill chunk size, replay per decoded token (which has fixed T=1 shape; the growing buffer is a write into a pre-allocated tensor, capturable).
  3. Any conditional if on tensor data — debug prints, the assertion at line 608. Strip from the capture path with a flag.

Effort: L. Impact: Huge (the biggest remaining single win). Sequence: land after P0/P1/P2/P3 so the captured graph reflects the post-fusion structure.


PART 4 — TURBOQUANT: ARCHITECTURAL VERDICT

Reading turboquant/: this is an ICLR 2026 paper implementation of vector-quantization KV compression. Two algorithms:

  • MSE-quantize keys/values via codebook (3 bit by default)
  • Inner-product-aware quantize keys (preserves dot products) via Algorithm 2
  • Per-vector L2-norm preserved separately, plus QJL sign sketch for residual recovery

Operational shape:

  • Operates on standard MHA/GQA shape (..., n_heads, head_dim), head_dim typically 128.
  • Requires a head_dim × head_dim rotation matrix per layer (precomputed from random seed, shared across heads).
  • Has a Triton fused-decode kernel that computes attention scores directly from packed codebook indices.
  • vLLM integration via turboquant/vllm_attn_backend.py.

Why it doesn't fit DSv4

Three structural mismatches, in order of severity:

1. The DSv4 KV cache is already a learned compression

DSv4 doesn't store per-token KV. The CSA compressor's whole job is to reduce m=4 tokens into 1 compressed entry via a softmax-weighted mix. That entry is what gets cached. TurboQuant quantizes the post-projection per-token KV of standard attention — exactly the thing DSv4 has already replaced with a learned compressor. You'd be applying a lossy compression on top of an already-lossy compression, which (a) compounds loss in an uncontrolled way and (b) attacks the wrong dimension. The compressed entries are already 4× (CSA) or 128× (HCA) reduced in the sequence dimension; further reducing the head dimension via codebook gives little additional savings (you're already attending over very few entries per query) at high quality cost.

2. Wrong shape, wrong primitive

TurboQuant operates on (..., n_heads, head_dim=128) per-token vectors and uses a 128×128 random rotation. DSv4's compressed cache is shape (n_comp, head_dim=512) — no head dimension. The whole "rotate the head dim" abstraction needs to be reworked, and once you do, you're writing new code that isn't TurboQuant anymore.

For the indexer keys, the storage is per-block 128-dim, which is closer to TurboQuant's natural shape. But the indexer's scoring math is ReLU(q·k) · w_h summed across heads — TurboQuant's "preserve inner products" guarantee from Algorithm 2 doesn't compose with the ReLU nonlinearity. The quantization error becomes worst-case at the threshold, which is where top-k decisions get made. Bad fit precisely where it matters most.

3. NVFP4 hardware exists; TurboQuant is software-only

TurboQuant runs as bit-packed uint8 + Triton kernels. It can't use tcgen05 FP4 tensor cores because its values aren't FP4 — they're codebook indices. So you'd be paying CPU/GPU cycles to dequant via gathers and per-token rotation matrix-vector multiplies, when the same storage cost (4 bits/value) is available natively as FP4 with hardware dequant during MMA.

The TurboQuant benchmark numbers (+35% throughput at 3-bit) are real, but they're against bf16_kv baselines on architectures that don't have FP4 tensor cores. On Blackwell with NVFP4, the comparison should be FP4 storage + FP4 MMA — which is strictly better in every axis (bandwidth, capacity, dequant cost).

Where TurboQuant would help, and the verdict on whether it's worth it

The only DSv4 stream where TurboQuant's shape is a natural fit is the SWA branch — uncompressed per-token KV in the sliding window, 128 tokens × n_layers × hd=512 = 8 MB at 1M context.

It's 8 MB. Not worth a new dependency, a paper-grade extra failure mode, or the rotation overhead. The SWA branch fits in L2 cache on B200.

Verdict

Don't use TurboQuant. The right move for DSv4's KV cache is **FP4 storage

  • FP4 MMA on the compressed streams**, fully Blackwell-native, paper- aligned (§5.2.1), with no codebook lookup overhead. The infrastructure to do this is already in your kernel library (the ENABLE_FP4_EPILOGUE template, the FP4 MMA path).

If you want a paper to cite for "what's the state-of-the-art KV compression in 2026," TurboQuant is one. If you want the highest-perf production-grade DSv4 implementation, native FP4 is the answer.


PRIORITY ORDER

# Item Effort Win Type
P0 Call set_fused_swiglu(True) on all MoEs XS 240480 µs/token one-line script fix
P1 Same for shared expert (after print-and-confirm) S ~120 µs/token likely script fix
P2 Drop per-call fill_() in Nvfp4Linear S ~1.2 ms/token library fix
KV-1 FP4 storage for CSA main compressed KV M Huge at long context kernel + script
KV-2 FP4 storage for HCA compressed KV M Same pattern as KV-1 reuses KV-1 work
KV-3 FP4 storage for indexer keys (pair with E7) M Throughput + paper compliance kernel work
P3 RoPE fused into Q/KV GEMM epilogue M 183 launches/token kernel work
P4 RMSNorm fused into next quantize S 122 launches/token kernel work
P5 mHC pre_block + RMSNorm fused S ~120 launches/token kernel work
P6 CUDA graph capture L 23× total after everything above

P0 first. It's a one-line edit that unlocks the fused kernel that already exists. It is the most embarrassingly easy and most embarrassingly overlooked perf bug in v17. The kernel author already did the hard work; the script just isn't asking for it.

After P0/P1/P2 land, the linear hot path is genuinely tight and the remaining wins are kernel-side fusion (P3/P4/P5) and the KV cache dtype question (KV-1/KV-2/KV-3). Land all of those before attempting CUDA graphs — the captured graph should reflect the final fused structure, not the pre-fusion one.


DOCTRINE

  1. DSL wall → raw CUDA C++, not Python. Applies to P3/P4/P5 (kernel- side fusion work). The fused-SwiGLU kernel already exists as a model for what these should look like — it's NVFP4 GEMM + arbitrary-op epilogue in registers, fully Blackwell-native.

  2. Raw CUDA ≠ scalar math. Applies to KV-1/KV-2/KV-3. The FP4 storage path on the read side uses tcgen05.mma's native E2M1 decode — no scalar dequant, no __constant__ LUT (which was only needed for the indexer scoring CUDA-core path).

  3. Print, don't guess. Applies in particular to P1 (verify SE shapes can use the MoE fused kernel) and KV-1/KV-2 (print the actual compressor output before deciding the FP4 quant boundary — same pattern that found the indexer bug). Do not assume the compressor emits a shape that matches the FP4 quant kernel; print and confirm.

  4. Integration over exploration. Do not write Nvfp4MoE_v2. Do not write KVCache_fp4_v2. Edit the existing classes. P0 is one line in main(). KV-1/KV-2 are 2-tensor type changes plus the kernel-side read path.

  5. Falsifiable gates. Already listed per priority. Meta-gate: after P0P5 land, decode latency at 8K context should be single-digit ms, not three-digit. If it isn't, something is still on the hot path that shouldn't be, and the answer is "profile, don't guess next."

  6. Don't optimize for problems you don't have. TurboQuant is the cautionary tale here. The KV cache at 1M is 10 GB on 8 × B200 — that is not a problem that needs solving with a new dependency. The problem is throughput, and the right answer is FP4 storage + FP4 MMA, which is hardware-native and doesn't require codebook lookups.