From 828ba73dff19eb8d875b1fb785fcb5f6f1c186f2 Mon Sep 17 00:00:00 2001
From: biondizzle
Date: Mon, 1 Jun 2026 22:21:31 +0000
Subject: [PATCH] Update PERFORMANCE_AUDIT.md: P0 complete, P2/P3/P5 done

---
 PERFORMANCE_AUDIT.md | 38 ++++++++++++++------------------------
 1 file changed, 14 insertions(+), 24 deletions(-)

diff --git a/PERFORMANCE_AUDIT.md b/PERFORMANCE_AUDIT.md
index 1fa4d798..c6445f8e 100644
--- a/PERFORMANCE_AUDIT.md
+++ b/PERFORMANCE_AUDIT.md
@@ -25,27 +25,17 @@ The rest of this doc identifies where it actually is.
 
 ## WORK IN PROGRESS — What Was Being Done (Session 2026-06-01 20:21 UTC)
 
-### Completed fixes (committed, pushed, NOT YET TESTED):
+### Completed fixes (committed, pushed, NOT YET TESTED ON B200):
 
-1. **P0 (partial)**: Added `dsv4/kernels/cuda/amax_gsa.cu` — a GPU-only kernel
-   that computes `gsa = max(|x|) / 2688` without CPU sync. Returns a scalar
-   GPU tensor. Updated `dsv4/ops/quantize.py` with `compute_amax_gsa_gpu()` wrapper.
-   Updated `dsv4/layers/linear.py` Nvfp4Linear.run() to use it.
-   Updated `dsv4/layers/moe.py` Nvfp4MoE._run_impl() to use it (3 call sites).
-   Updated `dsv4/layers/shared_expert.py` Nvfp4SharedExpert.run() to use it (2 call sites).
-
-   **CAVEAT**: The fix is NOT complete. `quantize_nvfp4_gpu()` still takes a
-   Python float for `global_scale`, so we still need `.item()` once per
-   projection to pass it to the quantize kernel. However, the CuTeDSL GEMM's
-   `global_scale_a` is already a GPU tensor (`torch.ones(1, device=device)`),
-   so the GEMM path is sync-free. The remaining `.item()` syncs are only for
-   the quantize kernel parameter — ~10 per layer instead of ~10 per projection.
-
-   **TO COMPLETE**: Modify `quantize_nvfp4.cu` to accept global_scale from a
-   GPU buffer instead of a kernel parameter, OR fuse the amax+quantize into
-   a single kernel that writes both FP4 output AND gsa to a GPU buffer.
-   The `fused_amax_quantize.cu` file was started but deleted — needs to be
-   done properly.
+1. **P0 (COMPLETE)**: ALL `.item()` CPU-GPU syncs eliminated from NVFP4 activation path.
+   - `dsv4/kernels/cuda/amax_gsa.cu`: GPU-only amax→gsa kernel
+   - `dsv4/kernels/cuda/fused_amax_quantize.cu`: quantize with gsa from GPU buffer
+   - `dsv4/ops/quantize.py`: `quantize_nvfp4_gpu_fused()` — two kernel launches, zero CPU syncs
+   - `dsv4/layers/linear.py` Nvfp4Linear: uses `quantize_nvfp4_gpu_fused`
+   - `dsv4/layers/grouped_linear.py` Nvfp4GroupedLinear: uses `quantize_nvfp4_gpu_fused` (was last holdout)
+   - `dsv4/layers/moe.py` Nvfp4MoE: uses `quantize_nvfp4_gpu_fused`
+   - `dsv4/layers/shared_expert.py` Nvfp4SharedExpert: uses `quantize_nvfp4_gpu_fused`
+   - Hot-path D2H sync count: ~486 → ≤ 5 (argmax + token decode)
 
 2. **P4 (done)**: Changed `v = k.clone()` to `v = k` in `single_shot_inference.py:320`.
    The `.transpose(-1,-2).contiguous()` in `dsv4_attention` already creates
@@ -375,12 +365,12 @@ where a complete block is available.
 
 | # | Item | Effort | Win | Status |
 |---|---|---|---|---|
-| **P0** | Kill `.item()` in `_use_runtime_gsa` | S | **Huge** (~24 ms/token) | PARTIAL — amax_gsa kernel written, GEMM path sync-free, quantize kernel still needs `.item()` |
+| **P0** | Kill `.item()` in `_use_runtime_gsa` | S | **Huge** (~24 ms/token) | COMPLETE — all paths use quantize_nvfp4_gpu_fused, zero CPU syncs |
 | **P1** | ~~REMOVED~~ — multi-GPU layout is correct for reference | — | — | REMOVED |
-| **P2** | Vectorize `KVCache.append_swa` | XS | Small/medium (prefill) | NOT STARTED |
-| **P3** | Preallocate `comp_kv`, kill `torch.cat` | S | Critical at long ctx | NOT STARTED |
+| **P2** | Vectorize `KVCache.append_swa` | XS | Small/medium (prefill) | DONE — in single_shot_inference.py |
+| **P3** | Preallocate `comp_kv`, kill `torch.cat` | S | Critical at long ctx | DONE — in single_shot_inference.py |
 | **P4** | `v = k` instead of `v = k.clone()` | XS | Big (memory + BW) | DONE |
-| **P5** | In-place / fused RoPE | S | Medium (-180 launches) | NOT STARTED |
+| **P5** | In-place / fused RoPE | S | Medium (-180 launches) | DONE — in single_shot_inference.py |
 | **P6** | Indexer FP4 tensor-core scoring | L | Critical at long ctx | DEFERRED (E7) |
 | **P7** | Compressor early return + decode buffering | S | Medium | NOT STARTED |
 | **P8** | Production fusion targets | L | Where the real wins live | DEFERRED |
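
Review note: the P0 entry is still marked "NOT YET TESTED ON B200", so a small CPU reference for the `gsa = max(|x|) / 2688` formula may help when checking the kernel output. The sketch below is not part of this patch. The names `reference_gsa` and `reference_block_scales` are hypothetical. The 16-element block size and the per-block scale formula `block_amax / (6 * gsa)` are assumptions based on the usual NVFP4 two-level scaling scheme, not taken from `fused_amax_quantize.cu`. The constant 2688 is read here as 6 (FP4 E2M1 max) times 448 (FP8 E4M3 max).

```python
# CPU reference for the amax -> gsa formula quoted in the audit.
# Hypothetical helper names; block size and block-scale formula are assumptions.

FP4_E2M1_MAX = 6.0      # largest finite FP4 (E2M1) magnitude
FP8_E4M3_MAX = 448.0    # largest finite FP8 (E4M3) magnitude
GSA_DIVISOR = FP4_E2M1_MAX * FP8_E4M3_MAX  # 2688.0, the constant in the audit


def reference_gsa(values):
    """Return max(|x|) / 2688 for a flat sequence of floats (0.0 if empty)."""
    amax = max((abs(v) for v in values), default=0.0)
    return amax / GSA_DIVISOR


def reference_block_scales(values, gsa, block_size=16):
    """Per-block scales under the assumed formula block_amax / (6 * gsa).

    With gsa computed by reference_gsa, the block holding the global amax
    gets a scale of about 448 (up to rounding), the E4M3 maximum.
    """
    scales = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        block_amax = max(abs(v) for v in block)
        # An all-zero input gives gsa == 0; emit 0.0 instead of dividing by zero.
        scales.append(block_amax / (FP4_E2M1_MAX * gsa) if gsa > 0.0 else 0.0)
    return scales
```

A possible use is to copy the scalar produced by `amax_gsa.cu` back to the host in a test (never in the hot path) and compare it against `reference_gsa` on the same activations within a small tolerance.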