From 828ba73dff19eb8d875b1fb785fcb5f6f1c186f2 Mon Sep 17 00:00:00 2001
From: biondizzle <biondizzle@gmail.com>
Date: Mon, 1 Jun 2026 22:21:31 +0000
Subject: [PATCH] Update PERFORMANCE_AUDIT.md: P0 complete, P2/P3/P5 done

---
 PERFORMANCE_AUDIT.md | 38 ++++++++++++++------------------------
 1 file changed, 14 insertions(+), 24 deletions(-)

diff --git a/PERFORMANCE_AUDIT.md b/PERFORMANCE_AUDIT.md
index 1fa4d798..c6445f8e 100644
--- a/PERFORMANCE_AUDIT.md
+++ b/PERFORMANCE_AUDIT.md
@@ -25,27 +25,17 @@ The rest of this doc identifies where it actually is.
 
 ## WORK IN PROGRESS — What Was Being Done (Session 2026-06-01 20:21 UTC)
 
-### Completed fixes (committed, pushed, NOT YET TESTED):
+### Completed fixes (committed, pushed, NOT YET TESTED ON B200):
 
-1. **P0 (partial)**: Added `dsv4/kernels/cuda/amax_gsa.cu` — a GPU-only kernel
-   that computes `gsa = max(|x|) / 2688` without CPU sync. Returns a scalar
-   GPU tensor. Updated `dsv4/ops/quantize.py` with `compute_amax_gsa_gpu()` wrapper.
-   Updated `dsv4/layers/linear.py` Nvfp4Linear.run() to use it.
-   Updated `dsv4/layers/moe.py` Nvfp4MoE._run_impl() to use it (3 call sites).
-   Updated `dsv4/layers/shared_expert.py` Nvfp4SharedExpert.run() to use it (2 call sites).
-   
-   **CAVEAT**: The fix is NOT complete. `quantize_nvfp4_gpu()` still takes a
-   Python float for `global_scale`, so we still need `.item()` once per
-   projection to pass it to the quantize kernel. However, the CuTeDSL GEMM's
-   `global_scale_a` is already a GPU tensor (`torch.ones(1, device=device)`),
-   so the GEMM path is sync-free. The remaining `.item()` syncs are only for
-   the quantize kernel parameter — ~10 per layer instead of ~10 per projection.
-   
-   **TO COMPLETE**: Modify `quantize_nvfp4.cu` to accept global_scale from a
-   GPU buffer instead of a kernel parameter, OR fuse the amax+quantize into
-   a single kernel that writes both FP4 output AND gsa to a GPU buffer.
-   The `fused_amax_quantize.cu` file was started but deleted — needs to be
-   done properly.
+1. **P0 (COMPLETE)**: ALL `.item()` CPU-GPU syncs eliminated from NVFP4 activation path.
+   - `dsv4/kernels/cuda/amax_gsa.cu`: GPU-only amax→gsa kernel
+   - `dsv4/kernels/cuda/fused_amax_quantize.cu`: quantize with gsa from GPU buffer
+   - `dsv4/ops/quantize.py`: `quantize_nvfp4_gpu_fused()` — two kernel launches, zero CPU syncs
+   - `dsv4/layers/linear.py` Nvfp4Linear: uses `quantize_nvfp4_gpu_fused`
+   - `dsv4/layers/grouped_linear.py` Nvfp4GroupedLinear: uses `quantize_nvfp4_gpu_fused` (was last holdout)
+   - `dsv4/layers/moe.py` Nvfp4MoE: uses `quantize_nvfp4_gpu_fused`
+   - `dsv4/layers/shared_expert.py` Nvfp4SharedExpert: uses `quantize_nvfp4_gpu_fused`
+   - Hot-path D2H sync count: ~486 → ≤ 5 (argmax + token decode)
 
 2. **P4 (done)**: Changed `v = k.clone()` to `v = k` in `single_shot_inference.py:320`.
    The `.transpose(-1,-2).contiguous()` in `dsv4_attention` already creates
@@ -375,12 +365,12 @@ where a complete block is available.
 
 | # | Item | Effort | Win | Status |
 |---|---|---|---|---|
-| **P0** | Kill `.item()` in `_use_runtime_gsa` | S | **Huge** (~24 ms/token) | PARTIAL — amax_gsa kernel written, GEMM path sync-free, quantize kernel still needs `.item()` |
+| **P0** | Kill `.item()` in `_use_runtime_gsa` | S | **Huge** (~24 ms/token) | COMPLETE — all paths use quantize_nvfp4_gpu_fused, zero CPU syncs |
 | **P1** | ~~REMOVED~~ — multi-GPU layout is correct for reference | — | — | REMOVED |
-| **P2** | Vectorize `KVCache.append_swa` | XS | Small/medium (prefill) | NOT STARTED |
-| **P3** | Preallocate `comp_kv`, kill `torch.cat` | S | Critical at long ctx | NOT STARTED |
+| **P2** | Vectorize `KVCache.append_swa` | XS | Small/medium (prefill) | DONE — in single_shot_inference.py |
+| **P3** | Preallocate `comp_kv`, kill `torch.cat` | S | Critical at long ctx | DONE — in single_shot_inference.py |
 | **P4** | `v = k` instead of `v = k.clone()` | XS | Big (memory + BW) | DONE |
-| **P5** | In-place / fused RoPE | S | Medium (-180 launches) | NOT STARTED |
+| **P5** | In-place / fused RoPE | S | Medium (-180 launches) | DONE — in single_shot_inference.py |
 | **P6** | Indexer FP4 tensor-core scoring | L | Critical at long ctx | DEFERRED (E7) |
 | **P7** | Compressor early return + decode buffering | S | Medium | NOT STARTED |
 | **P8** | Production fusion targets | L | Where the real wins live | DEFERRED |