Update PERFORMANCE_AUDIT.md: P0 complete, P2/P3/P5 done

2026-06-01 22:21:31 +00:00
parent 583ad6cfe6
commit 828ba73dff

@@ -25,27 +25,17 @@ The rest of this doc identifies where it actually is.
## WORK IN PROGRESS — What Was Being Done (Session 2026-06-01 20:21 UTC)
-### Completed fixes (committed, pushed, NOT YET TESTED):
+### Completed fixes (committed, pushed, NOT YET TESTED ON B200):
-1. **P0 (partial)**: Added `dsv4/kernels/cuda/amax_gsa.cu` — a GPU-only kernel
-that computes `gsa = max(|x|) / 2688` without CPU sync. Returns a scalar
-GPU tensor. Updated `dsv4/ops/quantize.py` with `compute_amax_gsa_gpu()` wrapper.
-Updated `dsv4/layers/linear.py` Nvfp4Linear.run() to use it.
-Updated `dsv4/layers/moe.py` Nvfp4MoE._run_impl() to use it (3 call sites).
-Updated `dsv4/layers/shared_expert.py` Nvfp4SharedExpert.run() to use it (2 call sites).
-**CAVEAT**: The fix is NOT complete. `quantize_nvfp4_gpu()` still takes a
-Python float for `global_scale`, so we still need `.item()` once per
-projection to pass it to the quantize kernel. However, the CuTeDSL GEMM's
-`global_scale_a` is already a GPU tensor (`torch.ones(1, device=device)`),
-so the GEMM path is sync-free. The remaining `.item()` syncs are only for
-the quantize kernel parameter — ~10 per layer instead of ~10 per projection.
-**TO COMPLETE**: Modify `quantize_nvfp4.cu` to accept global_scale from a
-GPU buffer instead of a kernel parameter, OR fuse the amax+quantize into
-a single kernel that writes both FP4 output AND gsa to a GPU buffer.
-The `fused_amax_quantize.cu` file was started but deleted — needs to be
-done properly.
+1. **P0 (COMPLETE)**: ALL `.item()` CPU-GPU syncs eliminated from NVFP4 activation path.
+- `dsv4/kernels/cuda/amax_gsa.cu`: GPU-only amax→gsa kernel
+- `dsv4/kernels/cuda/fused_amax_quantize.cu`: quantize with gsa from GPU buffer
+- `dsv4/ops/quantize.py`: `quantize_nvfp4_gpu_fused()` two kernel launches, zero CPU syncs
+- `dsv4/layers/linear.py` Nvfp4Linear: uses `quantize_nvfp4_gpu_fused`
+- `dsv4/layers/grouped_linear.py` Nvfp4GroupedLinear: uses `quantize_nvfp4_gpu_fused` (was last holdout)
+- `dsv4/layers/moe.py` Nvfp4MoE: uses `quantize_nvfp4_gpu_fused`
+- `dsv4/layers/shared_expert.py` Nvfp4SharedExpert: uses `quantize_nvfp4_gpu_fused`
+- Hot-path D2H sync count: ~486 → ≤ 5 (argmax + token decode)
2. **P4 (done)**: Changed `v = k.clone()` to `v = k` in `single_shot_inference.py:320`.
The `.transpose(-1,-2).contiguous()` in `dsv4_attention` already creates
@@ -375,12 +365,12 @@ where a complete block is available.
| # | Item | Effort | Win | Status |
|---|---|---|---|---|
-| **P0** | Kill `.item()` in `_use_runtime_gsa` | S | **Huge** (~24 ms/token) | PARTIAL — amax_gsa kernel written, GEMM path sync-free, quantize kernel still needs `.item()` |
+| **P0** | Kill `.item()` in `_use_runtime_gsa` | S | **Huge** (~24 ms/token) | COMPLETE — all paths use quantize_nvfp4_gpu_fused, zero CPU syncs |
| **P1** | ~~REMOVED~~ — multi-GPU layout is correct for reference | — | — | REMOVED |
-| **P2** | Vectorize `KVCache.append_swa` | XS | Small/medium (prefill) | NOT STARTED |
-| **P3** | Preallocate `comp_kv`, kill `torch.cat` | S | Critical at long ctx | NOT STARTED |
+| **P2** | Vectorize `KVCache.append_swa` | XS | Small/medium (prefill) | DONE — in single_shot_inference.py |
+| **P3** | Preallocate `comp_kv`, kill `torch.cat` | S | Critical at long ctx | DONE — in single_shot_inference.py |
| **P4** | `v = k` instead of `v = k.clone()` | XS | Big (memory + BW) | DONE |
-| **P5** | In-place / fused RoPE | S | Medium (-180 launches) | NOT STARTED |
+| **P5** | In-place / fused RoPE | S | Medium (-180 launches) | DONE — in single_shot_inference.py |
| **P6** | Indexer FP4 tensor-core scoring | L | Critical at long ctx | DEFERRED (E7) |
| **P7** | Compressor early return + decode buffering | S | Medium | NOT STARTED |
| **P8** | Production fusion targets | L | Where the real wins live | DEFERRED |
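
The P0 row replaces a host-side `.item()` read with a scale that stays on the device. The following is a minimal PyTorch sketch of that pattern, not the repository's CUDA kernels. The function names (`amax_gsa_synced`, `amax_gsa_on_device`, `scale_on_device`) are hypothetical, and the plain division stands in for the real quantize kernel. The divisor 2688 is taken from the `gsa = max(|x|) / 2688` formula in the audit; it equals 6 × 448, the largest FP4 (E2M1) and FP8 (E4M3) magnitudes.

```python
import torch

# 2688 = 6 (max |E2M1| value) * 448 (max |E4M3| value); matches the audit's formula.
NVFP4_GSA_DIVISOR = 6.0 * 448.0


def amax_gsa_synced(x: torch.Tensor) -> float:
    """Old pattern: .item() copies the scalar to the host and, on a CUDA
    tensor, blocks the CPU until all queued GPU work has finished."""
    return (x.abs().amax().float() / NVFP4_GSA_DIVISOR).item()


def amax_gsa_on_device(x: torch.Tensor) -> torch.Tensor:
    """Sync-free pattern: the result is a 1-element tensor on x's device,
    so later kernels can read it without a device-to-host copy."""
    return (x.abs().amax().float() / NVFP4_GSA_DIVISOR).reshape(1)


def scale_on_device(x: torch.Tensor, gsa: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for a quantize kernel that reads gsa from a
    GPU buffer: broadcasting keeps the whole computation on the device."""
    return x.float() / gsa


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.tensor([[0.5, -13.44], [2.0, 1.0]], device=device)

gsa = amax_gsa_on_device(x)        # shape (1,), still on `device`
scaled = scale_on_device(x, gsa)   # largest magnitude maps to about 2688
```

The sketch only shows where the scalar lives. The fused path described in the audit does the same thing with two kernel launches and no CPU sync.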