biondizzle
360f76b970
Performance audit fixes: eliminate CPU-GPU syncs

PERFORMANCE_AUDIT.md validation results:
1. Nvfp4Linear .item() sync (610/step) → FIXED: compute_amax_gsa_gpu kernel
2. MoE .item() sync (183/step) → FIXED: same kernel
3. SharedExpert .item() sync (122/step) → FIXED: same kernel
4. FMHA V clone → FIXED: V=K, and the transpose already creates a copy, so the explicit clone is redundant
5. torch.cuda.synchronize in moe_forward → FIXED: conditional on VERBOSE
6. RoPE 8x duplication → INVALIDATED: necessary for per-GPU HBM access
7. mHC BF16 bmm → INVALIDATED: 28K FLOPs, not a bottleneck
8. Router .float() cast → INVALIDATED: needed for FP32 topk, ~1μs
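
Item 5 can be sketched as below. This is a minimal illustration, not the code in moe_forward: the `VERBOSE` environment lookup and the `maybe_synchronize` helper are hypothetical, and `sync_fn` stands in for `torch.cuda.synchronize` so the snippet runs without a GPU.

```python
import os

# Hypothetical flag lookup; the commit only says the sync is "conditional on VERBOSE".
VERBOSE = os.environ.get("VERBOSE", "0") == "1"


def maybe_synchronize(sync_fn, verbose=VERBOSE):
    """Call sync_fn only when verbose diagnostics are requested.

    In the real code path sync_fn would presumably be torch.cuda.synchronize;
    here it is any zero-argument callable. Returns True if the sync ran.
    """
    if verbose:
        sync_fn()
        return True
    return False


if __name__ == "__main__":
    calls = []
    maybe_synchronize(lambda: calls.append("sync"), verbose=False)
    maybe_synchronize(lambda: calls.append("sync"), verbose=True)
    print(len(calls))  # the sync ran only for the verbose call
```

Gating the call this way keeps the blocking sync available for timing and debugging output while leaving the default decode path asynchronous.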

New files:
- dsv4/kernels/cuda/amax_gsa.cu: GPU-only amax→gsa kernel
- dsv4/ops/quantize.py: compute_amax_gsa_gpu() wrapper
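
For orientation, a CPU reference of what an amax→gsa computation might look like. The commit does not define gsa; this sketch assumes it is the NVFP4 global activation scale computed as (FP8 E4M3 max × FP4 E2M1 max) / amax, which is a common convention but may not match amax_gsa.cu. The function name, constants, and `eps` guard are illustrative only.

```python
# Assumed NVFP4 convention; not confirmed by the commit.
FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
FP4_E2M1_MAX = 6.0    # largest finite FP4 E2M1 value


def compute_amax_gsa_reference(values, eps=1e-12):
    """Return (amax, gsa) for an iterable of floats.

    amax is the maximum absolute value; gsa is the assumed global scale.
    eps guards against division by zero for an all-zero input.
    """
    amax = max((abs(v) for v in values), default=0.0)
    gsa = (FP8_E4M3_MAX * FP4_E2M1_MAX) / max(amax, eps)
    return amax, gsa


if __name__ == "__main__":
    print(compute_amax_gsa_reference([-3.0, 1.5, 2.0]))  # (3.0, 896.0)
```

The point of the GPU-side kernel is that both values stay in device memory, so the result is never pulled to the host with `.item()`; a reference like this is only useful for checking the kernel's output in a test.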

Net effect: ~915 fewer CPU-GPU syncs per decode step
Remaining syncs: ~10 per layer (quantize kernel parameter) + diagnostics
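
The ~915 figure is the sum of the per-step counts in items 1 to 3; a quick check, using only the numbers stated above:

```python
# Per-step .item() sync counts taken from items 1-3 of the audit results.
syncs_removed = {
    "Nvfp4Linear": 610,
    "MoE": 183,
    "SharedExpert": 122,
}

total_removed = sum(syncs_removed.values())
print(total_removed)  # 915
```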
2026-06-01 20:40:19 +00:00