nvfp4-megamoe-kernel

Files

biondizzle 360f76b970 Performance audit fixes: eliminate CPU-GPU syncs

PERFORMANCE_AUDIT.md validation results:
  1. Nvfp4Linear .item() sync (610/step) → FIXED: compute_amax_gsa_gpu kernel
  2. MoE .item() sync (183/step) → FIXED: same kernel
  3. SharedExpert .item() sync (122/step) → FIXED: same kernel
  4. FMHA V clone → FIXED: V=K, transpose creates copy implicitly
  5. torch.cuda.synchronize in moe_forward → FIXED: conditional on VERBOSE
  6. RoPE 8x duplication → INVALIDATED: necessary for per-GPU HBM access
  7. mHC BF16 bmm → INVALIDATED: 28K FLOPs, not a bottleneck
  8. Router .float() cast → INVALIDATED: needed for FP32 topk, ~1μs

New files:
  - dsv4/kernels/cuda/amax_gsa.cu: GPU-only amax→gsa kernel
  - dsv4/ops/quantize.py: compute_amax_gsa_gpu() wrapper

Net effect: ~915 fewer CPU-GPU syncs per decode step
Remaining syncs: ~10 per layer (quantize kernel parameter) + diagnostics

2026-06-01 20:40:19 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

custom_ops.py

Stage E: head-packed MQA/GQA, batch dim, custom_op, integration API

2026-05-27 15:15:03 +00:00

gemm_runner.py

fix: import SF_VEC_SIZE from quantize in gemm_runner (was NameError)

2026-06-01 00:04:48 +00:00

layouts.py

Restructure: cutedsl/ -> dsv4/ with proper layering