Commit Graph

8 Commits

Author SHA1 Message Date
9fec7d609e Fix gsa_buffer shape mismatch for MoE (M>1 rows)
compute_amax_gsa returns a scalar, but quantize_from_buffer expects (M,).
Broadcast the scalar gsa to (M,) — all rows use the same gsa (global max).
2026-06-01 21:33:59 +00:00
cacf64232e CRITICAL FIX: fused_amax_quantize cross-CTA race condition
The single-kernel approach used __syncthreads() for cross-CTA amax
reduction, but __syncthreads() only syncs within a CTA (same blockIdx).
CTA 0 reading s_amax[1] before CTA 1 writes = race condition = garbage gsa.

Result: residual |X| exploded to 10^37 by L0. F_attn and F_ffn were 0.0.

Fix: Two-kernel approach (correct, zero CPU syncs):
  Kernel 1: amax_gsa.cu — computes gsa on GPU, returns GPU tensor
  Kernel 2: quantize_nvfp4_from_buffer — reads gsa from GPU buffer

The fused_amax_quantize.cu now exports quantize_nvfp4_from_buffer and
deinterleave_quantize_from_buffer (gsa from GPU buffer, not kernel param).

Same P0 win: zero .item() syncs. Two kernel launches instead of one,
but correctness > shaving one launch.
2026-06-01 21:26:51 +00:00
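The two-kernel ordering described in cacf64232e can be modelled on the host as two separate passes. This is a minimal pure-Python sketch, not the CUDA code. The function names, the per-slice partial maxima, and the formula gsa = amax / (6 * 448) are assumptions made for illustration (6 is the FP4 E2M1 maximum, 448 the E4M3 maximum); the log only says "amax→gsa" and does not give the formula.

```python
from typing import List

FP4_MAX = 6.0     # largest magnitude representable in FP4 (E2M1)
E4M3_MAX = 448.0  # largest finite E4M3 value, as cited in commit 6e53e3007c


def amax_gsa_pass(x: List[float], gsa_buffer: List[float], block: int = 4) -> None:
    """Pass 1 (stands in for kernel 1): reduce |x| to one scalar and store gsa."""
    # Each slice plays the role of one CTA producing a partial maximum.
    partials = [max(abs(v) for v in x[i:i + block]) for i in range(0, len(x), block)]
    amax = max(partials, default=0.0)
    # ASSUMPTION: gsa = amax / (FP4_MAX * E4M3_MAX); the commit does not state this.
    gsa_buffer[0] = amax / (FP4_MAX * E4M3_MAX)


def scale_from_buffer_pass(x: List[float], gsa_buffer: List[float]) -> List[float]:
    """Pass 2 (stands in for kernel 2): read the finished gsa from the buffer."""
    gsa = gsa_buffer[0]
    if gsa == 0.0:
        return [0.0 for _ in x]
    # Stand-in for quantization: the real kernel would emit FP4 codes here.
    return [v / gsa for v in x]


gsa_buffer = [0.0]                      # models a one-element GPU tensor
x = [1.0, -3.0, 2.0, 0.5, 0.25, -1.5]
amax_gsa_pass(x, gsa_buffer)            # pass 1 completes before pass 2 starts
scaled = scale_from_buffer_pass(x, gsa_buffer)
```

Because the second pass starts only after the first has returned, it always reads the final gsa and never a partial maximum. On the GPU, the boundary between two kernel launches on the same stream provides that ordering, which __syncthreads() cannot do across CTAs.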
c8faf20a99 P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path
Fused kernels (zero CPU sync, single kernel launch per projection):
- fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step
  compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync).
- fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path.
  Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu
  + deinterleave_quantize_nvfp4_cuda (had .item() sync).

All kernel loaders now use dsv4/kernels/cuda/loader.py (compile-once cache).
Previously every call JIT-compiled via torch.utils.cpp_extension.load (~100ms/call,
~500 calls/token). Kernels now compile once and the cached module is reused.
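The compile-once idea can be sketched with a memoized loader. This is an illustration only: `_compile_kernel`, `get_kernel_module`, and `stats` are hypothetical names, and the contents of dsv4/kernels/cuda/loader.py are not shown in this log.

```python
import functools

stats = {"compiles": 0}  # counts how often the expensive step actually runs


def _compile_kernel(source_name: str) -> dict:
    """Hypothetical stand-in for an expensive JIT build of one .cu file."""
    stats["compiles"] += 1
    return {"source": source_name}


@functools.lru_cache(maxsize=None)
def get_kernel_module(source_name: str) -> dict:
    """Build on first request, then return the same cached object."""
    return _compile_kernel(source_name)


# Roughly models ~500 kernel calls for one token: only the first one compiles.
for _ in range(500):
    module = get_kernel_module("amax_gsa.cu")
```

Keying the cache on the source name means each kernel file is built at most once per process, and hot-path callers pay only for a dictionary lookup.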

Updated layers:
- linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer
- moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and
  non-fused paths)
- shared_expert.py: fused for L1 and L2
- quantize.py: All functions use module loader cache
- sampler.py: Uses module loader cache
- indexer/score_topk.py: Uses module loader cache

P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop.
2 kernel launches instead of 2T. No .item() in comp_pos either.

P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat.
max_comp=32768 per layer (32MB). No more quadratic memory growth.
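A host-side sketch of the pre-allocation pattern, assuming a fixed capacity and a fixed row width. The class and method names are hypothetical, and a flat `array` stands in for the GPU tensor. The real comp_kv layout is not described in this log.

```python
from array import array


class PreallocatedRows:
    """Fixed-capacity row buffer: appends write in place, nothing is reallocated."""

    def __init__(self, capacity: int, width: int) -> None:
        self.capacity = capacity
        self.width = width
        self.data = array("f", [0.0]) * (capacity * width)  # allocated once
        self.length = 0  # number of valid rows written so far

    def append(self, row) -> None:
        """Copy one row into the next free slot; cost does not depend on length."""
        if len(row) != self.width:
            raise ValueError("row width mismatch")
        if self.length >= self.capacity:
            raise OverflowError("buffer full")
        start = self.length * self.width
        self.data[start:start + self.width] = array("f", row)
        self.length += 1

    def valid(self):
        """Return only the rows that have been written."""
        return self.data[: self.length * self.width]


buf = PreallocatedRows(capacity=4, width=2)
buf.append([1.0, 2.0])
buf.append([3.0, 4.0])
```

Each append touches only `width` elements. Growing by concatenation would instead copy every existing row on each append.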

~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).
2026-06-01 21:05:03 +00:00
360f76b970 Performance audit fixes: eliminate CPU-GPU syncs
PERFORMANCE_AUDIT.md validation results:
  1. Nvfp4Linear .item() sync (610/step) → FIXED: compute_amax_gsa_gpu kernel
  2. MoE .item() sync (183/step) → FIXED: same kernel
  3. SharedExpert .item() sync (122/step) → FIXED: same kernel
  4. FMHA V clone → FIXED: V=K, transpose creates copy implicitly
  5. torch.cuda.synchronize in moe_forward → FIXED: conditional on VERBOSE
  6. RoPE 8x duplication → INVALIDATED: necessary for per-GPU HBM access
  7. mHC BF16 bmm → INVALIDATED: 28K FLOPs, not a bottleneck
  8. Router .float() cast → INVALIDATED: needed for FP32 topk, ~1μs

New files:
  - dsv4/kernels/cuda/amax_gsa.cu: GPU-only amax→gsa kernel
  - dsv4/ops/quantize.py: compute_amax_gsa_gpu() wrapper

Net effect: ~915 fewer CPU-GPU syncs per decode step
Remaining syncs: ~10 per layer (quantize kernel parameter) + diagnostics
2026-06-01 20:40:19 +00:00
6e53e3007c fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 (prevents NaN from overflow)
2026-06-01 04:59:06 +00:00
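The clamp in 6e53e3007c can be illustrated in plain Python. This sketch assumes 16-element scale blocks and uses a hypothetical helper name; it shows only the clamp on the per-block maximum, not the body of quantize_activation_nvfp4.

```python
from typing import List

E4M3_MAX = 448.0  # largest finite E4M3 value, per the commit message
BLOCK = 16        # ASSUMPTION: 16 elements share one block scale


def clamped_block_amax(x: List[float], block: int = BLOCK) -> List[float]:
    """Return the per-block max |x|, clamped so it stays representable in E4M3."""
    out = []
    for i in range(0, len(x), block):
        amax = max(abs(v) for v in x[i:i + block])
        out.append(min(amax, E4M3_MAX))  # values above 448 would overflow
    return out


# Second block contains an outlier far above the E4M3 range.
values = [0.5] * 16 + [1000.0] + [1.0] * 15
amaxes = clamped_block_amax(values)
```

Without the `min`, the second block's maximum of 1000.0 would exceed the E4M3 range, which is the overflow the commit says produced NaN.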
c2e3d15633 NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring
- Add quantize_nvfp4.cu: BF16→FP4 GPU kernel (no CPU sync, warp shuffle amax)
- Add quantize_nvfp4_gpu() bridge in ops/quantize.py
- Fix deinterleave_quantize kernel path (dsv4/ops/kernels → dsv4/kernels/cuda)
- Wire GPU quantize into Nvfp4MoE._run_impl():
  - L1 input: quantize_nvfp4_gpu (replaces quantize_activation_nvfp4)
  - Fused SwiGLU L2: deinterleave_quantize_nvfp4_cuda (single kernel)
  - Non-fused L2: quantize_nvfp4_gpu
- Add test_nvfp4_gpu_quantize.py for both kernels
2026-05-25 16:19:07 +00:00
401e24768a fix: import ceil_div in quantize.py (was NameError at runtime)
2026-05-23 08:40:24 +00:00
3fb3c925af Restructure: cutedsl/ -> dsv4/ with proper layering
- Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py
- Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc.
- Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda)
- Moved PyTorch bridges to dsv4/ops/
- Moved nn.Module layers to dsv4/layers/
- Moved reference implementations to dsv4/reference/
- Moved vendored CUTLASS code to vendored/
- Archived ~190 debug tests to tests/archive/
- Kept ~15 canonical tests in tests/unit/
- Updated all import paths
- Added stubs for future components (model/, cache/, loader/)
- Updated pyproject.toml: dsv4-inference package name
2026-05-21 17:30:44 +00:00