nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	9fec7d609e	Fix gsa_buffer shape mismatch for MoE (M>1 rows) compute_amax_gsa returns a scalar, but quantize_from_buffer expects (M,). Broadcast the scalar gsa to (M,) — all rows use the same gsa (global max).	2026-06-01 21:33:59 +00:00
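The shape fix in 9fec7d609e can be pictured with a small pure-Python stand-in. The helper name `broadcast_gsa` is hypothetical and plain lists stand in for GPU tensors. In PyTorch the equivalent would probably be something like `gsa.reshape(1).expand(M)`, but that is an assumption about how the fix was written, not something the commit message confirms.

```python
def broadcast_gsa(gsa, num_rows):
    """Repeat one global scale so that each of the M rows has its own entry.

    Hypothetical helper: the real code works on GPU tensors, not lists.
    """
    if num_rows < 1:
        raise ValueError("num_rows must be positive")
    return [float(gsa)] * num_rows


# A scalar gsa (global max based scale) shared by all 4 rows of a MoE input.
gsa_scalar = 0.125
gsa_per_row = broadcast_gsa(gsa_scalar, 4)
print(gsa_per_row)
```

Every row receives the same value, which matches the commit's note that all rows use the same global-max gsa.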
biondizzle	cacf64232e	CRITICAL FIX: fused_amax_quantize cross-CTA race condition The single-kernel approach used __syncthreads() for cross-CTA amax reduction, but __syncthreads() only syncs within a CTA (same blockIdx). CTA 0 reading s_amax[1] before CTA 1 writes = race condition = garbage gsa. Result: residual |X| exploded to 10^37 by L0. F_attn and F_ffn were 0.0. Fix: Two-kernel approach (correct, zero CPU syncs): Kernel 1: amax_gsa.cu — computes gsa on GPU, returns GPU tensor Kernel 2: quantize_nvfp4_from_buffer — reads gsa from GPU buffer The fused_amax_quantize.cu now exports quantize_nvfp4_from_buffer and deinterleave_quantize_from_buffer (gsa from GPU buffer, not kernel param). Same P0 win: zero .item() syncs. Two kernel launches instead of one, but correctness > shaving one launch.	2026-06-01 21:26:51 +00:00
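The two-kernel structure described in cacf64232e can be sketched as two sequential passes in plain Python. This is only a CPU model: the function names `amax_pass` and `quantize_pass` are hypothetical, the one-element list stands in for the GPU buffer, and the integer rounding grid is a simplification rather than real NVFP4 encoding. The point it illustrates is that the second pass starts only after the first has finished, so the full reduction is visible to every block without any cross-block barrier.

```python
def amax_pass(x, block_size):
    """Pass 1 (models kernel 1): per-block partial maxima, then a final reduce.

    Returns a one-element list standing in for a GPU-resident buffer.
    """
    partials = [
        max(abs(v) for v in x[i:i + block_size])
        for i in range(0, len(x), block_size)
    ]
    return [max(partials)]


def quantize_pass(x, amax_buf, levels=7):
    """Pass 2 (models kernel 2): reads the finished amax from the buffer.

    Simplified integer grid, NOT the real NVFP4 format.
    """
    amax = amax_buf[0]
    scale = amax / levels if amax > 0 else 1.0
    return [round(v / scale) for v in x]


x = [1.0, -2.0, 3.0, 7.0]
amax_buf = amax_pass(x, block_size=2)   # the launch boundary acts as the sync
q = quantize_pass(x, amax_buf)
print(amax_buf, q)
```

Because the scale is passed through a buffer rather than read back on the host, this arrangement keeps the property the commit highlights: no `.item()` style round trip between the two steps.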
biondizzle	c8faf20a99	P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path Fused kernels (zero CPU sync, single kernel launch per projection): - fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync). - fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path. Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu + deinterleave_quantize_nvfp4_cuda (had .item() sync). All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache). Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call, ~500 calls/token). Now compiles once and reuses the cached module. Updated layers: - linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer - moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and non-fused paths) - shared_expert.py: fused for L1 and L2 - quantize.py: All functions use module loader cache - sampler.py: Uses module loader cache - indexer/score_topk.py: Uses module loader cache P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop. 2 kernel launches instead of 2T. No .item() in comp_pos either. P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat. max_comp=32768 per layer (32MB). No more quadratic memory growth. ~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).	2026-06-01 21:05:03 +00:00
biondizzle	360f76b970	Performance audit fixes: eliminate CPU-GPU syncs PERFORMANCE_AUDIT.md validation results: 1. Nvfp4Linear .item() sync (610/step) → FIXED: compute_amax_gsa_gpu kernel 2. MoE .item() sync (183/step) → FIXED: same kernel 3. SharedExpert .item() sync (122/step) → FIXED: same kernel 4. FMHA V clone → FIXED: V=K, transpose creates copy implicitly 5. torch.cuda.synchronize in moe_forward → FIXED: conditional on VERBOSE 6. RoPE 8x duplication → INVALIDATED: necessary for per-GPU HBM access 7. mHC BF16 bmm → INVALIDATED: 28K FLOPs, not a bottleneck 8. Router .float() cast → INVALIDATED: needed for FP32 topk, ~1μs New files: - dsv4/kernels/cuda/amax_gsa.cu: GPU-only amax→gsa kernel - dsv4/ops/quantize.py: compute_amax_gsa_gpu() wrapper Net effect: ~915 fewer CPU-GPU syncs per decode step Remaining syncs: ~10 per layer (quantize kernel parameter) + diagnostics	2026-06-01 20:40:19 +00:00
biondizzle	6e53e3007c	fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow	2026-06-01 04:59:06 +00:00
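The clamp from 6e53e3007c can be sketched as follows. The value 448 is the E4M3 maximum cited in the commit. The 16-element block size and the function name `clamped_block_amax` are assumptions made for this illustration, and lists stand in for tensors.

```python
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value, as cited in the commit
BLOCK = 16        # assumed block size for this sketch


def clamped_block_amax(values, block=BLOCK):
    """Per-block absolute maximum, clamped so it stays representable in E4M3."""
    out = []
    for i in range(0, len(values), block):
        amax = max(abs(v) for v in values[i:i + block])
        out.append(min(amax, E4M3_MAX))
    return out


# Second block holds an outlier that would overflow an E4M3 scale.
acts = [1.0] * 16 + [-1000.0] + [0.0] * 15
print(clamped_block_amax(acts))
```

Without the `min`, the second block's amax of 1000 would exceed the E4M3 range, which is the overflow the commit identifies as the source of NaNs.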
biondizzle	c2e3d15633	NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring - Add quantize_nvfp4.cu: BF16→FP4 GPU kernel (no CPU sync, warp shuffle amax) - Add quantize_nvfp4_gpu() bridge in ops/quantize.py - Fix deinterleave_quantize kernel path (dsv4/ops/kernels → dsv4/kernels/cuda) - Wire GPU quantize into Nvfp4MoE._run_impl(): - L1 input: quantize_nvfp4_gpu (replaces quantize_activation_nvfp4) - Fused SwiGLU L2: deinterleave_quantize_nvfp4_cuda (single kernel) - Non-fused L2: quantize_nvfp4_gpu - Add test_nvfp4_gpu_quantize.py for both kernels	2026-05-25 16:19:07 +00:00
biondizzle	401e24768a	fix: import ceil_div in quantize.py (was NameError at runtime)	2026-05-23 08:40:24 +00:00
biondizzle	3fb3c925af	Restructure: cutedsl/ -> dsv4/ with proper layering - Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py - Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc. - Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda) - Moved PyTorch bridges to dsv4/ops/ - Moved nn.Module layers to dsv4/layers/ - Moved reference implementations to dsv4/reference/ - Moved vendored CUTLASS code to vendored/ - Archived ~190 debug tests to tests/archive/ - Kept ~15 canonical tests in tests/unit/ - Updated all import paths - Added stubs for future components (model/, cache/, loader/) - Updated pyproject.toml: dsv4-inference package name	2026-05-21 17:30:44 +00:00

8 Commits