nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	676a0448c0	CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption The L1 GEMM produces gate+up combined output with 2intermediate_size BF16 columns, but _l1_out_buf was only allocated with intermediate_size columns. The GEMM wrote past the buffer boundary, corrupting GPU memory and causing cudaErrorInvalidValue on subsequent operations. This was the root cause of ALL the cudaErrorInvalidValue errors in the shared expert and MoE L2 paths — the corrupted memory from the L1 buffer overflow propagated downstream. Fix: _l1_out_buf shape (max_rows, 2intermediate_size) instead of (max_rows, intermediate_size). Applied to both shared_expert.py and moe.py. Also removed all DEBUG sync/print statements from quantize.py and shared_expert.py — the bug was not in the quantize kernels, it was the buffer overflow.	2026-06-04 02:06:18 +00:00
biondizzle	e77455c3ba	DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors	2026-06-04 01:05:47 +00:00
biondizzle	188ecae47f	CUDA graph: Eliminate per-step allocations in graph-captured code paths - gemm_runner.py: Add out= parameter to run_nvfp4_grouped_gemm and run_fused_swiglu_grouped_gemm to accept pre-allocated output buffers - quantize.py: Replace torch.zeros_like/torch.zeros with scalar 0.0 in torch.where() calls (graph-capturable, no memory allocation) - Both fixes prevent 'Disallowed operation during CUDA stream capture' errors during graph capture	2026-06-03 21:30:24 +00:00
biondizzle	80bb27f5bf	CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode The stride-0 expand view for gsa_gpu caused illegal memory access in quantize_nvfp4_from_buffer kernel. The CUDA kernel may not handle stride-0 tensors correctly. Fix: - M=1 decode (graph-captured): just reshape scalar to (1,) — no alloc - M>1 prefill (not graph-captured): expand + contiguous — allocation OK	2026-06-03 18:08:18 +00:00
biondizzle	f13a81d48b	CUDA graph: Fix per-call allocations in grouped_linear and quantize 1. grouped_linear.py: Pre-allocate _scale_a_buf for swizzle - Same fix as linear.py — avoids torch.zeros per call - Uses correctly-sized view for pad_and_swizzle_single 2. quantize.py: Replace torch.zeros_like with scalar 0.0 - torch.zeros_like allocates a full tensor every call - torch.where(cond, 0.0, x) broadcasts scalar — no allocation	2026-06-03 17:39:20 +00:00
biondizzle	a9ea30353c	CUDA graph: Fix sync violations (Category 1-2) 1. mhc.py: Remove .item() from post_block (122 syncs/step eliminated) - The X_next.abs().max().item() was syncing EVERY layer's post_block - Diagnostics moved to caller (outside graph region) 2. linear.py: Pre-allocate _scale_a_buf in _ensure_buffer_size - _assemble_scales_single_group now uses pre-allocated buffer - Eliminates per-call torch.zeros() allocation (graph capture killer) 3. shared_expert.py: Same fix — use pre-allocated padded_x_sf_buf - _assemble_scales_single_group no longer allocates 4. quantize.py: Remove .contiguous() from gsa expand - expand() creates stride-0 view, CUDA kernel reads correctly - No allocation on the hot path 5. Add CUDA_GRAPH_SYNC_INVENTORY.md with full violation catalog	2026-06-03 16:37:20 +00:00
biondizzle	5e09be08af	Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill) The intermediate tensor from fused SwiGLU deinterleave is a column slice (non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and the CUDA kernel crashes with 'input must be contiguous'. Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused and in SharedExpert._run_l2. This is the root cause, not a workaround — CUDA kernels legitimately require contiguous memory.	2026-06-03 07:56:19 +00:00
biondizzle	c926c4a597	P5: Fix mhc_rmsnorm_quantize_nvfp4 — add proper function definition	2026-06-02 17:57:33 +00:00
biondizzle	bdf0b15d45	P4: Fix rmsnorm_quantize_nvfp4 returns QuantizedActivation not tuple	2026-06-02 17:43:21 +00:00
biondizzle	454dbdad52	P5: Fused mHC pre_block + RMSNorm + NVFP4 quantize kernel - fused_mhc_rmsnorm_quantize.cu: 2-kernel approach Kernel 1: mhc_rmsnorm_amax_gsa — bmm + RMS + amax → gsa Kernel 2: mhc_rmsnorm_quantize_nvfp4 — bmm + normalize + quantize - Python bridge: mhc_rmsnorm_quantize_nvfp4() in ops/quantize.py - Unit test: test_fused_mhc_rmsnorm_quantize.py (production shapes) - Eliminates ~610 kernel launches per token (122 sites × 5 launches saved)	2026-06-02 16:39:42 +00:00
biondizzle	0d1cd1e216	P4: Add QuantizedActivation + Nvfp4Linear.run_from_quantized - QuantizedActivation: carries (x_fp4, x_sf, gsa) for skip-quantize path - Nvfp4Linear.run_from_quantized(): runs GEMM with pre-quantized input - Enables fused RMSNorm+quantize to feed directly into all downstream linears (q_a, kv, o_proj, etc.) without re-quantizing	2026-06-02 16:37:38 +00:00
biondizzle	57ab4b9d4c	P4: Fix dequantize_nvfp4 bridge — handle float8_e4m3fn dtype	2026-06-02 16:31:56 +00:00
biondizzle	794ebaf7e5	P4: Fused RMSNorm + NVFP4 quantize kernel (2 launches vs 6+) - fused_rmsnorm_quantize.cu: two-kernel approach Kernel 1: rmsnorm_amax_gsa — compute RMS + amax of normalized output → gsa per row Kernel 2: rmsnorm_quantize_nvfp4 — normalize + quantize using GPU-computed gsa - Python bridge: rmsnorm_quantize_nvfp4() in ops/quantize.py - Python bridge: dequantize_nvfp4() in ops/quantize.py - Unit test: test_fused_rmsnorm_quantize.py (production shapes: 7168 hidden) - Eliminates ~488 kernel launches per token (122 sites × 4 launches saved)	2026-06-02 16:26:24 +00:00
biondizzle	9fec7d609e	Fix gsa_buffer shape mismatch for MoE (M>1 rows) compute_amax_gsa returns a scalar, but quantize_from_buffer expects (M,). Broadcast the scalar gsa to (M,) — all rows use the same gsa (global max).	2026-06-01 21:33:59 +00:00
biondizzle	cacf64232e	CRITICAL FIX: fused_amax_quantize cross-CTA race condition The single-kernel approach used __syncthreads() for cross-CTA amax reduction, but __syncthreads() only syncs within a CTA (same blockIdx). CTA 0 reading s_amax[1] before CTA 1 writes = race condition = garbage gsa. Result: residual \|X\| exploded to 10^37 by L0. F_attn and F_ffn were 0.0. Fix: Two-kernel approach (correct, zero CPU syncs): Kernel 1: amax_gsa.cu — computes gsa on GPU, returns GPU tensor Kernel 2: quantize_nvfp4_from_buffer — reads gsa from GPU buffer The fused_amax_quantize.cu now exports quantize_nvfp4_from_buffer and deinterleave_quantize_from_buffer (gsa from GPU buffer, not kernel param). Same P0 win: zero .item() syncs. Two kernel launches instead of one, but correctness > shaving one launch.	2026-06-01 21:26:51 +00:00
biondizzle	c8faf20a99	P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path Fused kernels (zero CPU sync, single kernel launch per projection): - fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync). - fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path. Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu + deinterleave_quantize_nvfp4_cuda (had .item() sync). All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache). Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call, ~500 calls/token). Now compiles once and reuses the cached module. Updated layers: - linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer - moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and non-fused paths) - shared_expert.py: fused for L1 and L2 - quantize.py: All functions use module loader cache - sampler.py: Uses module loader cache - indexer/score_topk.py: Uses module loader cache P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop. 2 kernel launches instead of 2T. No .item() in comp_pos either. P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat. max_comp=32768 per layer (32MB). No more quadratic memory growth. ~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).	2026-06-01 21:05:03 +00:00
biondizzle	360f76b970	Performance audit fixes: eliminate CPU-GPU syncs PERFORMANCE_AUDIT.md validation results: 1. Nvfp4Linear .item() sync (610/step) → FIXED: compute_amax_gsa_gpu kernel 2. MoE .item() sync (183/step) → FIXED: same kernel 3. SharedExpert .item() sync (122/step) → FIXED: same kernel 4. FMHA V clone → FIXED: V=K, transpose creates copy implicitly 5. torch.cuda.synchronize in moe_forward → FIXED: conditional on VERBOSE 6. RoPE 8x duplication → INVALIDATED: necessary for per-GPU HBM access 7. mHC BF16 bmm → INVALIDATED: 28K FLOPs, not a bottleneck 8. Router .float() cast → INVALIDATED: needed for FP32 topk, ~1μs New files: - dsv4/kernels/cuda/amax_gsa.cu: GPU-only amax→gsa kernel - dsv4/ops/quantize.py: compute_amax_gsa_gpu() wrapper Net effect: ~915 fewer CPU-GPU syncs per decode step Remaining syncs: ~10 per layer (quantize kernel parameter) + diagnostics	2026-06-01 20:40:19 +00:00
biondizzle	6e53e3007c	fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow	2026-06-01 04:59:06 +00:00
biondizzle	c2e3d15633	NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring - Add quantize_nvfp4.cu: BF16→FP4 GPU kernel (no CPU sync, warp shuffle amax) - Add quantize_nvfp4_gpu() bridge in ops/quantize.py - Fix deinterleave_quantize kernel path (dsv4/ops/kernels → dsv4/kernels/cuda) - Wire GPU quantize into Nvfp4MoE._run_impl(): - L1 input: quantize_nvfp4_gpu (replaces quantize_activation_nvfp4) - Fused SwiGLU L2: deinterleave_quantize_nvfp4_cuda (single kernel) - Non-fused L2: quantize_nvfp4_gpu - Add test_nvfp4_gpu_quantize.py for both kernels	2026-05-25 16:19:07 +00:00
biondizzle	401e24768a	fix: import ceil_div in quantize.py (was NameError at runtime)	2026-05-23 08:40:24 +00:00
biondizzle	3fb3c925af	Restructure: cutedsl/ -> dsv4/ with proper layering - Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py - Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc. - Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda) - Moved PyTorch bridges to dsv4/ops/ - Moved nn.Module layers to dsv4layers/ - Moved reference implementations to dsv4/reference/ - Moved vendored CUTLASS code to vendored/ - Archived ~190 debug tests to tests/archive/ - Kept ~15 canonical tests in tests/unit/ - Updated all import paths - Added stubs for future components (model/, cache/, loader/) - Updated pyproject.toml: dsv4-inference package name	2026-05-21 17:30:44 +00:00

21 Commits