nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	676a0448c0	CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption The L1 GEMM produces gate+up combined output with 2intermediate_size BF16 columns, but _l1_out_buf was only allocated with intermediate_size columns. The GEMM wrote past the buffer boundary, corrupting GPU memory and causing cudaErrorInvalidValue on subsequent operations. This was the root cause of ALL the cudaErrorInvalidValue errors in the shared expert and MoE L2 paths — the corrupted memory from the L1 buffer overflow propagated downstream. Fix: _l1_out_buf shape (max_rows, 2intermediate_size) instead of (max_rows, intermediate_size). Applied to both shared_expert.py and moe.py. Also removed all DEBUG sync/print statements from quantize.py and shared_expert.py — the bug was not in the quantize kernels, it was the buffer overflow.	2026-06-04 02:06:18 +00:00
biondizzle	e77455c3ba	DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors	2026-06-04 01:05:47 +00:00
biondizzle	676fad064f	Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature	2026-06-03 21:45:15 +00:00
biondizzle	188ecae47f	CUDA graph: Eliminate per-step allocations in graph-captured code paths - gemm_runner.py: Add out= parameter to run_nvfp4_grouped_gemm and run_fused_swiglu_grouped_gemm to accept pre-allocated output buffers - quantize.py: Replace torch.zeros_like/torch.zeros with scalar 0.0 in torch.where() calls (graph-capturable, no memory allocation) - Both fixes prevent 'Disallowed operation during CUDA stream capture' errors during graph capture	2026-06-03 21:30:24 +00:00
biondizzle	91c370360a	Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3) Patch torch.cuda.current_device to return the tensor's device index during from_dlpack calls inside CUDA graph capture. This bypasses the device check in __dlpack__ without changing the CUDA stream (which caused 'Capture must end on the same stream' in v1) and without triggering a cross-device copy (which caused 'Cannot copy between CPU and CUDA tensors' in v2).	2026-06-03 21:09:12 +00:00
biondizzle	5c94dbbc37	Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v2) Previous fix (set_device) caused 'Capture must end on the same stream'. New fix: wrap tensor in _DLPatchTensor during graph capture, which forces dl_device in __dlpack__ to bypass the device check without changing the stream. This enables CUDA graph capture on all 8 GPUs, not just cuda:0.	2026-06-03 20:54:18 +00:00
biondizzle	87b6c9932b	Fix CuTeDSL from_dlpack device mismatch inside CUDA graph capture When capturing CUDA graphs on non-default GPUs, torch.cuda.current_device() may not match the tensor's device. from_dlpack() checks this and fails. Fix: set the current device to match the tensor's device before from_dlpack. This enables graph capture on all 8 GPUs, not just cuda:0.	2026-06-03 20:34:24 +00:00
biondizzle	80bb27f5bf	CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode The stride-0 expand view for gsa_gpu caused illegal memory access in quantize_nvfp4_from_buffer kernel. The CUDA kernel may not handle stride-0 tensors correctly. Fix: - M=1 decode (graph-captured): just reshape scalar to (1,) — no alloc - M>1 prefill (not graph-captured): expand + contiguous — allocation OK	2026-06-03 18:08:18 +00:00
biondizzle	f13a81d48b	CUDA graph: Fix per-call allocations in grouped_linear and quantize 1. grouped_linear.py: Pre-allocate _scale_a_buf for swizzle - Same fix as linear.py — avoids torch.zeros per call - Uses correctly-sized view for pad_and_swizzle_single 2. quantize.py: Replace torch.zeros_like with scalar 0.0 - torch.zeros_like allocates a full tensor every call - torch.where(cond, 0.0, x) broadcasts scalar — no allocation	2026-06-03 17:39:20 +00:00
biondizzle	a9ea30353c	CUDA graph: Fix sync violations (Category 1-2) 1. mhc.py: Remove .item() from post_block (122 syncs/step eliminated) - The X_next.abs().max().item() was syncing EVERY layer's post_block - Diagnostics moved to caller (outside graph region) 2. linear.py: Pre-allocate _scale_a_buf in _ensure_buffer_size - _assemble_scales_single_group now uses pre-allocated buffer - Eliminates per-call torch.zeros() allocation (graph capture killer) 3. shared_expert.py: Same fix — use pre-allocated padded_x_sf_buf - _assemble_scales_single_group no longer allocates 4. quantize.py: Remove .contiguous() from gsa expand - expand() creates stride-0 view, CUDA kernel reads correctly - No allocation on the hot path 5. Add CUDA_GRAPH_SYNC_INVENTORY.md with full violation catalog	2026-06-03 16:37:20 +00:00
biondizzle	5e09be08af	Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill) The intermediate tensor from fused SwiGLU deinterleave is a column slice (non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and the CUDA kernel crashes with 'input must be contiguous'. Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused and in SharedExpert._run_l2. This is the root cause, not a workaround — CUDA kernels legitimately require contiguous memory.	2026-06-03 07:56:19 +00:00
biondizzle	f3b551956d	Cleanup Step 2: Archive Lineage P code, fix broken imports - Move dead dsv4/ modules to dsv4/_archive/ (52 files) - model/{dsv4,mtp,layer,layer_schedule} - layers/{embedding,attention,ffn,norm} (kept linear,mhc,router,moe,shared_expert,grouped_linear - live) - cache/, kernels/cache/, kernels/indexer/{csa_indexer,score_topk,compute_valid_lens} - kernels/router/{nvfp4_fused_router,dense_router_decode_kernel,dense_router_prefill} - ops/{topk,topk_select,rope,router}, loader/{hf_checkpoint,layout_convert} - reference/{attention,compressor,csa_attention,moe_pipeline} - kernels/compressor/{compress_tail,csa_hca} - Restore dsv4/ops/{router,custom_ops}.py (needed by live layers) - Fix dsv4/kernels/{indexer,compressor,attention}/__init__.py (removed broken imports) - Remove preload_all() from loader.py (dead, referenced nonexistent .cu file) - Fix loader.py docstring (fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer) - Move broken tests to tests/e2e_archive/ - test_fused_router, production_values_test, e2e/{one_layer,model_construction,csa_hca} - vLLM has 0 imports of dsv4 (Step 0 confirmed)	2026-06-02 19:27:07 +00:00
biondizzle	c926c4a597	P5: Fix mhc_rmsnorm_quantize_nvfp4 — add proper function definition	2026-06-02 17:57:33 +00:00
biondizzle	bdf0b15d45	P4: Fix rmsnorm_quantize_nvfp4 returns QuantizedActivation not tuple	2026-06-02 17:43:21 +00:00
biondizzle	454dbdad52	P5: Fused mHC pre_block + RMSNorm + NVFP4 quantize kernel - fused_mhc_rmsnorm_quantize.cu: 2-kernel approach Kernel 1: mhc_rmsnorm_amax_gsa — bmm + RMS + amax → gsa Kernel 2: mhc_rmsnorm_quantize_nvfp4 — bmm + normalize + quantize - Python bridge: mhc_rmsnorm_quantize_nvfp4() in ops/quantize.py - Unit test: test_fused_mhc_rmsnorm_quantize.py (production shapes) - Eliminates ~610 kernel launches per token (122 sites × 5 launches saved)	2026-06-02 16:39:42 +00:00
biondizzle	0d1cd1e216	P4: Add QuantizedActivation + Nvfp4Linear.run_from_quantized - QuantizedActivation: carries (x_fp4, x_sf, gsa) for skip-quantize path - Nvfp4Linear.run_from_quantized(): runs GEMM with pre-quantized input - Enables fused RMSNorm+quantize to feed directly into all downstream linears (q_a, kv, o_proj, etc.) without re-quantizing	2026-06-02 16:37:38 +00:00
biondizzle	57ab4b9d4c	P4: Fix dequantize_nvfp4 bridge — handle float8_e4m3fn dtype	2026-06-02 16:31:56 +00:00
biondizzle	794ebaf7e5	P4: Fused RMSNorm + NVFP4 quantize kernel (2 launches vs 6+) - fused_rmsnorm_quantize.cu: two-kernel approach Kernel 1: rmsnorm_amax_gsa — compute RMS + amax of normalized output → gsa per row Kernel 2: rmsnorm_quantize_nvfp4 — normalize + quantize using GPU-computed gsa - Python bridge: rmsnorm_quantize_nvfp4() in ops/quantize.py - Python bridge: dequantize_nvfp4() in ops/quantize.py - Unit test: test_fused_rmsnorm_quantize.py (production shapes: 7168 hidden) - Eliminates ~488 kernel launches per token (122 sites × 4 launches saved)	2026-06-02 16:26:24 +00:00
biondizzle	54a9b6961b	fix: rope_cuda path — kernels/cuda not ops/cuda	2026-06-02 09:06:36 +00:00
biondizzle	2bbbead984	P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops) New files: - dsv4/kernels/cuda/rope_cuda.cu: GPT-J interleaved RoPE kernel (forward+inverse) - dsv4/ops/rope_cuda.py: Python bridge with ctypes loading - tests/unit/test_rope_cuda.py: correctness test (cos >= 0.999998) Savings: ~915 launches/token → 183 launches/token	2026-06-02 09:05:22 +00:00
biondizzle	9fec7d609e	Fix gsa_buffer shape mismatch for MoE (M>1 rows) compute_amax_gsa returns a scalar, but quantize_from_buffer expects (M,). Broadcast the scalar gsa to (M,) — all rows use the same gsa (global max).	2026-06-01 21:33:59 +00:00
biondizzle	cacf64232e	CRITICAL FIX: fused_amax_quantize cross-CTA race condition The single-kernel approach used __syncthreads() for cross-CTA amax reduction, but __syncthreads() only syncs within a CTA (same blockIdx). CTA 0 reading s_amax[1] before CTA 1 writes = race condition = garbage gsa. Result: residual \|X\| exploded to 10^37 by L0. F_attn and F_ffn were 0.0. Fix: Two-kernel approach (correct, zero CPU syncs): Kernel 1: amax_gsa.cu — computes gsa on GPU, returns GPU tensor Kernel 2: quantize_nvfp4_from_buffer — reads gsa from GPU buffer The fused_amax_quantize.cu now exports quantize_nvfp4_from_buffer and deinterleave_quantize_from_buffer (gsa from GPU buffer, not kernel param). Same P0 win: zero .item() syncs. Two kernel launches instead of one, but correctness > shaving one launch.	2026-06-01 21:26:51 +00:00
biondizzle	c8faf20a99	P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path Fused kernels (zero CPU sync, single kernel launch per projection): - fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync). - fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path. Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu + deinterleave_quantize_nvfp4_cuda (had .item() sync). All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache). Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call, ~500 calls/token). Now compiles once and reuses the cached module. Updated layers: - linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer - moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and non-fused paths) - shared_expert.py: fused for L1 and L2 - quantize.py: All functions use module loader cache - sampler.py: Uses module loader cache - indexer/score_topk.py: Uses module loader cache P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop. 2 kernel launches instead of 2T. No .item() in comp_pos either. P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat. max_comp=32768 per layer (32MB). No more quadratic memory growth. ~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).	2026-06-01 21:05:03 +00:00
biondizzle	360f76b970	Performance audit fixes: eliminate CPU-GPU syncs PERFORMANCE_AUDIT.md validation results: 1. Nvfp4Linear .item() sync (610/step) → FIXED: compute_amax_gsa_gpu kernel 2. MoE .item() sync (183/step) → FIXED: same kernel 3. SharedExpert .item() sync (122/step) → FIXED: same kernel 4. FMHA V clone → FIXED: V=K, transpose creates copy implicitly 5. torch.cuda.synchronize in moe_forward → FIXED: conditional on VERBOSE 6. RoPE 8x duplication → INVALIDATED: necessary for per-GPU HBM access 7. mHC BF16 bmm → INVALIDATED: 28K FLOPs, not a bottleneck 8. Router .float() cast → INVALIDATED: needed for FP32 topk, ~1μs New files: - dsv4/kernels/cuda/amax_gsa.cu: GPU-only amax→gsa kernel - dsv4/ops/quantize.py: compute_amax_gsa_gpu() wrapper Net effect: ~915 fewer CPU-GPU syncs per decode step Remaining syncs: ~10 per layer (quantize kernel parameter) + diagnostics	2026-06-01 20:40:19 +00:00
biondizzle	6e53e3007c	fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow	2026-06-01 04:59:06 +00:00
biondizzle	563df02aef	fix: import SF_VEC_SIZE from quantize in gemm_runner (was NameError)	2026-06-01 00:04:48 +00:00
biondizzle	be476b2ce2	router: catch CuTeDSL warmup failures fast, don't let MLIR errors slow down init	2026-06-01 00:00:07 +00:00
biondizzle	1c18c16c68	Fix production rope.py: FP32 arithmetic for forward_rope_partial + inverse_rope_bf16	2026-05-31 09:17:36 +00:00
biondizzle	300dddedc0	E1-E4: gather kernels, handle wiring, rope, sync removal, e2e test E1: LayerCacheHandle now exposes gather_compressed_kv, gather_all_compressed_kv, gather_swa_kv, num_query_heads, head_dim. Gather kernels in dsv4/kernels/cuda/gather_swa.cu + gather_kv.cu. Python wrapper in dsv4/kernels/cache/gather.py. E2: tests/e2e/test_one_layer.py — SWA path smoke test. E3: Compressor/indexer __init__.py bridges (NotImplementedError stubs for CSA/HCA compress_and_store, compute_index_scores_topk). E4: Removed torch.cuda.synchronize() from fmha_multitile_op.py fast path. Error checking via C API return code instead. Also: forward_rope_partial in ops/rope.py (GPT-J interleaved, last 64 dims).	2026-05-30 21:10:26 +00:00
biondizzle	4b9eed02e1	Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files - Deleted fmha.py (CuTeDSL slow path), FmhaKernel, Python KV merge - Deleted fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh - Moved fmha_qk_verify.cuh to tests/unit/qk_verify_kernel.cuh - Deleted decode_sparse.py, decode_swa.py, kernels/decode/ - Deleted 46 test_d.py probes, test_smem_, test_cotiled_, test_tmem_, test_smem_p_, test_ultra_minimal, test_fmha_pv16, test_working_softmax_maybe - Deleted root scratch: debug_linear.py, test_mapping.py, run_router_tests.py - Moved archive/ to archived_plans/code_archive/ - Rewrote production.py: single fast path via 6-warp multi-tile kernel - Added STATUS.md, audit_attention_live.md - Moved NEXT_PRIORITIES.md to archived_plans/	2026-05-30 21:08:12 +00:00
biondizzle	b9f15c250f	Stage E: head-packed MQA/GQA, batch dim, custom_op, integration API - production.py: head-packed M dimension for MQA/GQA (q_per_kv*T rows in single launch per KV group, eliminating redundant K/V TMA loads) - production.py: batch dimension support (outer Python loop) - production.py: warmup_attention_kernels() for pre-compilation - production.py: dsv4_attention_per_head() for exact per-head sink bias - __init__.py: sparse_fmha_with_swa, dense_fmha_with_swa, swa_only_fmha integration functions bridging AttentionSubBlock → production FMHA - custom_ops.py: dsv4::sparse_fmha_with_swa custom_op registration - test_production.py: comprehensive tests (MHA/MQA/GQA, head-packed vs per-head parity, multi-segment KV, SWA+causal+sink, batch, edge cases)	2026-05-27 15:15:03 +00:00
biondizzle	d53e0a33a9	NVFP4-3: add use_2cta_instrs conditional to gemm_runner - run_nvfp4_grouped_gemm: use_2cta = tokens_sum >= 256 and cluster_m even - run_fused_swiglu_grouped_gemm: same conditional - Auto-warms up on first use via lazy compilation cache - 1.7-1.9× throughput at prefill shapes (M>=256) - Decode (M<256) stays 1-CTA (correct, no waste)	2026-05-25 16:42:02 +00:00
biondizzle	c2e3d15633	NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring - Add quantize_nvfp4.cu: BF16→FP4 GPU kernel (no CPU sync, warp shuffle amax) - Add quantize_nvfp4_gpu() bridge in ops/quantize.py - Fix deinterleave_quantize kernel path (dsv4/ops/kernels → dsv4/kernels/cuda) - Wire GPU quantize into Nvfp4MoE._run_impl(): - L1 input: quantize_nvfp4_gpu (replaces quantize_activation_nvfp4) - Fused SwiGLU L2: deinterleave_quantize_nvfp4_cuda (single kernel) - Non-fused L2: quantize_nvfp4_gpu - Add test_nvfp4_gpu_quantize.py for both kernels	2026-05-25 16:19:07 +00:00
biondizzle	e3e67c3992	NVFP4-3: enable 2-CTA UMMA when MMA tile M >= 256 (1.7-1.9x throughput)	2026-05-24 22:57:49 +00:00
biondizzle	401e24768a	fix: import ceil_div in quantize.py (was NameError at runtime)	2026-05-23 08:40:24 +00:00
biondizzle	abfe4485f7	Router: full kernel stack — hash, topk, activation+topk, dense decode/prefill Step 1: Hash router (hash_router.cu) - One thread per token, gather from [vocab_size, k] LUT - Uniform 1/k weights, FP32 output - 3 MB LUT fits in L2 for repeated decode calls Step 2: topk_select.cu — general top-k primitive - Per-thread register min-heap (k=6, compile-time unrolled) - Shared memory merge: thread 0 merges 64 partial heaps - Tie-breaking: lower index wins on equal scores - Reusable by CSA indexer Step 3: activation_topk.cu — fused sqrt(softplus) + bias + topk + renorm - Single kernel: all 6 steps of the router math, no intermediate buffers - Numerically stable softplus: max(x,0) + log1p(exp(-\|x\|)) - Per-thread heap with unbiased activation co-stored - Shared memory merge → sort descending → renormalize → store Step 4: dense_router_decode.py — CuTeDSL fused GEMM kernel (skeleton) - BF16 GEMM with tcgen05.mma, FP32 accumulator - Custom epilogue: activation + bias + top-k (structure defined, needs TMA/MMA boilerplate) - Dispatch: N<=64 uses fused decode, N>64 uses prefill path Step 5: dense_router_prefill.py — prefill path - torch.nn.functional.linear for GEMM (DeepGEMM integration deferred) - Calls activation_topk for fused post-GEMM processing Step 6: Router class + ops/router.py + test_router.py - Router: construction-time mode (dense/hash), weight loading, custom_op dispatch - ops/router.py: torch.library.custom_op wrappers, integer-keyed registry - test_router.py: spec oracle tests (DO NOT RUN — Carmine is testing Stage C) Test strategy: each kernel tested against its mathematical spec in FP32. No reference implementation, no two debug streams. The oracle IS the math.	2026-05-21 21:54:05 +00:00
biondizzle	3fb3c925af	Restructure: cutedsl/ -> dsv4/ with proper layering - Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py - Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc. - Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda) - Moved PyTorch bridges to dsv4/ops/ - Moved nn.Module layers to dsv4layers/ - Moved reference implementations to dsv4/reference/ - Moved vendored CUTLASS code to vendored/ - Archived ~190 debug tests to tests/archive/ - Kept ~15 canonical tests in tests/unit/ - Updated all import paths - Added stubs for future components (model/, cache/, loader/) - Updated pyproject.toml: dsv4-inference package name	2026-05-21 17:30:44 +00:00

37 Commits