nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	f259d63930	CRITICAL FIX: SE swizzled buffers were allocated then overwritten with None — graph capture would fall through to broken Python path	2026-06-06 07:01:52 +00:00
biondizzle	ae26f6b83c	Fix dense router BF16 dispatch: use torch.matmul instead of F.linear - F.linear(x, W) computes x @ W.T which caused shape mismatch when W_gate was pre-transposed to [E, H] - Use torch.matmul(x, W_gate) instead — computes x @ W directly, no transpose needed, no FP32 conversion, fully graph-capturable - W_gate stays as [H, E] (original checkpoint shape)	2026-06-04 05:58:24 +00:00
biondizzle	e46b615873	Fix dense router BF16 dispatch for CUDA graph capture - Run GEMM in BF16 (not FP32) during graph capture — Blackwell tensor cores handle BF16 natively; FP32 GEMM triggers cudaErrorStreamCaptureUnsupported - Pre-transpose W_gate to [E, H] at load time — avoids .T view during capture - Convert only logits output to FP32 for sqrt(softplus) numerical stability - This fixes the graph capture failure at layer 0 Graph B	2026-06-04 05:50:13 +00:00
biondizzle	ffa7842b58	Fix dense router: run GEMM in BF16, convert to FP32 only for activation hidden_states.float() and gate_bf16.T.float() create new FP32 tensors during CUDA graph capture, which is not graph-capturable. Fix: run the linear in BF16 (Blackwell tensor cores handle BF16 natively), then convert only the output logits to FP32 for numerical stability in sqrt(softplus). The single logits.float() is graph-capturable because it's a unary op with a pre-existing output buffer.	2026-06-04 04:49:08 +00:00
biondizzle	119e6d471e	Add safety check for swizzled buffers: fall through to Python path if None	2026-06-04 04:32:00 +00:00
biondizzle	fae61d3ef7	Add c10/cuda/CUDAStream.h include for getCurrentCUDAStream	2026-06-04 04:13:40 +00:00
biondizzle	ee86969f6c	Fix CUDA stream: use c10::cuda::getCurrentCUDAStream() directly in kernel launch	2026-06-04 03:57:59 +00:00
biondizzle	e26c28a1ce	Fix CUDA stream API: getCurrentCUDAStream().stream()	2026-06-04 03:43:04 +00:00
biondizzle	9b3917e248	Fix blackwell_swizzle.cu: add pybind11 bindings for torch extension loader	2026-06-04 03:29:10 +00:00
biondizzle	5487a58df4	Fix NameError: add rows/cols variables to MoE swizzle	2026-06-04 03:14:27 +00:00
biondizzle	a434545d12	Blackwell swizzle CUDA kernel for CUDA graph capture Python view operations (reshape, transpose, permute) are not graph-capturable — they cause cudaErrorStreamCaptureUnsupported. Added: - dsv4/kernels/cuda/blackwell_swizzle.cu: custom CUDA kernel for 32_4_4 swizzle - to_blocked(): detects graph capture, uses CUDA kernel instead of Python views - MoE _assemble_scales_cudagraph_safe: same treatment - Shared expert _assemble_scales_single_group: same treatment - Linear _assemble_scales_single_group: same treatment - Pre-allocated swizzled output buffers for all layers (avoids torch.empty_like) The CUDA kernel writes to a pre-allocated buffer — no per-step allocations. Eager path unchanged (still uses fast Python view operations).	2026-06-04 03:03:02 +00:00
biondizzle	e7766254b7	Pre-allocate ALL GEMM output buffers for CUDA graph capture Every run_nvfp4_grouped_gemm call must pass out= with a pre-allocated buffer. During CUDA graph capture, torch.zeros() allocations are forbidden — they cause 'cudaErrorStreamCaptureUnsupported' errors. Added: - shared_expert: _l2_out_buf for L2 GEMM - shared_expert: pass out= for both L1 and L2 GEMM calls - moe: _l2_out_buf for L2 GEMM - moe: pass out= for unfused L1 GEMM (fused L1 already had it) - moe: pass out= for L2 GEMM - linear: _gemm_out_buf for all GEMM calls - linear: pass out= for both run() and run_from_quantized() paths grouped_linear already had _output_buf_padded — no changes needed.	2026-06-04 02:41:59 +00:00
biondizzle	676a0448c0	CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption The L1 GEMM produces gate+up combined output with 2 * intermediate_size BF16 columns, but _l1_out_buf was only allocated with intermediate_size columns. The GEMM wrote past the buffer boundary, corrupting GPU memory and causing cudaErrorInvalidValue on subsequent operations. This was the root cause of ALL the cudaErrorInvalidValue errors in the shared expert and MoE L2 paths — the corrupted memory from the L1 buffer overflow propagated downstream. Fix: _l1_out_buf shape (max_rows, 2 * intermediate_size) instead of (max_rows, intermediate_size). Applied to both shared_expert.py and moe.py. Also removed all DEBUG sync/print statements from quantize.py and shared_expert.py — the bug was not in the quantize kernels, it was the buffer overflow.	2026-06-04 02:06:18 +00:00
biondizzle	0890e578f4	DEBUG: print l1_out shape before gate/up split	2026-06-04 01:49:12 +00:00
biondizzle	8546ed725f	DEBUG: check SE input magnitude	2026-06-04 01:38:24 +00:00
biondizzle	26ecf96328	DEBUG: check intermediate magnitude before SE L2	2026-06-04 01:30:29 +00:00
biondizzle	5303d6a82f	DEBUG: test copy_ with contiguous slice vs scalar assign for gsa	2026-06-04 01:27:25 +00:00
biondizzle	ccbc713658	DEBUG: check gsa values and pinpoint exact failing operation	2026-06-04 01:16:37 +00:00
biondizzle	e77455c3ba	DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors	2026-06-04 01:05:47 +00:00
biondizzle	55def5eef9	Restore A/B split + gsa scalar fix (error is pre-existing, not regression)	2026-06-04 01:03:36 +00:00
biondizzle	59eccd04ab	REVERT: test if cudaErrorInvalidValue is pre-existing or regression	2026-06-04 00:53:09 +00:00
biondizzle	5e3ced0b60	DEBUG: isolate which kernel causes cudaErrorInvalidValue in SE L2 path	2026-06-04 00:41:28 +00:00
biondizzle	b314fde9b7	Fix gsa copy_ cudaErrorInvalidValue: replace view-based copy_ with scalar assignment The view-based copy_ pattern causes cudaErrorInvalidValue when gsa_gpu is a non-contiguous expanded view (e.g., shape (9,) from quantize_nvfp4_gpu_fused during prefill with M>1). Root cause: copy_() from an expanded/reshaped view can fail when the source tensor has non-standard strides. The expand() operation creates a view with stride-0 dimensions that copy_() may not handle correctly on all CUDA versions. Fix: Replace all gsa copy_ patterns with scalar assignment: self._gsa_buf[0] = gsa_gpu[0] # scalar GPU→GPU, graph-capturable This is simpler, avoids view issues, and is CUDA-graph-compatible. Applied to: shared_expert.py, moe.py, linear.py, grouped_linear.py	2026-06-04 00:30:21 +00:00
biondizzle	993bb345d1	DEBUG: fix VERBOSE reference in shared_expert, always print L2 gsa debug	2026-06-04 00:15:38 +00:00
biondizzle	f0f87df906	DEBUG: add sync + shape prints to shared_expert L2 gsa copy	2026-06-04 00:05:08 +00:00
biondizzle	a468f72a0e	CUDA graph: Pre-allocate L1 GEMM output buffers in MoE and SharedExpert Pass out= parameter to run_fused_swiglu_grouped_gemm to avoid per-step torch.zeros() allocation during CUDA graph capture.	2026-06-03 23:17:43 +00:00
biondizzle	f57de06eb5	Fix grouped_linear GEMM output buffer shape and extraction - _output_buf_padded: (max_tokens * n_groups, o_lora_rank) — matches GEMM output - Extraction: groups are stacked vertically, not horizontally - Each group's output is (padded_rows, o_lora_rank) with o_lora_rank columns	2026-06-03 22:26:40 +00:00
biondizzle	b32713c302	grouped_linear: Pre-allocate output buffer for grouped GEMM (CUDA graph capture) Add _output_buf_padded for the flat GEMM output, pass as out= parameter to run_nvfp4_grouped_gemm to avoid per-step torch.zeros() allocation.	2026-06-03 22:02:01 +00:00
biondizzle	676fad064f	Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature	2026-06-03 21:45:15 +00:00
biondizzle	188ecae47f	CUDA graph: Eliminate per-step allocations in graph-captured code paths - gemm_runner.py: Add out= parameter to run_nvfp4_grouped_gemm and run_fused_swiglu_grouped_gemm to accept pre-allocated output buffers - quantize.py: Replace torch.zeros_like/torch.zeros with scalar 0.0 in torch.where() calls (graph-capturable, no memory allocation) - Both fixes prevent 'Disallowed operation during CUDA stream capture' errors during graph capture	2026-06-03 21:30:24 +00:00
biondizzle	91c370360a	Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3) Patch torch.cuda.current_device to return the tensor's device index during from_dlpack calls inside CUDA graph capture. This bypasses the device check in __dlpack__ without changing the CUDA stream (which caused 'Capture must end on the same stream' in v1) and without triggering a cross-device copy (which caused 'Cannot copy between CPU and CUDA tensors' in v2).	2026-06-03 21:09:12 +00:00
biondizzle	5c94dbbc37	Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v2) Previous fix (set_device) caused 'Capture must end on the same stream'. New fix: wrap tensor in _DLPatchTensor during graph capture, which forces dl_device in __dlpack__ to bypass the device check without changing the stream. This enables CUDA graph capture on all 8 GPUs, not just cuda:0.	2026-06-03 20:54:18 +00:00
biondizzle	87b6c9932b	Fix CuTeDSL from_dlpack device mismatch inside CUDA graph capture When capturing CUDA graphs on non-default GPUs, torch.cuda.current_device() may not match the tensor's device. from_dlpack() checks this and fails. Fix: set the current device to match the tensor's device before from_dlpack. This enables graph capture on all 8 GPUs, not just cuda:0.	2026-06-03 20:34:24 +00:00
biondizzle	80bb27f5bf	CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode The stride-0 expand view for gsa_gpu caused illegal memory access in quantize_nvfp4_from_buffer kernel. The CUDA kernel may not handle stride-0 tensors correctly. Fix: - M=1 decode (graph-captured): just reshape scalar to (1,) — no alloc - M>1 prefill (not graph-captured): expand + contiguous — allocation OK	2026-06-03 18:08:18 +00:00
biondizzle	518a1d3f95	CUDA graph: Fix MoE scatter_add_ index dtype + fix second bincount 1. scatter_add_ requires int64 indices — ensure sorted_ids is .long() 2. Fixed the SECOND torch.bincount call (line 590) — same scatter_add_ pattern 3. Both code paths now use pre-allocated _tokens_per_expert_buf	2026-06-03 17:53:40 +00:00
biondizzle	f13a81d48b	CUDA graph: Fix per-call allocations in grouped_linear and quantize 1. grouped_linear.py: Pre-allocate _scale_a_buf for swizzle - Same fix as linear.py — avoids torch.zeros per call - Uses correctly-sized view for pad_and_swizzle_single 2. quantize.py: Replace torch.zeros_like with scalar 0.0 - torch.zeros_like allocates a full tensor every call - torch.where(cond, 0.0, x) broadcasts scalar — no allocation	2026-06-03 17:39:20 +00:00
biondizzle	84655d066a	CUDA graph: Fix MoE bincount and per-call allocations (Hazard #4) 1. Replace torch.bincount with scatter_add_ into pre-allocated buffer - bincount produces data-dependent shapes → breaks graph capture - scatter_add_ with pre-allocated _tokens_per_expert_buf (fixed shape) - Pre-allocated _ones_buf to avoid per-call torch.ones() 2. Replace torch.full for l1_gsa with pre-allocated buffer + fill_ - torch.full allocates every call → breaks graph capture - Use self._l1_gsa_buf.fill_(l1_gs) instead	2026-06-03 17:37:03 +00:00
biondizzle	df05289d6f	CUDA graph: Fix remaining sync violations from B200 detector run 2 1. grouped_linear.py: Remove conditional host read of GPU tensor - 'if group_offsets[0] != 0' reads GPU value on host → sync - Fix: unconditionally update offsets every call (GPU-only multiply) 2. test_cuda_graph_readiness.py: Use pinned CPU buffers for token transfer - dec_tid_buf[0] = python_int → CPU→GPU sync - Fix: write to pinned CPU buffer, then copy_ (async, graph-capturable) 3. Add dsv4/decode/cuda_graph_decoder.py (skeleton)	2026-06-03 17:20:34 +00:00
biondizzle	e07d79868f	CUDA graph: Fix _assemble_scales_single_group swizzle size The pre-allocated buffer is max-sized, but pad_and_swizzle_single operates on the full buffer dimensions. Fix: pass a correctly-sized view (buf[:padded_rows, :padded_cols]) so the swizzle produces the right output size. Same fix applied to both linear.py and shared_expert.py.	2026-06-03 17:02:34 +00:00
biondizzle	0ca7bed0e1	CUDA graph: Fix sync violations found by B200 detector Fixes from running Section A detector on B200: 1. single_shot_inference.py: Use pinned CPU buffers for token/position transfer - dec_tid_buf[0] = python_int causes CPU→GPU sync - Fixed: write to pinned CPU buffer, then copy_ (async, graph-capturable) 2. grouped_linear.py: Fix expert_offsets Python loop - expert_offsets[g] = python_int * padded_rows → CPU→GPU sync per iteration - Fixed: element-wise multiply with pre-allocated range tensor (GPU-only) 3. grouped_linear.py: Vectorized output extraction for T=1 decode - Python loop z[:, g, :] = out[...] → CPU sync for each slice - Fixed: GPU gather with pre-computed indices for T=1 4. grouped_linear.py: Pre-allocate output buffer - torch.empty() per call → allocation inside graph - Fixed: use self._output_buf (pre-allocated at max size) 5. grouped_linear.py: Pre-allocate expert_offsets_range_buf - torch.arange() per call → allocation inside graph - Fixed: compute once at init, reuse via element-wise multiply	2026-06-03 16:52:19 +00:00
biondizzle	46a3a51832	CUDA graph: Fix per-step allocations in decode loop 1. mHCLayer.init_state: Add out_buf parameter for in-place write - Pre-allocated dec_X_buf (1, 4, 7168) on cuda:0 - Eliminates .unsqueeze().expand().clone() allocation each step 2. single_shot_inference.py: Pre-allocate dec_embed_buf - Placeholder for embedding output (graph capture will use this) 3. Note: Cross-GPU X.to() transfers still allocate per step - This requires per-GPU X buffers (part of graph capture architecture)	2026-06-03 16:38:35 +00:00
biondizzle	a9ea30353c	CUDA graph: Fix sync violations (Category 1-2) 1. mhc.py: Remove .item() from post_block (122 syncs/step eliminated) - The X_next.abs().max().item() was syncing EVERY layer's post_block - Diagnostics moved to caller (outside graph region) 2. linear.py: Pre-allocate _scale_a_buf in _ensure_buffer_size - _assemble_scales_single_group now uses pre-allocated buffer - Eliminates per-call torch.zeros() allocation (graph capture killer) 3. shared_expert.py: Same fix — use pre-allocated padded_x_sf_buf - _assemble_scales_single_group no longer allocates 4. quantize.py: Remove .contiguous() from gsa expand - expand() creates stride-0 view, CUDA kernel reads correctly - No allocation on the hot path 5. Add CUDA_GRAPH_SYNC_INVENTORY.md with full violation catalog	2026-06-03 16:37:20 +00:00
biondizzle	ca5bc814d5	Fix compressor: do not add positional bias to KV content The positional bias (ape/B) should only modulate the compression softmax logits (Z + B), NOT be added to the KV content itself. Paper equation: compressed = softmax(Z + B) · C Bug was doing: compressed = softmax(Z + B) · (C + B) — poisons every compressed KV entry with learned positional-bias content. Fixed in both CSA (compress_csa_reduce_kernel) and HCA (hca_compress_reduce_kernel) paths in compressor_reduce.cu.	2026-06-03 15:52:00 +00:00
biondizzle	5e09be08af	Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill) The intermediate tensor from fused SwiGLU deinterleave is a column slice (non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and the CUDA kernel crashes with 'input must be contiguous'. Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused and in SharedExpert._run_l2. This is the root cause, not a workaround — CUDA kernels legitimately require contiguous memory.	2026-06-03 07:56:19 +00:00
biondizzle	75288bd12f	Wire prefill FMHA into production.py and single_shot - Add dsv4_attention_mixed_fp8_prefill to production.py - _run_production_fmha_mixed now dispatches to prefill kernel for T>1 - Remove decode-only T==1 restriction - Update FINAL_STRETCH.md: prefill marked DONE, batched prefill TODO noted	2026-06-03 03:49:57 +00:00
biondizzle	5417f65b08	CRITICAL FIX: Add T-dimension strides to prefill FMHA kernel The kernel was using head strides for the T (query row) dimension, which happened to work for T=1 (qr=0 always) but was wrong for T>1. For (B,H,T,NOPE) layout: - Head stride = T*NOPE, but T stride = NOPE - Scale head stride = T, but T stride = 1 - RoPE head stride = T*ROPE, but T stride = ROPE Added q_nope_t_stride, q_scale_t_stride, q_rope_t_stride to params struct, C API, and Python wrapper.	2026-06-03 03:48:17 +00:00
biondizzle	223c22488f	Simplify prefill PV read: use decode kernel's exact pattern Replace complex n_sub-iterating read with the same HD/8 iteration pattern as the proven decode kernel. Extract from lane qr%32 instead of always lane 0. For qr>=32, use warp 1; for qr>=64, add TMEM offset. This should fix the row 1 accuracy issue (was cos=0.94 vs decode).	2026-06-03 03:22:49 +00:00
biondizzle	eb69c3bfb9	CRITICAL FIX: add missing tb base in QK TMEM read address prefill_read_qk_rows was reading from address 0 (sg_off + n * 8) instead of tb + sg_off + n * 8. This caused garbage QK values, explaining the 0.928 cosine for T=1 and NaN for T>1.	2026-06-03 03:00:57 +00:00
biondizzle	99b6de316b	Fix prefill kernel: add missing tb base in PV TMEM read, fix ACCUMULATE for per-row PV Two critical fixes: 1. prefill_read_pv_all_subs: was missing 'tb' base in TMEM read address 2. PV MMA ACCUMULATE: use pv_kt == 0 (not kv_tile==0 && pv_kt==0 && n_sub==0) so each query row's PV starts fresh instead of accumulating into previous row's result	2026-06-03 02:59:19 +00:00
biondizzle	9034f67b0f	Fix prefill kernel: read ALL n_sub PV results (was only n_sub=0) Critical bug: prefill_read_pv_row only read n_sub=0 (16 out of 512 HD dims). Replaced with prefill_read_pv_all_subs that iterates over all 32 n_sub groups. Also fixed TMEM row-group/warp mapping for rows 32-127.	2026-06-03 02:54:59 +00:00
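The stride relationships listed in commit 5417f65b08 follow from ordinary row-major indexing, and a small pure-Python sketch may make them easier to check. It assumes the tensors are contiguous; the helper names (`row_major_strides`, `q_nope_offset`, `buggy_offset`) and the sizes are hypothetical and chosen only for illustration, not taken from the kernel.

```python
from math import prod


def row_major_strides(shape):
    """Element strides of a contiguous row-major tensor with the given shape."""
    return tuple(prod(shape[i + 1:]) for i in range(len(shape)))


def q_nope_offset(b, h, t, d, shape):
    """Flat element offset of q_nope[b, h, t, d] for a contiguous (B, H, T, NOPE) tensor."""
    s_b, s_h, s_t, s_d = row_major_strides(shape)
    return b * s_b + h * s_h + t * s_t + d * s_d


# Illustrative sizes only; these are not the model's real dimensions.
B, H, T, NOPE = 2, 3, 4, 8
strides = row_major_strides((B, H, T, NOPE))
head_stride, t_stride = strides[1], strides[2]  # T*NOPE and NOPE


def buggy_offset(b, h, t, d):
    """Reuses the head stride for the query-row index t, as the commit describes."""
    return b * strides[0] + h * head_stride + t * head_stride + d


# The two offsets agree for t == 0 and diverge for any t > 0.
print(head_stride, t_stride)
print(buggy_offset(0, 0, 1, 0), q_nope_offset(0, 0, 1, 0, (B, H, T, NOPE)))
```

Because the two offsets coincide whenever `t == 0`, a decode-only test (T=1) cannot expose this class of bug; a check with T>1 is needed.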

686 Commits
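
The compressor fix in commit ca5bc814d5 contrasts softmax(Z + B) · C with softmax(Z + B) · (C + B). The toy sketch below reproduces that difference with one scalar of content per position. The function names and values are hypothetical, and the real kernels in compressor_reduce.cu operate on vectors, so treat this as an illustration of the equation only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress_fixed(Z, B, C):
    """softmax(Z + B) . C : the bias shapes the weights only."""
    w = softmax([z + b for z, b in zip(Z, B)])
    return sum(wi * ci for wi, ci in zip(w, C))


def compress_buggy(Z, B, C):
    """softmax(Z + B) . (C + B) : the bias also leaks into the content."""
    w = softmax([z + b for z, b in zip(Z, B)])
    return sum(wi * (ci + bi) for wi, ci, bi in zip(w, C, B))


Z = [0.5, -1.0, 2.0]   # compression logits (made-up values)
B = [3.0, 3.0, 3.0]    # constant bias, so the softmax weights are unaffected
C = [1.0, 2.0, 4.0]    # KV content, one scalar per position

fixed = compress_fixed(Z, B, C)
buggy = compress_buggy(Z, B, C)
print(fixed, buggy - fixed)
```

With a constant bias the weights match the unbiased softmax, so the fixed result equals softmax(Z) · C, while the buggy variant is shifted by the bias value itself.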