nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	ac213bdee8	Update docs: CUDA graph capture WORKING on all 8 GPUs, 0.28s/token (2x eager)	2026-06-06 08:29:40 +00:00
biondizzle	6650f06121	CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay on multi-GPU — fixes zero-output bug	2026-06-06 08:18:18 +00:00
biondizzle	90ac38cde0	Add CUDA graph stream management test	2026-06-06 08:14:29 +00:00
biondizzle	26042e3f01	Add minimal CUDA graph multi-GPU test to isolate zero-output bug	2026-06-06 08:13:18 +00:00
biondizzle	86275851d4	Add minimal CUDA graph test per GPU during capture to isolate multi-GPU graph issue	2026-06-06 08:02:35 +00:00
biondizzle	2cbf7a43e9	Add sync after cross-GPU copy before graph replay; remove misleading zero-input verification	2026-06-06 07:51:22 +00:00
biondizzle	2bb52c7cae	Add per-layer graph capture verification — replay immediately and check for zeros	2026-06-06 07:40:19 +00:00
biondizzle	5a98cc6d90	Store pre-cached norm weights on self to prevent GC during graph replay — root cause of all-zeros replay bug	2026-06-06 07:29:33 +00:00
biondizzle	dcb2495a5b	Add graph replay debug prints for first 3 steps/layers	2026-06-06 07:19:07 +00:00
biondizzle	16b9a4def2	Fix CUDA graph replay: set device to cuda:0 before lm_head graph replay	2026-06-06 07:18:49 +00:00
biondizzle	f259d63930	CRITICAL FIX: SE swizzled buffers were allocated then overwritten with None — graph capture would fall through to broken Python path	2026-06-06 07:01:52 +00:00
biondizzle	32902d1036	CUDA graph capture: derive q_a_dim from config, pre-cache norm weights, add buffer verification, use direct dict access for routers/moe/se	2026-06-06 07:01:12 +00:00
biondizzle	64f547058e	Fix graph replay: pass q_a from Graph A output to forward_attention - q_a is needed by the indexer in CSA layers - When q_heads/kv_3d are provided (graph replay), the projection code is skipped so q_a is never computed - Fix: add q_a_bufs to CUDAGraphDecoder, write q_a during Graph A capture, pass q_a as kwarg to forward_attention during graph replay - Also: forward_attention now accepts q_a kwarg (default None)	2026-06-04 08:09:30 +00:00
biondizzle	26da6d33af	Fix graph replay: remove extra token_id arg from forward_attention call The forward_attention() signature has no token_id parameter, but the graph replay path was passing dec_tid32_per_gpu[gpu] between positions and compressor — causing the int tensor to be interpreted as compressor and triggering AttributeError: 'int' object has no attribute 'ratio'	2026-06-04 06:10:02 +00:00
biondizzle	ae26f6b83c	Fix dense router BF16 dispatch: use torch.matmul instead of F.linear - F.linear(x, W) computes x @ W.T which caused shape mismatch when W_gate was pre-transposed to [E, H] - Use torch.matmul(x, W_gate) instead — computes x @ W directly, no transpose needed, no FP32 conversion, fully graph-capturable - W_gate stays as [H, E] (original checkpoint shape)	2026-06-04 05:58:24 +00:00
biondizzle	e46b615873	Fix dense router BF16 dispatch for CUDA graph capture - Run GEMM in BF16 (not FP32) during graph capture — Blackwell tensor cores handle BF16 natively; FP32 GEMM triggers cudaErrorStreamCaptureUnsupported - Pre-transpose W_gate to [E, H] at load time — avoids .T view during capture - Convert only logits output to FP32 for sqrt(softplus) numerical stability - This fixes the graph capture failure at layer 0 Graph B	2026-06-04 05:50:13 +00:00
biondizzle	b4a59d0940	Update CUDA graph docs with current status, A/B split, buffer fixes, remaining blockers GETTING_CUDAGRAPH_READY.md: - Updated architecture section for A/B split (Graph A + eager attention + Graph B) - Updated Section D integration order with current progress - Added all recent violation fix commits CUDA_GRAPH_SYNC_INVENTORY.md: - Added Category 6 fixes: _l1_out_buf 2x fix, GEMM output pre-allocation, swizzle CUDA kernel, gsa scalar assignment, router BF16 fix - Added remaining blockers for next session - Updated CUDAGraphDecoder architecture description for A/B split - Added capture/replay flow description	2026-06-04 05:13:51 +00:00
biondizzle	ffa7842b58	Fix dense router: run GEMM in BF16, convert to FP32 only for activation hidden_states.float() and gate_bf16.T.float() create new FP32 tensors during CUDA graph capture, which is not graph-capturable. Fix: run the linear in BF16 (Blackwell tensor cores handle BF16 natively), then convert only the output logits to FP32 for numerical stability in sqrt(softplus). The single logits.float() is graph-capturable because it's a unary op with a pre-existing output buffer.	2026-06-04 04:49:08 +00:00
biondizzle	119e6d471e	Add safety check for swizzled buffers: fall through to Python path if None	2026-06-04 04:32:00 +00:00
biondizzle	fae61d3ef7	Add c10/cuda/CUDAStream.h include for getCurrentCUDAStream	2026-06-04 04:13:40 +00:00
biondizzle	ee86969f6c	Fix CUDA stream: use c10::cuda::getCurrentCUDAStream() directly in kernel launch	2026-06-04 03:57:59 +00:00
biondizzle	e26c28a1ce	Fix CUDA stream API: getCurrentCUDAStream().stream()	2026-06-04 03:43:04 +00:00
biondizzle	9b3917e248	Fix blackwell_swizzle.cu: add pybind11 bindings for torch extension loader	2026-06-04 03:29:10 +00:00
biondizzle	5487a58df4	Fix NameError: add rows/cols variables to MoE swizzle	2026-06-04 03:14:27 +00:00
biondizzle	a434545d12	Blackwell swizzle CUDA kernel for CUDA graph capture Python view operations (reshape, transpose, permute) are not graph-capturable — they cause cudaErrorStreamCaptureUnsupported. Added: - dsv4/kernels/cuda/blackwell_swizzle.cu: custom CUDA kernel for 32_4_4 swizzle - to_blocked(): detects graph capture, uses CUDA kernel instead of Python views - MoE _assemble_scales_cudagraph_safe: same treatment - Shared expert _assemble_scales_single_group: same treatment - Linear _assemble_scales_single_group: same treatment - Pre-allocated swizzled output buffers for all layers (avoids torch.empty_like) The CUDA kernel writes to a pre-allocated buffer — no per-step allocations. Eager path unchanged (still uses fast Python view operations).	2026-06-04 03:03:02 +00:00
biondizzle	e7766254b7	Pre-allocate ALL GEMM output buffers for CUDA graph capture Every run_nvfp4_grouped_gemm call must pass out= with a pre-allocated buffer. During CUDA graph capture, torch.zeros() allocations are forbidden — they cause 'cudaErrorStreamCaptureUnsupported' errors. Added: - shared_expert: _l2_out_buf for L2 GEMM - shared_expert: pass out= for both L1 and L2 GEMM calls - moe: _l2_out_buf for L2 GEMM - moe: pass out= for unfused L1 GEMM (fused L1 already had it) - moe: pass out= for L2 GEMM - linear: _gemm_out_buf for all GEMM calls - linear: pass out= for both run() and run_from_quantized() paths grouped_linear already had _output_buf_padded — no changes needed.	2026-06-04 02:41:59 +00:00
biondizzle	676a0448c0	CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption The L1 GEMM produces gate+up combined output with 2*intermediate_size BF16 columns, but _l1_out_buf was only allocated with intermediate_size columns. The GEMM wrote past the buffer boundary, corrupting GPU memory and causing cudaErrorInvalidValue on subsequent operations. This was the root cause of ALL the cudaErrorInvalidValue errors in the shared expert and MoE L2 paths — the corrupted memory from the L1 buffer overflow propagated downstream. Fix: _l1_out_buf shape (max_rows, 2*intermediate_size) instead of (max_rows, intermediate_size). Applied to both shared_expert.py and moe.py. Also removed all DEBUG sync/print statements from quantize.py and shared_expert.py — the bug was not in the quantize kernels, it was the buffer overflow.	2026-06-04 02:06:18 +00:00
biondizzle	0890e578f4	DEBUG: print l1_out shape before gate/up split	2026-06-04 01:49:12 +00:00
biondizzle	8546ed725f	DEBUG: check SE input magnitude	2026-06-04 01:38:24 +00:00
biondizzle	26ecf96328	DEBUG: check intermediate magnitude before SE L2	2026-06-04 01:30:29 +00:00
biondizzle	5303d6a82f	DEBUG: test copy_ with contiguous slice vs scalar assign for gsa	2026-06-04 01:27:25 +00:00
biondizzle	ccbc713658	DEBUG: check gsa values and pinpoint exact failing operation	2026-06-04 01:16:37 +00:00
biondizzle	e77455c3ba	DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors	2026-06-04 01:05:47 +00:00
biondizzle	55def5eef9	Restore A/B split + gsa scalar fix (error is pre-existing, not regression)	2026-06-04 01:03:36 +00:00
biondizzle	59eccd04ab	REVERT: test if cudaErrorInvalidValue is pre-existing or regression	2026-06-04 00:53:09 +00:00
biondizzle	5e3ced0b60	DEBUG: isolate which kernel causes cudaErrorInvalidValue in SE L2 path	2026-06-04 00:41:28 +00:00
biondizzle	b314fde9b7	Fix gsa copy_ cudaErrorInvalidValue: replace view-based copy_ with scalar assignment The view-based copy_ pattern causes cudaErrorInvalidValue when gsa_gpu is a non-contiguous expanded view (e.g., shape (9,) from quantize_nvfp4_gpu_fused during prefill with M>1). Root cause: copy_() from an expanded/reshaped view can fail when the source tensor has non-standard strides. The expand() operation creates a view with stride-0 dimensions that copy_() may not handle correctly on all CUDA versions. Fix: Replace all gsa copy_ patterns with scalar assignment: self._gsa_buf[0] = gsa_gpu[0] # scalar GPU→GPU, graph-capturable This is simpler, avoids view issues, and is CUDA-graph-compatible. Applied to: shared_expert.py, moe.py, linear.py, grouped_linear.py	2026-06-04 00:30:21 +00:00
biondizzle	993bb345d1	DEBUG: fix VERBOSE reference in shared_expert, always print L2 gsa debug	2026-06-04 00:15:38 +00:00
biondizzle	f0f87df906	DEBUG: add sync + shape prints to shared_expert L2 gsa copy	2026-06-04 00:05:08 +00:00
biondizzle	1d6610c46d	CUDA graph A/B split: eager-break-at-attention architecture CUDAGraphDecoder now splits each layer into two graph-captured regions with eager attention in between: Graph A (pre-attention): mHC pre_block + fused RMSNorm + quantize + q_a/q_b/kv projections → writes intermediates to pre-allocated buffers Eager (attention): Compressor → Indexer → FMHA → o_proj → dynamic shapes, data-dependent control flow Graph B (post-attention): mHC post_block + FFN + Router + MoE + SE → writes X_next to pre-allocated output buffer The attention path has dynamic shapes (FMHA seq_len grows, compressor returns None) and cannot be captured. The compute path has fixed shapes for T=1 decode and CAN be captured. Changes: - CUDAGraphDecoder: 2 graphs per layer (A/B) + lm_head graph - Pre-allocated intermediate buffers for graph A → eager → graph B boundary - forward_attention: accepts optional q_heads/kv_3d to skip projections - Replay loop: graph A → eager attention → graph B per layer This replaces the single-graph-per-layer approach which failed at L1+ because the attention path contains data-dependent control flow and dynamic shapes that cannot be captured.	2026-06-03 23:53:08 +00:00
biondizzle	800e974d20	Update CUDA_GRAPH_SYNC_INVENTORY.md with session 2 progress - Category 6: Per-step allocations (partially fixed, 6 done, ~6 blocking) - Category 7: CuTeDSL from_dlpack fix (v3 works, v1/v2 failed) - Category 8: Cross-GPU operations in graph capture (fixed) - CUDAGraphDecoder architecture: single-graph-per-layer (simplified from A/B split) - Multi-layer capture still blocked by Category 6 allocations	2026-06-03 23:41:42 +00:00
biondizzle	a468f72a0e	CUDA graph: Pre-allocate L1 GEMM output buffers in MoE and SharedExpert Pass out= parameter to run_fused_swiglu_grouped_gemm to avoid per-step torch.zeros() allocation during CUDA graph capture.	2026-06-03 23:17:43 +00:00
biondizzle	56b816a54f	CUDA graph: Use per-GPU position/token buffers for graph capture Cross-GPU .to() calls inside graph capture cause 'dependency on uncaptured work in another stream'. Fix: pass dec_pos_per_gpu/dec_tid32_per_gpu to capture() so each layer's graph uses buffers on its own GPU.	2026-06-03 22:56:20 +00:00
biondizzle	f57de06eb5	Fix grouped_linear GEMM output buffer shape and extraction - _output_buf_padded: (max_tokens * n_groups, o_lora_rank) — matches GEMM output - Extraction: groups are stacked vertically, not horizontally - Each group's output is (padded_rows, o_lora_rank) with o_lora_rank columns	2026-06-03 22:26:40 +00:00
biondizzle	92225b07e7	CUDA graph: Simplify to single-graph-per-layer capture (revert A/B split) The A/B split approach was too complex: it required splitting forward_layer, handling the eager FMHA section, and fixing per-GPU buffer issues. The simpler approach captures the entire forward_layer as one graph per layer, just like the detector test did for L0. This works because: - FMHA pads KV to 128 → fixed shape for graph capture - Compressor returns None on non-boundary steps → graph captures the path taken during warmup (typically the None path for HCA r=128) - All sync violations were already fixed in previous commits The capture still uses dec_pos_buf/dec_tid32_buf on cuda:0 (forward_layer handles device transfer internally).	2026-06-03 22:04:18 +00:00
biondizzle	b32713c302	grouped_linear: Pre-allocate output buffer for grouped GEMM (CUDA graph capture) Add _output_buf_padded for the flat GEMM output, pass as out= parameter to run_nvfp4_grouped_gemm to avoid per-step torch.zeros() allocation.	2026-06-03 22:02:01 +00:00
biondizzle	676fad064f	Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature	2026-06-03 21:45:15 +00:00
biondizzle	188ecae47f	CUDA graph: Eliminate per-step allocations in graph-captured code paths - gemm_runner.py: Add out= parameter to run_nvfp4_grouped_gemm and run_fused_swiglu_grouped_gemm to accept pre-allocated output buffers - quantize.py: Replace torch.zeros_like/torch.zeros with scalar 0.0 in torch.where() calls (graph-capturable, no memory allocation) - Both fixes prevent 'Disallowed operation during CUDA stream capture' errors during graph capture	2026-06-03 21:30:24 +00:00
biondizzle	91c370360a	Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3) Patch torch.cuda.current_device to return the tensor's device index during from_dlpack calls inside CUDA graph capture. This bypasses the device check in __dlpack__ without changing the CUDA stream (which caused 'Capture must end on the same stream' in v1) and without triggering a cross-device copy (which caused 'Cannot copy between CPU and CUDA tensors' in v2).	2026-06-03 21:09:12 +00:00
biondizzle	5c94dbbc37	Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v2) Previous fix (set_device) caused 'Capture must end on the same stream'. New fix: wrap tensor in _DLPatchTensor during graph capture, which forces dl_device in __dlpack__ to bypass the device check without changing the stream. This enables CUDA graph capture on all 8 GPUs, not just cuda:0.	2026-06-03 20:54:18 +00:00
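
The buffer-width arithmetic behind commit 676a0448c0 can be sketched in plain Python. This is only an illustration of the shape rule stated in that commit message, not the repository's code: the helper names `l1_out_shape` and `split_gate_up` are hypothetical, and the gate-before-up column order is an assumption the log does not confirm.

```python
def l1_out_shape(max_rows, intermediate_size):
    """Shape the L1 output buffer needs: gate and up halves side by side."""
    # The combined gate+up output has 2 * intermediate_size columns,
    # so a buffer with only intermediate_size columns is 2x too narrow.
    return (max_rows, 2 * intermediate_size)


def split_gate_up(row, intermediate_size):
    """Split one combined output row into (gate, up) halves.

    Assumption for illustration: gate columns come first, then up columns.
    """
    if len(row) != 2 * intermediate_size:
        raise ValueError(
            f"expected {2 * intermediate_size} columns, got {len(row)}"
        )
    return row[:intermediate_size], row[intermediate_size:]


if __name__ == "__main__":
    print(l1_out_shape(max_rows=4, intermediate_size=8))
    gate, up = split_gate_up(list(range(6)), intermediate_size=3)
    print(gate, up)
```

A row that carries only `intermediate_size` columns is rejected by the length check, which is the condition the original, too-narrow buffer violated.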

2399 Commits