nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	55def5eef9	Restore A/B split + gsa scalar fix (error is pre-existing, not regression)	2026-06-04 01:03:36 +00:00
biondizzle	59eccd04ab	REVERT: test if cudaErrorInvalidValue is pre-existing or regression	2026-06-04 00:53:09 +00:00
biondizzle	5e3ced0b60	DEBUG: isolate which kernel causes cudaErrorInvalidValue in SE L2 path	2026-06-04 00:41:28 +00:00
biondizzle	b314fde9b7	Fix gsa copy_ cudaErrorInvalidValue: replace view-based copy_ with scalar assignment The pattern causes cudaErrorInvalidValue when gsa_gpu is a non-contiguous expanded view (e.g., shape (9,) from quantize_nvfp4_gpu_fused during prefill with M>1). Root cause: copy_() from an expanded/reshaped view can fail when the source tensor has non-standard strides. The expand() operation creates a view with stride-0 dimensions that copy_() may not handle correctly on all CUDA versions. Fix: Replace all gsa copy_ patterns with scalar assignment: self._gsa_buf[0] = gsa_gpu[0] # scalar GPU→GPU, graph-capturable This is simpler, avoids view issues, and is CUDA-graph-compatible. Applied to: shared_expert.py, moe.py, linear.py, grouped_linear.py	2026-06-04 00:30:21 +00:00
biondizzle	993bb345d1	DEBUG: fix VERBOSE reference in shared_expert, always print L2 gsa debug	2026-06-04 00:15:38 +00:00
biondizzle	f0f87df906	DEBUG: add sync + shape prints to shared_expert L2 gsa copy	2026-06-04 00:05:08 +00:00
biondizzle	1d6610c46d	CUDA graph A/B split: eager-break-at-attention architecture CUDAGraphDecoder now splits each layer into two graph-captured regions with eager attention in between: Graph A (pre-attention): mHC pre_block + fused RMSNorm + quantize + q_a/q_b/kv projections → writes intermediates to pre-allocated buffers Eager (attention): Compressor → Indexer → FMHA → o_proj → dynamic shapes, data-dependent control flow Graph B (post-attention): mHC post_block + FFN + Router + MoE + SE → writes X_next to pre-allocated output buffer The attention path has dynamic shapes (FMHA seq_len grows, compressor returns None) and cannot be captured. The compute path has fixed shapes for T=1 decode and CAN be captured. Changes: - CUDAGraphDecoder: 2 graphs per layer (A/B) + lm_head graph - Pre-allocated intermediate buffers for graph A → eager → graph B boundary - forward_attention: accepts optional q_heads/kv_3d to skip projections - Replay loop: graph A → eager attention → graph B per layer This replaces the single-graph-per-layer approach which failed at L1+ because the attention path contains data-dependent control flow and dynamic shapes that cannot be captured.	2026-06-03 23:53:08 +00:00
biondizzle	800e974d20	Update CUDA_GRAPH_SYNC_INVENTORY.md with session 2 progress - Category 6: Per-step allocations (partially fixed, 6 done, ~6 blocking) - Category 7: CuTeDSL from_dlpack fix (v3 works, v1/v2 failed) - Category 8: Cross-GPU operations in graph capture (fixed) - CUDAGraphDecoder architecture: single-graph-per-layer (simplified from A/B split) - Multi-layer capture still blocked by Category 6 allocations	2026-06-03 23:41:42 +00:00
biondizzle	a468f72a0e	CUDA graph: Pre-allocate L1 GEMM output buffers in MoE and SharedExpert Pass out= parameter to run_fused_swiglu_grouped_gemm to avoid per-step torch.zeros() allocation during CUDA graph capture.	2026-06-03 23:17:43 +00:00
biondizzle	56b816a54f	CUDA graph: Use per-GPU position/token buffers for graph capture Cross-GPU .to() calls inside graph capture cause 'dependency on uncaptured work in another stream'. Fix: pass dec_pos_per_gpu/dec_tid32_per_gpu to capture() so each layer's graph uses buffers on its own GPU.	2026-06-03 22:56:20 +00:00
biondizzle	f57de06eb5	Fix grouped_linear GEMM output buffer shape and extraction - _output_buf_padded: (max_tokens * n_groups, o_lora_rank) — matches GEMM output - Extraction: groups are stacked vertically, not horizontally - Each group's output is (padded_rows, o_lora_rank) with o_lora_rank columns	2026-06-03 22:26:40 +00:00
biondizzle	92225b07e7	CUDA graph: Simplify to single-graph-per-layer capture (revert A/B split) The A/B split approach was too complex: it required splitting forward_layer, handling the eager FMHA section, and fixing per-GPU buffer issues. The simpler approach captures the entire forward_layer as one graph per layer, just like the detector test did for L0. This works because: - FMHA pads KV to 128 → fixed shape for graph capture - Compressor returns None on non-boundary steps → graph captures the path taken during warmup (typically the None path for HCA r=128) - All sync violations were already fixed in previous commits The capture still uses dec_pos_buf/dec_tid32_buf on cuda:0 (forward_layer handles device transfer internally).	2026-06-03 22:04:18 +00:00
biondizzle	b32713c302	grouped_linear: Pre-allocate output buffer for grouped GEMM (CUDA graph capture) Add _output_buf_padded for the flat GEMM output, pass as out= parameter to run_nvfp4_grouped_gemm to avoid per-step torch.zeros() allocation.	2026-06-03 22:02:01 +00:00
biondizzle	676fad064f	Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature	2026-06-03 21:45:15 +00:00
biondizzle	188ecae47f	CUDA graph: Eliminate per-step allocations in graph-captured code paths - gemm_runner.py: Add out= parameter to run_nvfp4_grouped_gemm and run_fused_swiglu_grouped_gemm to accept pre-allocated output buffers - quantize.py: Replace torch.zeros_like/torch.zeros with scalar 0.0 in torch.where() calls (graph-capturable, no memory allocation) - Both fixes prevent 'Disallowed operation during CUDA stream capture' errors during graph capture	2026-06-03 21:30:24 +00:00
biondizzle	91c370360a	Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3) Patch torch.cuda.current_device to return the tensor's device index during from_dlpack calls inside CUDA graph capture. This bypasses the device check in __dlpack__ without changing the CUDA stream (which caused 'Capture must end on the same stream' in v1) and without triggering a cross-device copy (which caused 'Cannot copy between CPU and CUDA tensors' in v2).	2026-06-03 21:09:12 +00:00
biondizzle	5c94dbbc37	Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v2) Previous fix (set_device) caused 'Capture must end on the same stream'. New fix: wrap tensor in _DLPatchTensor during graph capture, which forces dl_device in __dlpack__ to bypass the device check without changing the stream. This enables CUDA graph capture on all 8 GPUs, not just cuda:0.	2026-06-03 20:54:18 +00:00
biondizzle	87b6c9932b	Fix CuTeDSL from_dlpack device mismatch inside CUDA graph capture When capturing CUDA graphs on non-default GPUs, torch.cuda.current_device() may not match the tensor's device. from_dlpack() checks this and fails. Fix: set the current device to match the tensor's device before from_dlpack. This enables graph capture on all 8 GPUs, not just cuda:0.	2026-06-03 20:34:24 +00:00
biondizzle	2661cebe9a	Fix warmup_gsa: handle multi-element _gsa_buf (Nvfp4GroupedLinear per-group gsa)	2026-06-03 19:49:54 +00:00
biondizzle	486f74d900	CUDA graph: Implement eager-break-at-attention decoder with sub-graph A/B split Architecture: - Sub-graph A (per layer): mHC pre + fused rmsnorm/quantize + Q/KV projections + RoPE - Eager section: KV append + Compressor + Indexer + KV gather + FMHA + Inverse RoPE - Sub-graph B (per layer): o_proj + mHC post(attn) + mHC pre(FFN) + fused rmsnorm/quantize + Router + MoE + SE + mHC post(FFN) - lm_head graph on cuda:0 Key features: - Per-GPU token/position buffers (avoids cross-device .to() inside graphs) - Pre-allocated I/O buffers with fixed addresses for graph capture - Uses fused P5 rmsnorm+quantize path inside graphs (production path) - Captures after step 0 warmup (after CuTeDSL compile + gsa fix) - Eager path unchanged for warmup and --no-cuda-graph runs - eager_attention() extracted from forward_attention() for graph replay path Wires --cuda-graph flag into main() decode loop.	2026-06-03 19:24:26 +00:00
biondizzle	5ea3aa3406	Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md - L0 CUDA graph capture PASSES on B200 - All compute-forward sync violations fixed - 3/5 Section C hazards done, 2 deferred to Phase 2 - Full violation fix log with commits - Next steps: extend to all 61 layers + replay verification	2026-06-03 19:15:27 +00:00
biondizzle	80bb27f5bf	CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode The stride-0 expand view for gsa_gpu caused illegal memory access in quantize_nvfp4_from_buffer kernel. The CUDA kernel may not handle stride-0 tensors correctly. Fix: - M=1 decode (graph-captured): just reshape scalar to (1,) — no alloc - M>1 prefill (not graph-captured): expand + contiguous — allocation OK	2026-06-03 18:08:18 +00:00
biondizzle	518a1d3f95	CUDA graph: Fix MoE scatter_add_ index dtype + fix second bincount 1. scatter_add_ requires int64 indices — ensure sorted_ids is .long() 2. Fixed the SECOND torch.bincount call (line 590) — same scatter_add_ pattern 3. Both code paths now use pre-allocated _tokens_per_expert_buf	2026-06-03 17:53:40 +00:00
biondizzle	f13a81d48b	CUDA graph: Fix per-call allocations in grouped_linear and quantize 1. grouped_linear.py: Pre-allocate _scale_a_buf for swizzle - Same fix as linear.py — avoids torch.zeros per call - Uses correctly-sized view for pad_and_swizzle_single 2. quantize.py: Replace torch.zeros_like with scalar 0.0 - torch.zeros_like allocates a full tensor every call - torch.where(cond, 0.0, x) broadcasts scalar — no allocation	2026-06-03 17:39:20 +00:00
biondizzle	84655d066a	CUDA graph: Fix MoE bincount and per-call allocations (Hazard #4) 1. Replace torch.bincount with scatter_add_ into pre-allocated buffer - bincount produces data-dependent shapes → breaks graph capture - scatter_add_ with pre-allocated _tokens_per_expert_buf (fixed shape) - Pre-allocated _ones_buf to avoid per-call torch.ones() 2. Replace torch.full for l1_gsa with pre-allocated buffer + fill_ - torch.full allocates every call → breaks graph capture - Use self._l1_gsa_buf.fill_(l1_gs) instead	2026-06-03 17:37:03 +00:00
biondizzle	df05289d6f	CUDA graph: Fix remaining sync violations from B200 detector run 2 1. grouped_linear.py: Remove conditional host read of GPU tensor - 'if group_offsets[0] != 0' reads GPU value on host → sync - Fix: unconditionally update offsets every call (GPU-only multiply) 2. test_cuda_graph_readiness.py: Use pinned CPU buffers for token transfer - dec_tid_buf[0] = python_int → CPU→GPU sync - Fix: write to pinned CPU buffer, then copy_ (async, graph-capturable) 3. Add dsv4/decode/cuda_graph_decoder.py (skeleton)	2026-06-03 17:20:34 +00:00
biondizzle	e07d79868f	CUDA graph: Fix _assemble_scales_single_group swizzle size The pre-allocated buffer is max-sized, but pad_and_swizzle_single operates on the full buffer dimensions. Fix: pass a correctly-sized view (buf[:padded_rows, :padded_cols]) so the swizzle produces the right output size. Same fix applied to both linear.py and shared_expert.py.	2026-06-03 17:02:34 +00:00
biondizzle	0ca7bed0e1	CUDA graph: Fix sync violations found by B200 detector Fixes from running Section A detector on B200: 1. single_shot_inference.py: Use pinned CPU buffers for token/position transfer - dec_tid_buf[0] = python_int causes CPU→GPU sync - Fixed: write to pinned CPU buffer, then copy_ (async, graph-capturable) 2. grouped_linear.py: Fix expert_offsets Python loop - expert_offsets[g] = python_int * padded_rows → CPU→GPU sync per iteration - Fixed: element-wise multiply with pre-allocated range tensor (GPU-only) 3. grouped_linear.py: Vectorized output extraction for T=1 decode - Python loop z[:, g, :] = out[...] → CPU sync for each slice - Fixed: GPU gather with pre-computed indices for T=1 4. grouped_linear.py: Pre-allocate output buffer - torch.empty() per call → allocation inside graph - Fixed: use self._output_buf (pre-allocated at max size) 5. grouped_linear.py: Pre-allocate expert_offsets_range_buf - torch.arange() per call → allocation inside graph - Fixed: compute once at init, reuse via element-wise multiply	2026-06-03 16:52:19 +00:00
biondizzle	46a3a51832	CUDA graph: Fix per-step allocations in decode loop 1. mHCLayer.init_state: Add out_buf parameter for in-place write - Pre-allocated dec_X_buf (1, 4, 7168) on cuda:0 - Eliminates .unsqueeze().expand().clone() allocation each step 2. single_shot_inference.py: Pre-allocate dec_embed_buf - Placeholder for embedding output (graph capture will use this) 3. Note: Cross-GPU X.to() transfers still allocate per step - This requires per-GPU X buffers (part of graph capture architecture)	2026-06-03 16:38:35 +00:00
biondizzle	a9ea30353c	CUDA graph: Fix sync violations (Category 1-2) 1. mhc.py: Remove .item() from post_block (122 syncs/step eliminated) - The X_next.abs().max().item() was syncing EVERY layer's post_block - Diagnostics moved to caller (outside graph region) 2. linear.py: Pre-allocate _scale_a_buf in _ensure_buffer_size - _assemble_scales_single_group now uses pre-allocated buffer - Eliminates per-call torch.zeros() allocation (graph capture killer) 3. shared_expert.py: Same fix — use pre-allocated padded_x_sf_buf - _assemble_scales_single_group no longer allocates 4. quantize.py: Remove .contiguous() from gsa expand - expand() creates stride-0 view, CUDA kernel reads correctly - No allocation on the hot path 5. Add CUDA_GRAPH_SYNC_INVENTORY.md with full violation catalog	2026-06-03 16:37:20 +00:00
biondizzle	caac8ae108	Fix syntax error: 'is not not None' -> 'is not None'	2026-06-03 16:34:33 +00:00
biondizzle	ba68212fa7	Add CUDA graph readiness detector (Section A of GETTING_CUDAGRAPH_READY.md) - Grep for Section B sync patterns in hot path files - Method 1: run decode forward with torch.cuda.set_sync_debug_mode('error') - Method 2: attempt CUDA graph capture of L0 decode step - Full model load + prefill + warmup before detection - Results saved to /tmp/cuda_graph_readiness_results.json	2026-06-03 16:34:15 +00:00
biondizzle	ca5bc814d5	Fix compressor: do not add positional bias to KV content The positional bias (ape/B) should only modulate the compression softmax logits (Z + B), NOT be added to the KV content itself. Paper equation: compressed = softmax(Z + B) · C Bug was doing: compressed = softmax(Z + B) · (C + B) — poisons every compressed KV entry with learned positional-bias content. Fixed in both CSA (compress_csa_reduce_kernel) and HCA (hca_compress_reduce_kernel) paths in compressor_reduce.cu.	2026-06-03 15:52:00 +00:00
biondizzle	4fe73fe713	auto: pre-test commit v-precision-floor-fix-20260603	2026-06-03 15:45:15 +00:00
biondizzle	f577ed97f4	Fix: Use PyTorch dequant_nvfp4 for weight dequantization (compressor/indexer/router gate) The CUDA dequantize_nvfp4 (dsv4/ops/quantize.py) was designed for activations/KV and assumes row-major (M, N/16) scale layout. Using it for weight dequantization caused async illegal memory access because weight scales don't match the kernel's expected layout. The kernel only validates row count, not width or contiguity. All 4 call sites now use the PyTorch dequant_nvfp4 (defined in single_shot_inference.py) which handles weight_scale_2 and input_scale correctly and cannot cause OOB access: - Compressor.load: kv_proj, gate_proj - Indexer.load: weights_proj - Router gate dequantization in main()	2026-06-03 14:57:40 +00:00
biondizzle	1121cd7b47	Add CUDA_LAUNCH_BLOCKING=1 to catch async errors	2026-06-03 14:48:51 +00:00
biondizzle	f3bb0ca08c	Fix dequant gsa: use ws2 only, NOT input_scale * ws2 For weight dequantization, gsa should be weight_scale_2 only. input_scale is the activation global scale — it belongs on the GEMM's activation side, not the weight side. Using input_scale * ws2 gave gsa = 6e-8 (essentially zero), making dequantized weights ~0. The GEMM formula is y = (x * scale_a * gsa) @ (w * scale_b * gsb) where gsb = input_scale * ws2. But dequantize_nvfp4 is just the weight half: w_bf16 = lut[w] * block_scale * ws2.	2026-06-03 14:38:24 +00:00
biondizzle	470e65fb19	Fix dequant gsb: input_scale * ws2, not 1.0 * ws2 The NVFP4 dequantize formula is w = lut[w_packed] * scale * ws2, and in the GEMM the global_scale_b = input_scale * ws2. Was incorrectly using gsb = 1.0 * ws2 (missing input_scale). This would produce wrongly-scaled BF16 weights from dequantize_nvfp4.	2026-06-03 14:26:59 +00:00
biondizzle	2dd16d5789	Switch compressor + indexer weights_proj to BF16 F.linear Only the CSA indexer QK path (q_b_proj) is explicitly FP4-QATed. The rest of the compressor/indexer projections are NOT, so use BF16: - Compressor kv_proj, gate_proj: dequantize NVFP4 → BF16, F.linear - Indexer weights_proj: dequantize NVFP4 → BF16, F.linear - Indexer q_b_proj: KEEP as NVFP4 (this IS the FP4-QATed path) - Indexer compressor: inherits Compressor's BF16 path	2026-06-03 14:19:41 +00:00
biondizzle	95e45a87e3	Add explicit .to(dev) on W_gate after transpose — belt and suspenders	2026-06-03 14:17:02 +00:00
biondizzle	ef94c48957	Simplify router gate: dequant NVFP4 → BF16, F.linear (no FP8 middleman) Same as what worked before. The checkpoint stores NVFP4 weights, so we dequantize once at load time and use cuBLAS F.linear. No FP8 re-quantize step needed — that was just adding noise on top of the NVFP4 dequant.	2026-06-03 14:14:10 +00:00
biondizzle	715602c87c	Switch lm_head to BF16 + router gate to FP8_E4M3 lm_head: BF16 F.linear (checkpoint weight is BF16, no quantization) Router gate: FP8_E4M3 quantize→dequantize round-trip, then F.linear - Dequantize NVFP4 checkpoint weights to BF16 first - Quantize to FP8_E4M3 (scale = amax/448) - Dequantize back to BF16 for F.linear - Uses BF16 dispatch path in dense_router_dispatch - Simpler scale wiring than NVFP4 (single per-tensor scale)	2026-06-03 14:10:28 +00:00
biondizzle	7901470e63	doc clean up v-official-encoding-path	2026-06-03 10:53:41 +00:00
biondizzle	ca7c309463	Add reference/ dir: vLLM tokenizers, reasoning parsers, tool parsers, official inference - reference/vllm/tokenizers/ — official DSV4 tokenizer + encoding (read-only) - reference/vllm/reasoning/ — thinking mode parsers (DeepSeekR1 style) - reference/vllm/tool_parsers/ — DSML tool call parsers (V3.2 base, V4 variant) - reference/official_inference/ — original weight's generate.py, model.py, kernel.py - reference/README.md documents the layout and which files matter for our pipeline - These are read-only references for cross-checking, not imported by production code	2026-06-03 10:25:23 +00:00
biondizzle	8cfc1cae58	Canonical encoding: derive special token IDs from official encoding module + tokenizer - Remove hardcoded THINK_START/THINK_END/USER_TOKEN/ASSISTANT_TOKEN IDs - Import token strings from encoding.deepseek_v4_encoding (official source) - Resolve IDs via tokenizer.convert_tokens_to_ids() at runtime - Use parse_message_from_completion_text() for structured output parsing - No more hand-rolled prompt construction or hardcoded token IDs - Clean up TEMP: replace old deepseek_v4_ref with dsv4thing.zip reference	2026-06-03 10:23:02 +00:00
biondizzle	a86d6d90a5	Replace hand-rolled prompt with official DSV4 encoder (canonical path) - Copied deepseek_v4_encoding.py from vLLM tree to encoding/ - Replaced hand-rolled prompt construction with encode_messages() - --chat-mode → --thinking-mode (thinking|chat) - The official encoder handles: BOS, User/Assistant tokens, thinking mode, tool calls, and all special token placement. It can't drift. - This is the same code path inference engines will use.	2026-06-03 09:59:05 +00:00
biondizzle	284fc9ca86	Fix: thread comp_rope_cos/comp_rope_sin through forward_attention Previous commit added params to forward_layer but forward_attention (where compressed RoPE is applied) didn't receive them, causing NameError. Also confirmed from B200 test output: compress_rope_theta=160000 vs rope_theta=10000 — a 16x difference. The separate cache is essential.	2026-06-03 09:30:57 +00:00
biondizzle	6a3374da18	Cross-check 2 complete: block-aligned comp_pos + compress_rope_theta wired through - Fixed comp_pos: (bi*r) block-aligned instead of ((bi+1)*r-1) last-position - compress_rope_theta: separate rope cache for compressed KV entries - comp_rope_cos/comp_rope_sin wired to all forward_layer call sites (prefill chunk loop, decode loop, CUDAGraphDecoder capture) - forward_layer uses comp_rope caches for compressed RoPE, falls back to normal - Only single_shot_inference.py modified, no kernel code touched	2026-06-03 09:19:11 +00:00
biondizzle	5003e756e2	WIP: cross-check 2 fix — block-aligned compressed RoPE positions + compress_rope_theta support - CRITICAL BUG FIX: comp_pos was using LAST position of each block ((bi+1)*r-1) instead of FIRST position (bi*r). Off by r-1: 3 for CSA, 127 for HCA. vLLM uses (position // ratio) * ratio = block-aligned first position. - Added compress_rope_theta config support (vLLM uses separate theta for compressed) - Added comp_rope_cos/comp_rope_sin param to forward_layer (not yet wired through) Only single_shot_inference.py changed — no kernel code touched. Base commit: `572bdd2`	2026-06-03 09:17:54 +00:00
biondizzle	572bdd2840	auto: pre-test commit	2026-06-03 09:01:02 +00:00
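
Commit 5003e756e2 describes the compressed-RoPE position fix only in words. The arithmetic can be sketched in plain Python as below. The function names are hypothetical and do not come from single_shot_inference.py. The HCA ratio of 128 is stated in the log; the CSA ratio of 4 is an assumption inferred from the stated "off by r-1: 3 for CSA".

```python
def first_pos(block_index: int, ratio: int) -> int:
    """Block-aligned (first) position of a compressed block: bi * r."""
    return block_index * ratio


def last_pos(block_index: int, ratio: int) -> int:
    """Last position of a compressed block: (bi + 1) * r - 1 (the pre-fix value)."""
    return (block_index + 1) * ratio - 1


def aligned_pos(position: int, ratio: int) -> int:
    """Block-aligned position for an arbitrary token position."""
    return (position // ratio) * ratio


# HCA r=128 appears in the log; CSA r=4 is inferred (assumption).
RATIOS = {"CSA": 4, "HCA": 128}


def offsets() -> dict:
    """Gap between the pre-fix and post-fix position, per compressor type."""
    return {name: last_pos(0, r) - first_pos(0, r) for name, r in RATIOS.items()}


if __name__ == "__main__":
    for name, r in RATIOS.items():
        bi = 5  # arbitrary block index for illustration
        print(name, "first:", first_pos(bi, r), "last:", last_pos(bi, r))
    print("offsets:", offsets())
```

Every position inside a block maps to the same aligned value, so `aligned_pos(last_pos(bi, r), r)` equals `first_pos(bi, r)`; the gap between the two conventions is always `r - 1`, independent of the block index.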


2366 Commits
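
Commit ca5bc814d5 states the compressor fix as a formula: the positional bias B belongs in the softmax logits only, not in the content C. The following single-channel Python sketch illustrates the difference under that reading. All names are hypothetical, the inputs are made-up scalars, and the real kernels in compressor_reduce.cu operate on tensors rather than Python lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compress_fixed(z, b, c):
    """compressed = softmax(Z + B) . C  (bias shapes the weights only)."""
    w = softmax([zi + bi for zi, bi in zip(z, b)])
    return sum(wi * ci for wi, ci in zip(w, c))


def compress_buggy(z, b, c):
    """compressed = softmax(Z + B) . (C + B)  (bias leaks into the content)."""
    w = softmax([zi + bi for zi, bi in zip(z, b)])
    return sum(wi * (ci + bi) for wi, ci, bi in zip(w, c, b))


if __name__ == "__main__":
    z = [0.5, -1.0, 2.0]   # illustrative compression logits
    b = [0.1, 0.2, -0.3]   # illustrative positional bias
    c = [1.0, 2.0, 3.0]    # illustrative KV content (one channel)
    print("fixed:", compress_fixed(z, b, c))
    print("buggy:", compress_buggy(z, b, c))
```

The two results differ by exactly the softmax-weighted sum of B, which is the bias content the commit message says was added to every compressed KV entry.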