2376 Commits

Author SHA1 Message Date
5487a58df4 Fix NameError: add rows/cols variables to MoE swizzle 2026-06-04 03:14:27 +00:00
a434545d12 Blackwell swizzle CUDA kernel for CUDA graph capture
Python view operations (reshape, transpose, permute) are not
graph-capturable — they cause cudaErrorStreamCaptureUnsupported.

Added:
- dsv4/kernels/cuda/blackwell_swizzle.cu: custom CUDA kernel for 32_4_4 swizzle
- to_blocked(): detects graph capture, uses CUDA kernel instead of Python views
- MoE _assemble_scales_cudagraph_safe: same treatment
- Shared expert _assemble_scales_single_group: same treatment
- Linear _assemble_scales_single_group: same treatment
- Pre-allocated swizzled output buffers for all layers (avoids torch.empty_like)

The CUDA kernel writes to a pre-allocated buffer — no per-step allocations.
Eager path unchanged (still uses fast Python view operations).
2026-06-04 03:03:02 +00:00
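For readers unfamiliar with the swizzle named in the commit above, it can be described as a pure index mapping from a row-major scale matrix into a flat, pre-allocated buffer. The sketch below assumes the common 128x4 block-scale tiling (rows padded to 128, columns padded to 4, each tile stored as 32 rows of 16 values). `swizzled_offset` and `swizzle_into` are hypothetical names for illustration, and the actual kernel in blackwell_swizzle.cu may be laid out differently.

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def swizzled_offset(row: int, col: int, cols: int) -> int:
    """Flat destination offset of scale (row, col) under the assumed layout."""
    n_col_blocks = ceil_div(cols, 4)
    tile = (row // 128) * n_col_blocks + (col // 4)  # which 128x4 tile
    r, c = row % 128, col % 4                        # position inside the tile
    # Inside a tile: 32 rows of 16 values, 4 row-groups interleaved per row.
    return tile * 512 + (r % 32) * 16 + (r // 32) * 4 + c


def swizzle_into(src, rows: int, cols: int, out) -> None:
    """Write row-major src (rows x cols) into the pre-allocated flat list out."""
    for row in range(rows):
        for col in range(cols):
            out[swizzled_offset(row, col, cols)] = src[row * cols + col]


rows, cols = 128, 4                      # hypothetical sizes, exactly one tile
src = list(range(rows * cols))
# The output buffer is sized for the padded shape and allocated once.
out = [0] * (ceil_div(rows, 128) * 128 * ceil_div(cols, 4) * 4)
swizzle_into(src, rows, cols, out)
```

Because each output element depends only on its own (row, col), the same mapping can be computed by one GPU thread per element and written straight into a fixed buffer, which is what makes it compatible with graph capture.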
e7766254b7 Pre-allocate ALL GEMM output buffers for CUDA graph capture
Every run_nvfp4_grouped_gemm call must pass out= with a pre-allocated
buffer. During CUDA graph capture, torch.zeros() allocations are
forbidden — they cause 'cudaErrorStreamCaptureUnsupported' errors.

Added:
- shared_expert: _l2_out_buf for L2 GEMM
- shared_expert: pass out= for both L1 and L2 GEMM calls
- moe: _l2_out_buf for L2 GEMM
- moe: pass out= for unfused L1 GEMM (fused L1 already had it)
- moe: pass out= for L2 GEMM
- linear: _gemm_out_buf for all GEMM calls
- linear: pass out= for both run() and run_from_quantized() paths

grouped_linear already had _output_buf_padded — no changes needed.
2026-06-04 02:41:59 +00:00
676a0448c0 CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption
The L1 GEMM produces gate+up combined output with 2*intermediate_size
BF16 columns, but _l1_out_buf was only allocated with intermediate_size
columns. The GEMM wrote past the buffer boundary, corrupting GPU memory
and causing cudaErrorInvalidValue on subsequent operations.

This was the root cause of ALL the cudaErrorInvalidValue errors in the
shared expert and MoE L2 paths — the corrupted memory from the L1 buffer
overflow propagated downstream.

Fix: _l1_out_buf shape (max_rows, 2*intermediate_size) instead of
(max_rows, intermediate_size). Applied to both shared_expert.py and moe.py.

Also removed all DEBUG sync/print statements from quantize.py and
shared_expert.py — the bug was not in the quantize kernels, it was
the buffer overflow.
2026-06-04 02:06:18 +00:00
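The sizing error described in the commit above comes down to a shape check that can run before any GEMM is launched. This is a minimal sketch in plain Python: `l1_out_shape` and `fits` are hypothetical helpers and the sizes are made up, so only the `2*intermediate_size` relationship is taken from the commit message.

```python
def l1_out_shape(max_rows: int, intermediate_size: int) -> tuple:
    """Shape needed for the combined gate+up L1 output."""
    return (max_rows, 2 * intermediate_size)


def fits(buf_shape: tuple, gemm_rows: int, gemm_cols: int) -> bool:
    """True if a (gemm_rows, gemm_cols) result fits inside the buffer."""
    return gemm_rows <= buf_shape[0] and gemm_cols <= buf_shape[1]


max_rows, intermediate_size = 8, 2048          # hypothetical sizes
gemm_cols = 2 * intermediate_size              # gate and up columns together

old_shape = (max_rows, intermediate_size)      # the too-narrow allocation
new_shape = l1_out_shape(max_rows, intermediate_size)

print(fits(old_shape, max_rows, gemm_cols))    # False: the write would overflow
print(fits(new_shape, max_rows, gemm_cols))    # True
```

A check of this kind at buffer-allocation time would likely fail loudly at the source rather than surfacing as cudaErrorInvalidValue in a later, unrelated operation.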
0890e578f4 DEBUG: print l1_out shape before gate/up split 2026-06-04 01:49:12 +00:00
8546ed725f DEBUG: check SE input magnitude 2026-06-04 01:38:24 +00:00
26ecf96328 DEBUG: check intermediate magnitude before SE L2 2026-06-04 01:30:29 +00:00
5303d6a82f DEBUG: test copy_ with contiguous slice vs scalar assign for gsa 2026-06-04 01:27:25 +00:00
ccbc713658 DEBUG: check gsa values and pinpoint exact failing operation 2026-06-04 01:16:37 +00:00
e77455c3ba DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors 2026-06-04 01:05:47 +00:00
55def5eef9 Restore A/B split + gsa scalar fix (error is pre-existing, not regression) 2026-06-04 01:03:36 +00:00
59eccd04ab REVERT: test if cudaErrorInvalidValue is pre-existing or regression 2026-06-04 00:53:09 +00:00
5e3ced0b60 DEBUG: isolate which kernel causes cudaErrorInvalidValue in SE L2 path 2026-06-04 00:41:28 +00:00
b314fde9b7 Fix gsa copy_ cudaErrorInvalidValue: replace view-based copy_ with scalar assignment
The view-based copy_ into the gsa buffer causes
cudaErrorInvalidValue when gsa_gpu is a non-contiguous expanded view
(e.g., shape (9,) from quantize_nvfp4_gpu_fused during prefill with M>1).

Root cause: copy_() from an expanded/reshaped view can fail when the
source tensor has non-standard strides. The expand() operation creates
a view with stride-0 dimensions that copy_() may not handle correctly
on all CUDA versions.

Fix: Replace all gsa copy_ patterns with scalar assignment:
  self._gsa_buf[0] = gsa_gpu[0]  # scalar GPU→GPU, graph-capturable

This is simpler, avoids view issues, and is CUDA-graph-compatible.
Applied to: shared_expert.py, moe.py, linear.py, grouped_linear.py
2026-06-04 00:30:21 +00:00
993bb345d1 DEBUG: fix VERBOSE reference in shared_expert, always print L2 gsa debug 2026-06-04 00:15:38 +00:00
f0f87df906 DEBUG: add sync + shape prints to shared_expert L2 gsa copy 2026-06-04 00:05:08 +00:00
1d6610c46d CUDA graph A/B split: eager-break-at-attention architecture
CUDAGraphDecoder now splits each layer into two graph-captured regions
with eager attention in between:

  Graph A (pre-attention):  mHC pre_block + fused RMSNorm + quantize
                              + q_a/q_b/kv projections
                              → writes intermediates to pre-allocated buffers
  Eager (attention):          Compressor → Indexer → FMHA → o_proj
                              → dynamic shapes, data-dependent control flow
  Graph B (post-attention):   mHC post_block + FFN + Router + MoE + SE
                              → writes X_next to pre-allocated output buffer

The attention path has dynamic shapes (FMHA seq_len grows, compressor
returns None) and cannot be captured. The compute path has fixed shapes
for T=1 decode and CAN be captured.

Changes:
- CUDAGraphDecoder: 2 graphs per layer (A/B) + lm_head graph
- Pre-allocated intermediate buffers for graph A → eager → graph B boundary
- forward_attention: accepts optional q_heads/kv_3d to skip projections
- Replay loop: graph A → eager attention → graph B per layer

This replaces the single-graph-per-layer approach which failed at L1+
because the attention path contains data-dependent control flow and
dynamic shapes that cannot be captured.
2026-06-03 23:53:08 +00:00
800e974d20 Update CUDA_GRAPH_SYNC_INVENTORY.md with session 2 progress
- Category 6: Per-step allocations (partially fixed, 6 done, ~6 blocking)
- Category 7: CuTeDSL from_dlpack fix (v3 works, v1/v2 failed)
- Category 8: Cross-GPU operations in graph capture (fixed)
- CUDAGraphDecoder architecture: single-graph-per-layer (simplified from A/B split)
- Multi-layer capture still blocked by Category 6 allocations
2026-06-03 23:41:42 +00:00
a468f72a0e CUDA graph: Pre-allocate L1 GEMM output buffers in MoE and SharedExpert
Pass out= parameter to run_fused_swiglu_grouped_gemm to avoid per-step
torch.zeros() allocation during CUDA graph capture.
2026-06-03 23:17:43 +00:00
56b816a54f CUDA graph: Use per-GPU position/token buffers for graph capture
Cross-GPU .to() calls inside graph capture cause 'dependency on uncaptured
work in another stream'. Fix: pass dec_pos_per_gpu/dec_tid32_per_gpu to
capture() so each layer's graph uses buffers on its own GPU.
2026-06-03 22:56:20 +00:00
f57de06eb5 Fix grouped_linear GEMM output buffer shape and extraction
- _output_buf_padded: (max_tokens * n_groups, o_lora_rank) — matches GEMM output
- Extraction: groups are stacked vertically, not horizontally
- Each group's output is (padded_rows, o_lora_rank) with o_lora_rank columns
2026-06-03 22:26:40 +00:00
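The vertical stacking described in the commit above can be sketched as row arithmetic on the flat GEMM output. The helper names and sizes here are hypothetical, and the sketch assumes each group occupies a contiguous band of `padded_rows` rows with its real tokens at the top of the band.

```python
def group_row_offsets(n_groups: int, padded_rows: int) -> list:
    """Starting row of each group in the vertically stacked output."""
    return [g * padded_rows for g in range(n_groups)]


def extract_group(out_rows: list, g: int, padded_rows: int, n_tokens: int) -> list:
    """Rows of group g that hold real tokens (padding rows are skipped)."""
    start = g * padded_rows
    return out_rows[start:start + n_tokens]


n_groups, padded_rows, n_tokens = 3, 4, 1      # hypothetical sizes, T=1 decode
# Stand-in for the flat output: one label per row instead of o_lora_rank columns.
out_rows = [f"g{r // padded_rows}_row{r % padded_rows}"
            for r in range(n_groups * padded_rows)]

offsets = group_row_offsets(n_groups, padded_rows)
per_group = [extract_group(out_rows, g, padded_rows, n_tokens)
             for g in range(n_groups)]
```

With these sizes the groups start at rows 0, 4, and 8, and the single decode token of each group is the first row of its band.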
92225b07e7 CUDA graph: Simplify to single-graph-per-layer capture (revert A/B split)
The A/B split approach was too complex: it required splitting forward_layer,
handling the eager FMHA section, and fixing per-GPU buffer issues. The
simpler approach captures the entire forward_layer as one graph per layer,
just like the detector test did for L0.

This works because:
- FMHA pads KV to 128 → fixed shape for graph capture
- Compressor returns None on non-boundary steps → graph captures the path
  taken during warmup (typically the None path for HCA r=128)
- All sync violations were already fixed in previous commits

The capture still uses dec_pos_buf/dec_tid32_buf on cuda:0 (forward_layer
handles device transfer internally).
2026-06-03 22:04:18 +00:00
b32713c302 grouped_linear: Pre-allocate output buffer for grouped GEMM (CUDA graph capture)
Add _output_buf_padded for the flat GEMM output, pass as out= parameter
to run_nvfp4_grouped_gemm to avoid per-step torch.zeros() allocation.
2026-06-03 22:02:01 +00:00
676fad064f Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature 2026-06-03 21:45:15 +00:00
188ecae47f CUDA graph: Eliminate per-step allocations in graph-captured code paths
- gemm_runner.py: Add out= parameter to run_nvfp4_grouped_gemm and
  run_fused_swiglu_grouped_gemm to accept pre-allocated output buffers
- quantize.py: Replace torch.zeros_like/torch.zeros with scalar 0.0 in
  torch.where() calls (graph-capturable, no memory allocation)
- Both fixes prevent 'Disallowed operation during CUDA stream capture'
  errors during graph capture
2026-06-03 21:30:24 +00:00
91c370360a Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3)
Patch torch.cuda.current_device to return the tensor's device index
during from_dlpack calls inside CUDA graph capture. This bypasses the
device check in __dlpack__ without changing the CUDA stream (which
caused 'Capture must end on the same stream' in v1) and without
triggering a cross-device copy (which caused 'Cannot copy between
CPU and CUDA tensors' in v2).
2026-06-03 21:09:12 +00:00
5c94dbbc37 Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v2)
Previous fix (set_device) caused 'Capture must end on the same stream'.
New fix: wrap tensor in _DLPatchTensor during graph capture, which forces
dl_device in __dlpack__ to bypass the device check without changing the stream.

This enables CUDA graph capture on all 8 GPUs, not just cuda:0.
2026-06-03 20:54:18 +00:00
87b6c9932b Fix CuTeDSL from_dlpack device mismatch inside CUDA graph capture
When capturing CUDA graphs on non-default GPUs, torch.cuda.current_device()
may not match the tensor's device. from_dlpack() checks this and fails.
Fix: set the current device to match the tensor's device before from_dlpack.

This enables graph capture on all 8 GPUs, not just cuda:0.
2026-06-03 20:34:24 +00:00
2661cebe9a Fix warmup_gsa: handle multi-element _gsa_buf (Nvfp4GroupedLinear per-group gsa) 2026-06-03 19:49:54 +00:00
486f74d900 CUDA graph: Implement eager-break-at-attention decoder with sub-graph A/B split
Architecture:
- Sub-graph A (per layer): mHC pre + fused rmsnorm/quantize + Q/KV projections + RoPE
- Eager section: KV append + Compressor + Indexer + KV gather + FMHA + Inverse RoPE
- Sub-graph B (per layer): o_proj + mHC post(attn) + mHC pre(FFN) + fused rmsnorm/quantize + Router + MoE + SE + mHC post(FFN)
- lm_head graph on cuda:0

Key features:
- Per-GPU token/position buffers (avoids cross-device .to() inside graphs)
- Pre-allocated I/O buffers with fixed addresses for graph capture
- Uses fused P5 rmsnorm+quantize path inside graphs (production path)
- Captures after step 0 warmup (after CuTeDSL compile + gsa fix)
- Eager path unchanged for warmup and --no-cuda-graph runs
- eager_attention() extracted from forward_attention() for graph replay path

Wires --cuda-graph flag into main() decode loop.
2026-06-03 19:24:26 +00:00
5ea3aa3406 Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md
- L0 CUDA graph capture PASSES on B200
- All compute-forward sync violations fixed
- 3/5 Section C hazards done, 2 deferred to Phase 2
- Full violation fix log with commits
- Next steps: extend to all 61 layers + replay verification
2026-06-03 19:15:27 +00:00
80bb27f5bf CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode
The stride-0 expand view for gsa_gpu caused illegal memory access
in quantize_nvfp4_from_buffer kernel. The CUDA kernel may not handle
stride-0 tensors correctly.

Fix:
- M=1 decode (graph-captured): just reshape scalar to (1,) — no alloc
- M>1 prefill (not graph-captured): expand + contiguous — allocation OK
2026-06-03 18:08:18 +00:00
518a1d3f95 CUDA graph: Fix MoE scatter_add_ index dtype + fix second bincount
1. scatter_add_ requires int64 indices — ensure sorted_ids is .long()
2. Fixed the SECOND torch.bincount call (line 590) — same scatter_add_ pattern
3. Both code paths now use pre-allocated _tokens_per_expert_buf
2026-06-03 17:53:40 +00:00
f13a81d48b CUDA graph: Fix per-call allocations in grouped_linear and quantize
1. grouped_linear.py: Pre-allocate _scale_a_buf for swizzle
   - Same fix as linear.py — avoids torch.zeros per call
   - Uses correctly-sized view for pad_and_swizzle_single

2. quantize.py: Replace torch.zeros_like with scalar 0.0
   - torch.zeros_like allocates a full tensor every call
   - torch.where(cond, 0.0, x) broadcasts scalar — no allocation
2026-06-03 17:39:20 +00:00
84655d066a CUDA graph: Fix MoE bincount and per-call allocations (Hazard #4)
1. Replace torch.bincount with scatter_add_ into pre-allocated buffer
   - bincount produces data-dependent shapes → breaks graph capture
   - scatter_add_ with pre-allocated _tokens_per_expert_buf (fixed shape)
   - Pre-allocated _ones_buf to avoid per-call torch.ones()

2. Replace torch.full for l1_gsa with pre-allocated buffer + fill_
   - torch.full allocates every call → breaks graph capture
   - Use self._l1_gsa_buf.fill_(l1_gs) instead
2026-06-03 17:37:03 +00:00
df05289d6f CUDA graph: Fix remaining sync violations from B200 detector run 2
1. grouped_linear.py: Remove conditional host read of GPU tensor
   - 'if group_offsets[0] != 0' reads GPU value on host → sync
   - Fix: unconditionally update offsets every call (GPU-only multiply)

2. test_cuda_graph_readiness.py: Use pinned CPU buffers for token transfer
   - dec_tid_buf[0] = python_int → CPU→GPU sync
   - Fix: write to pinned CPU buffer, then copy_ (async, graph-capturable)

3. Add dsv4/decode/cuda_graph_decoder.py (skeleton)
2026-06-03 17:20:34 +00:00
e07d79868f CUDA graph: Fix _assemble_scales_single_group swizzle size
The pre-allocated buffer is max-sized, but pad_and_swizzle_single
operates on the full buffer dimensions. Fix: pass a correctly-sized
view (buf[:padded_rows, :padded_cols]) so the swizzle produces the
right output size.

Same fix applied to both linear.py and shared_expert.py.
2026-06-03 17:02:34 +00:00
0ca7bed0e1 CUDA graph: Fix sync violations found by B200 detector
Fixes from running Section A detector on B200:

1. single_shot_inference.py: Use pinned CPU buffers for token/position transfer
   - dec_tid_buf[0] = python_int causes CPU→GPU sync
   - Fixed: write to pinned CPU buffer, then copy_ (async, graph-capturable)

2. grouped_linear.py: Fix expert_offsets Python loop
   - expert_offsets[g] = python_int * padded_rows → CPU→GPU sync per iteration
   - Fixed: element-wise multiply with pre-allocated range tensor (GPU-only)

3. grouped_linear.py: Vectorized output extraction for T=1 decode
   - Python loop z[:, g, :] = out[...] → CPU sync for each slice
   - Fixed: GPU gather with pre-computed indices for T=1

4. grouped_linear.py: Pre-allocate output buffer
   - torch.empty() per call → allocation inside graph
   - Fixed: use self._output_buf (pre-allocated at max size)

5. grouped_linear.py: Pre-allocate expert_offsets_range_buf
   - torch.arange() per call → allocation inside graph
   - Fixed: compute once at init, reuse via element-wise multiply
2026-06-03 16:52:19 +00:00
46a3a51832 CUDA graph: Fix per-step allocations in decode loop
1. mHCLayer.init_state: Add out_buf parameter for in-place write
   - Pre-allocated dec_X_buf (1, 4, 7168) on cuda:0
   - Eliminates .unsqueeze().expand().clone() allocation each step

2. single_shot_inference.py: Pre-allocate dec_embed_buf
   - Placeholder for embedding output (graph capture will use this)

3. Note: Cross-GPU X.to() transfers still allocate per step
   - This requires per-GPU X buffers (part of graph capture architecture)
2026-06-03 16:38:35 +00:00
a9ea30353c CUDA graph: Fix sync violations (Category 1-2)
1. mhc.py: Remove .item() from post_block (122 syncs/step eliminated)
   - The X_next.abs().max().item() was syncing EVERY layer's post_block
   - Diagnostics moved to caller (outside graph region)

2. linear.py: Pre-allocate _scale_a_buf in _ensure_buffer_size
   - _assemble_scales_single_group now uses pre-allocated buffer
   - Eliminates per-call torch.zeros() allocation (graph capture killer)

3. shared_expert.py: Same fix — use pre-allocated padded_x_sf_buf
   - _assemble_scales_single_group no longer allocates

4. quantize.py: Remove .contiguous() from gsa expand
   - expand() creates stride-0 view, CUDA kernel reads correctly
   - No allocation on the hot path

5. Add CUDA_GRAPH_SYNC_INVENTORY.md with full violation catalog
2026-06-03 16:37:20 +00:00
caac8ae108 Fix syntax error: 'is not not None' -> 'is not None' 2026-06-03 16:34:33 +00:00
ba68212fa7 Add CUDA graph readiness detector (Section A of GETTING_CUDAGRAPH_READY.md)
- Grep for Section B sync patterns in hot path files
- Method 1: run decode forward with torch.cuda.set_sync_debug_mode('error')
- Method 2: attempt CUDA graph capture of L0 decode step
- Full model load + prefill + warmup before detection
- Results saved to /tmp/cuda_graph_readiness_results.json
2026-06-03 16:34:15 +00:00
ca5bc814d5 Fix compressor: do not add positional bias to KV content
The positional bias (ape/B) should only modulate the compression
softmax logits (Z + B), NOT be added to the KV content itself.

Paper equation: compressed = softmax(Z + B) · C
Bug was doing: compressed = softmax(Z + B) · (C + B) — poisons every
compressed KV entry with learned positional-bias content.

Fixed in both CSA (compress_csa_reduce_kernel) and HCA
(hca_compress_reduce_kernel) paths in compressor_reduce.cu.
2026-06-03 15:52:00 +00:00
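The difference between the two formulas in the commit above can be reproduced in a few lines of plain Python. This sketch simplifies the shapes (one scalar logit and one scalar bias per position, a small content vector per position), which is an assumption for illustration and not the layout used in compressor_reduce.cu. The function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def compress(Z, B, C):
    """compressed = softmax(Z + B) . C, with B affecting the weights only."""
    w = softmax([z + b for z, b in zip(Z, B)])
    return [sum(w[i] * C[i][d] for i in range(len(C))) for d in range(len(C[0]))]


def compress_buggy(Z, B, C):
    """The pre-fix behaviour: B is also added to the content."""
    w = softmax([z + b for z, b in zip(Z, B)])
    return [sum(w[i] * (C[i][d] + B[i]) for i in range(len(C)))
            for d in range(len(C[0]))]


Z = [0.0, 0.0]
B = [1.0, 1.0]                  # equal bias leaves the softmax weights unchanged
C = [[1.0, 2.0], [3.0, 4.0]]

fixed = compress(Z, B, C)       # plain average of the content rows
buggy = compress_buggy(Z, B, C) # same average, shifted by the bias
```

With an equal bias on every position the weights stay at 0.5 each, so the fixed result is the plain average of the content rows while the buggy result is that average shifted by the bias.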
4fe73fe713 auto: pre-test commit v-precision-floor-fix-20260603 2026-06-03 15:45:15 +00:00
f577ed97f4 Fix: Use PyTorch dequant_nvfp4 for weight dequantization (compressor/indexer/router gate)
The CUDA dequantize_nvfp4 (dsv4/ops/quantize.py) was designed for
activations/KV and assumes row-major (M, N/16) scale layout. Using it
for weight dequantization caused async illegal memory access because
weight scales don't match the kernel's expected layout. The kernel only
validates row count, not width or contiguity.

All 4 call sites now use the PyTorch dequant_nvfp4 (defined in
single_shot_inference.py) which handles weight_scale_2 and input_scale
correctly and cannot cause OOB access:
- Compressor.load: kv_proj, gate_proj
- Indexer.load: weights_proj
- Router gate dequantization in main()
2026-06-03 14:57:40 +00:00
1121cd7b47 Add CUDA_LAUNCH_BLOCKING=1 to catch async errors 2026-06-03 14:48:51 +00:00
f3bb0ca08c Fix dequant gsa: use ws2 only, NOT input_scale * ws2
For weight dequantization, gsa should be weight_scale_2 only.
input_scale is the activation global scale — it belongs on the GEMM's
activation side, not the weight side. Using input_scale * ws2 gave
gsa = 6e-8 (essentially zero), making dequantized weights ~0.

The GEMM formula is y = (x * scale_a * gsa) @ (w * scale_b * gsb)
where gsb = input_scale * ws2. But dequantize_nvfp4 is just the
weight half: w_bf16 = lut[w] * block_scale * ws2.
2026-06-03 14:38:24 +00:00
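The weight-side formula in the commit above, w_bf16 = lut[w] * block_scale * ws2, can be written out as a small reference routine. This is a sketch under stated assumptions: the lookup table holds the standard FP4 E2M1 values, one block scale covers 16 consecutive weights, and the 4-bit codes are already unpacked (byte packing order is deliberately left out). `dequant_row` is a hypothetical name, not the repository's dequant_nvfp4.

```python
# Standard FP4 E2M1 code points: sign bit, then magnitudes 0..6.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
BLOCK = 16  # weights per block scale (assumed from the N/16 scale layout)


def dequant_row(codes, block_scales, ws2):
    """Dequantize one row of unpacked 4-bit codes.

    Only weight_scale_2 (ws2) is applied as the global scale here;
    input_scale belongs to the activation side of the GEMM.
    """
    return [E2M1_LUT[code] * block_scales[i // BLOCK] * ws2
            for i, code in enumerate(codes)]


codes = [2] * 16 + [7] * 16          # hypothetical codes: 1.0 and 6.0
block_scales = [2.0, 0.5]            # one scale per 16 weights
ws2 = 0.25
row = dequant_row(codes, block_scales, ws2)
```

Here the first block dequantizes to 1.0 * 2.0 * 0.25 and the second to 6.0 * 0.5 * 0.25, so multiplying in an additional tiny input_scale would drive every weight toward zero, as the commit message describes.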
470e65fb19 Fix dequant gsb: input_scale * ws2, not 1.0 * ws2
The NVFP4 dequantize formula is w = lut[w_packed] * scale * ws2,
and in the GEMM the global_scale_b = input_scale * ws2. Was incorrectly
using gsb = 1.0 * ws2 (missing input_scale). This would produce
wrongly-scaled BF16 weights from dequantize_nvfp4.
2026-06-03 14:26:59 +00:00
2dd16d5789 Switch compressor + indexer weights_proj to BF16 F.linear
Only the CSA indexer QK path (q_b_proj) is explicitly FP4-QATed.
The rest of the compressor/indexer projections are NOT, so use BF16:

- Compressor kv_proj, gate_proj: dequantize NVFP4 → BF16, F.linear
- Indexer weights_proj: dequantize NVFP4 → BF16, F.linear
- Indexer q_b_proj: KEEP as NVFP4 (this IS the FP4-QATed path)
- Indexer compressor: inherits Compressor's BF16 path
2026-06-03 14:19:41 +00:00
95e45a87e3 Add explicit .to(dev) on W_gate after transpose — belt and suspenders 2026-06-03 14:17:02 +00:00