Commit Graph

2399 Commits

Author SHA1 Message Date
ac213bdee8 Update docs: CUDA graph capture WORKING on all 8 GPUs, 0.28s/token (2x eager) 2026-06-06 08:29:40 +00:00
6650f06121 CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay on multi-GPU — fixes zero-output bug 2026-06-06 08:18:18 +00:00
90ac38cde0 Add CUDA graph stream management test 2026-06-06 08:14:29 +00:00
26042e3f01 Add minimal CUDA graph multi-GPU test to isolate zero-output bug 2026-06-06 08:13:18 +00:00
86275851d4 Add minimal CUDA graph test per GPU during capture to isolate multi-GPU graph issue 2026-06-06 08:02:35 +00:00
2cbf7a43e9 Add sync after cross-GPU copy before graph replay; remove misleading zero-input verification 2026-06-06 07:51:22 +00:00
2bb52c7cae Add per-layer graph capture verification — replay immediately and check for zeros 2026-06-06 07:40:19 +00:00
5a98cc6d90 Store pre-cached norm weights on self to prevent GC during graph replay — root cause of all-zeros replay bug 2026-06-06 07:29:33 +00:00
dcb2495a5b Add graph replay debug prints for first 3 steps/layers 2026-06-06 07:19:07 +00:00
16b9a4def2 Fix CUDA graph replay: set device to cuda:0 before lm_head graph replay 2026-06-06 07:18:49 +00:00
f259d63930 CRITICAL FIX: SE swizzled buffers were allocated then overwritten with None — graph capture would fall through to broken Python path 2026-06-06 07:01:52 +00:00
32902d1036 CUDA graph capture: derive q_a_dim from config, pre-cache norm weights, add buffer verification, use direct dict access for routers/moe/se 2026-06-06 07:01:12 +00:00
64f547058e Fix graph replay: pass q_a from Graph A output to forward_attention
- q_a is needed by the indexer in CSA layers
- When q_heads/kv_3d are provided (graph replay), the projection code is
  skipped so q_a is never computed
- Fix: add q_a_bufs to CUDAGraphDecoder, write q_a during Graph A capture,
  pass q_a as kwarg to forward_attention during graph replay
- Also: forward_attention now accepts q_a kwarg (default None)
2026-06-04 08:09:30 +00:00
26da6d33af Fix graph replay: remove extra token_id arg from forward_attention call
The forward_attention() signature has no token_id parameter, but the
graph replay path was passing dec_tid32_per_gpu[gpu] between positions
and compressor — causing the int tensor to be interpreted as compressor
and triggering AttributeError: 'int' object has no attribute 'ratio'
2026-06-04 06:10:02 +00:00
ae26f6b83c Fix dense router BF16 dispatch: use torch.matmul instead of F.linear
- F.linear(x, W) computes x @ W.T which caused shape mismatch when
  W_gate was pre-transposed to [E, H]
- Use torch.matmul(x, W_gate) instead — computes x @ W directly, no
  transpose needed, no FP32 conversion, fully graph-capturable
- W_gate stays as [H, E] (original checkpoint shape)
2026-06-04 05:58:24 +00:00
e46b615873 Fix dense router BF16 dispatch for CUDA graph capture
- Run GEMM in BF16 (not FP32) during graph capture — Blackwell tensor cores
  handle BF16 natively; FP32 GEMM triggers cudaErrorStreamCaptureUnsupported
- Pre-transpose W_gate to [E, H] at load time — avoids .T view during capture
- Convert only logits output to FP32 for sqrt(softplus) numerical stability
- This fixes the graph capture failure at layer 0 Graph B
2026-06-04 05:50:13 +00:00
b4a59d0940 Update CUDA graph docs with current status, A/B split, buffer fixes, remaining blockers
GETTING_CUDAGRAPH_READY.md:
- Updated architecture section for A/B split (Graph A + eager attention + Graph B)
- Updated Section D integration order with current progress
- Added all recent violation fix commits

CUDA_GRAPH_SYNC_INVENTORY.md:
- Added Category 6 fixes: _l1_out_buf 2x fix, GEMM output pre-allocation, swizzle CUDA kernel, gsa scalar assignment, router BF16 fix
- Added remaining blockers for next session
- Updated CUDAGraphDecoder architecture description for A/B split
- Added capture/replay flow description
2026-06-04 05:13:51 +00:00
ffa7842b58 Fix dense router: run GEMM in BF16, convert to FP32 only for activation
hidden_states.float() and gate_bf16.T.float() create new FP32 tensors
during CUDA graph capture, which is not graph-capturable.

Fix: run the linear in BF16 (Blackwell tensor cores handle BF16 natively),
then convert only the output logits to FP32 for numerical stability
in sqrt(softplus). The single logits.float() is graph-capturable
because it's a unary op with a pre-existing output buffer.
2026-06-04 04:49:08 +00:00
119e6d471e Add safety check for swizzled buffers: fall through to Python path if None 2026-06-04 04:32:00 +00:00
fae61d3ef7 Add c10/cuda/CUDAStream.h include for getCurrentCUDAStream 2026-06-04 04:13:40 +00:00
ee86969f6c Fix CUDA stream: use c10::cuda::getCurrentCUDAStream() directly in kernel launch 2026-06-04 03:57:59 +00:00
e26c28a1ce Fix CUDA stream API: getCurrentCUDAStream().stream() 2026-06-04 03:43:04 +00:00
9b3917e248 Fix blackwell_swizzle.cu: add pybind11 bindings for torch extension loader 2026-06-04 03:29:10 +00:00
5487a58df4 Fix NameError: add rows/cols variables to MoE swizzle 2026-06-04 03:14:27 +00:00
a434545d12 Blackwell swizzle CUDA kernel for CUDA graph capture
Python view operations (reshape, transpose, permute) are not
graph-capturable — they cause cudaErrorStreamCaptureUnsupported.

Added:
- dsv4/kernels/cuda/blackwell_swizzle.cu: custom CUDA kernel for 32_4_4 swizzle
- to_blocked(): detects graph capture, uses CUDA kernel instead of Python views
- MoE _assemble_scales_cudagraph_safe: same treatment
- Shared expert _assemble_scales_single_group: same treatment
- Linear _assemble_scales_single_group: same treatment
- Pre-allocated swizzled output buffers for all layers (avoids torch.empty_like)

The CUDA kernel writes to a pre-allocated buffer — no per-step allocations.
Eager path unchanged (still uses fast Python view operations).
2026-06-04 03:03:02 +00:00
e7766254b7 Pre-allocate ALL GEMM output buffers for CUDA graph capture
Every run_nvfp4_grouped_gemm call must pass out= with a pre-allocated
buffer. During CUDA graph capture, torch.zeros() allocations are
forbidden — they cause 'cudaErrorStreamCaptureUnsupported' errors.

Added:
- shared_expert: _l2_out_buf for L2 GEMM
- shared_expert: pass out= for both L1 and L2 GEMM calls
- moe: _l2_out_buf for L2 GEMM
- moe: pass out= for unfused L1 GEMM (fused L1 already had it)
- moe: pass out= for L2 GEMM
- linear: _gemm_out_buf for all GEMM calls
- linear: pass out= for both run() and run_from_quantized() paths

grouped_linear already had _output_buf_padded — no changes needed.
2026-06-04 02:41:59 +00:00
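
As a rough illustration of the out= pattern described in commit e7766254b7, the sketch below uses plain Python lists in place of GPU tensors. `gemm_into` and `Layer` are hypothetical names and are not the repository's `run_nvfp4_grouped_gemm`. The only point is that the output buffer is allocated once, outside the per-step path, and every call writes into it.

```python
def gemm_into(a, b, out):
    """Write a @ b into the caller-supplied buffer `out` (no new allocation)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if len(out) < rows or len(out[0]) != cols:
        raise ValueError("out buffer is too small for this GEMM")
    for i in range(rows):
        for j in range(cols):
            out[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))
    return out


class Layer:
    """Hypothetical layer that owns its output buffer for its whole lifetime."""

    def __init__(self, weight, max_rows):
        self.weight = weight
        # Allocated once at load time, sized for the largest expected input.
        self._gemm_out_buf = [[0.0] * len(weight[0]) for _ in range(max_rows)]

    def run(self, x):
        # Every step reuses the same buffer object instead of allocating.
        return gemm_into(x, self.weight, self._gemm_out_buf)
```

Calling `run` twice returns the same buffer object both times, which is the property a captured graph relies on: the addresses recorded at capture time stay valid at replay time.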
676a0448c0 CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption
The L1 GEMM produces gate+up combined output with 2*intermediate_size
BF16 columns, but _l1_out_buf was only allocated with intermediate_size
columns. The GEMM wrote past the buffer boundary, corrupting GPU memory
and causing cudaErrorInvalidValue on subsequent operations.

This was the root cause of ALL the cudaErrorInvalidValue errors in the
shared expert and MoE L2 paths — the corrupted memory from the L1 buffer
overflow propagated downstream.

Fix: _l1_out_buf shape (max_rows, 2*intermediate_size) instead of
(max_rows, intermediate_size). Applied to both shared_expert.py and moe.py.

Also removed all DEBUG sync/print statements from quantize.py and
shared_expert.py — the bug was not in the quantize kernels, it was
the buffer overflow.
2026-06-04 02:06:18 +00:00
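
The sizing rule from commit 676a0448c0 can be sketched as a shape helper plus a guard. Both function names are hypothetical and only illustrate the arithmetic: the L1 GEMM writes gate and up side by side, so the buffer needs 2*intermediate_size columns.

```python
def l1_out_buf_shape(max_rows, intermediate_size):
    """Shape needed for the combined gate+up output of the L1 GEMM."""
    return (max_rows, 2 * intermediate_size)


def check_gemm_fits(out_shape, gemm_rows, gemm_cols):
    """Raise instead of letting a GEMM write past the end of its buffer."""
    buf_rows, buf_cols = out_shape
    if gemm_rows > buf_rows or gemm_cols > buf_cols:
        raise ValueError(
            f"GEMM output ({gemm_rows}, {gemm_cols}) does not fit "
            f"buffer ({buf_rows}, {buf_cols})"
        )
```

A host-side check like this would have flagged the too-narrow buffer before launch, rather than surfacing later as cudaErrorInvalidValue in an unrelated kernel.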
0890e578f4 DEBUG: print l1_out shape before gate/up split 2026-06-04 01:49:12 +00:00
8546ed725f DEBUG: check SE input magnitude 2026-06-04 01:38:24 +00:00
26ecf96328 DEBUG: check intermediate magnitude before SE L2 2026-06-04 01:30:29 +00:00
5303d6a82f DEBUG: test copy_ with contiguous slice vs scalar assign for gsa 2026-06-04 01:27:25 +00:00
ccbc713658 DEBUG: check gsa values and pinpoint exact failing operation 2026-06-04 01:16:37 +00:00
e77455c3ba DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors 2026-06-04 01:05:47 +00:00
55def5eef9 Restore A/B split + gsa scalar fix (error is pre-existing, not regression) 2026-06-04 01:03:36 +00:00
59eccd04ab REVERT: test if cudaErrorInvalidValue is pre-existing or regression 2026-06-04 00:53:09 +00:00
5e3ced0b60 DEBUG: isolate which kernel causes cudaErrorInvalidValue in SE L2 path 2026-06-04 00:41:28 +00:00
b314fde9b7 Fix gsa copy_ cudaErrorInvalidValue: replace view-based copy_ with scalar assignment
The view-based copy_ pattern causes
cudaErrorInvalidValue when gsa_gpu is a non-contiguous expanded view
(e.g., shape (9,) from quantize_nvfp4_gpu_fused during prefill with M>1).

Root cause: copy_() from an expanded/reshaped view can fail when the
source tensor has non-standard strides. The expand() operation creates
a view with stride-0 dimensions that copy_() may not handle correctly
on all CUDA versions.

Fix: Replace all gsa copy_ patterns with scalar assignment:
  self._gsa_buf[0] = gsa_gpu[0]  # scalar GPU→GPU, graph-capturable

This is simpler, avoids view issues, and is CUDA-graph-compatible.
Applied to: shared_expert.py, moe.py, linear.py, grouped_linear.py
2026-06-04 00:30:21 +00:00
993bb345d1 DEBUG: fix VERBOSE reference in shared_expert, always print L2 gsa debug 2026-06-04 00:15:38 +00:00
f0f87df906 DEBUG: add sync + shape prints to shared_expert L2 gsa copy 2026-06-04 00:05:08 +00:00
1d6610c46d CUDA graph A/B split: eager-break-at-attention architecture
CUDAGraphDecoder now splits each layer into two graph-captured regions
with eager attention in between:

  Graph A (pre-attention):  mHC pre_block + fused RMSNorm + quantize
                              + q_a/q_b/kv projections
                              → writes intermediates to pre-allocated buffers
  Eager (attention):          Compressor → Indexer → FMHA → o_proj
                              → dynamic shapes, data-dependent control flow
  Graph B (post-attention):   mHC post_block + FFN + Router + MoE + SE
                              → writes X_next to pre-allocated output buffer

The attention path has dynamic shapes (FMHA seq_len grows, compressor
returns None) and cannot be captured. The compute path has fixed shapes
for T=1 decode and CAN be captured.

Changes:
- CUDAGraphDecoder: 2 graphs per layer (A/B) + lm_head graph
- Pre-allocated intermediate buffers for graph A → eager → graph B boundary
- forward_attention: accepts optional q_heads/kv_3d to skip projections
- Replay loop: graph A → eager attention → graph B per layer

This replaces the single-graph-per-layer approach which failed at L1+
because the attention path contains data-dependent control flow and
dynamic shapes that cannot be captured.
2026-06-03 23:53:08 +00:00
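
The replay order described in commit 1d6610c46d can be sketched without a GPU. In this minimal sketch, plain callables stand in for the replay of a captured graph and for the eager attention call; `LayerStages`, `replay_decode_step`, and `make_traced_layers` are hypothetical names, not the `CUDAGraphDecoder` API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LayerStages:
    """Three per-layer stages; plain callables stand in for graph replay."""
    graph_a: Callable[[], None]    # pre-attention, fixed shapes
    attention: Callable[[], None]  # eager, dynamic shapes
    graph_b: Callable[[], None]    # post-attention, fixed shapes


def replay_decode_step(layers: List[LayerStages], lm_head: Callable[[], None]) -> None:
    """Run one decode step: graph A, eager attention, graph B per layer."""
    for layer in layers:
        layer.graph_a()
        layer.attention()
        layer.graph_b()
    lm_head()


def make_traced_layers(n_layers: int, trace: List[str]) -> List[LayerStages]:
    """Build stand-in layers that record the order in which stages run."""
    def stage(name: str) -> Callable[[], None]:
        return lambda: trace.append(name)
    return [
        LayerStages(stage(f"L{i}.A"), stage(f"L{i}.attn"), stage(f"L{i}.B"))
        for i in range(n_layers)
    ]
```

In the real decoder the stages communicate through pre-allocated buffers rather than return values, which is why the stand-ins here take no arguments.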
800e974d20 Update CUDA_GRAPH_SYNC_INVENTORY.md with session 2 progress
- Category 6: Per-step allocations (partially fixed, 6 done, ~6 blocking)
- Category 7: CuTeDSL from_dlpack fix (v3 works, v1/v2 failed)
- Category 8: Cross-GPU operations in graph capture (fixed)
- CUDAGraphDecoder architecture: single-graph-per-layer (simplified from A/B split)
- Multi-layer capture still blocked by Category 6 allocations
2026-06-03 23:41:42 +00:00
a468f72a0e CUDA graph: Pre-allocate L1 GEMM output buffers in MoE and SharedExpert
Pass out= parameter to run_fused_swiglu_grouped_gemm to avoid per-step
torch.zeros() allocation during CUDA graph capture.
2026-06-03 23:17:43 +00:00
56b816a54f CUDA graph: Use per-GPU position/token buffers for graph capture
Cross-GPU .to() calls inside graph capture cause 'dependency on uncaptured
work in another stream'. Fix: pass dec_pos_per_gpu/dec_tid32_per_gpu to
capture() so each layer's graph uses buffers on its own GPU.
2026-06-03 22:56:20 +00:00
f57de06eb5 Fix grouped_linear GEMM output buffer shape and extraction
- _output_buf_padded: (max_tokens * n_groups, o_lora_rank) — matches GEMM output
- Extraction: groups are stacked vertically, not horizontally
- Each group's output is (padded_rows, o_lora_rank) with o_lora_rank columns
2026-06-03 22:26:40 +00:00
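
A minimal sketch of the vertical-stacking extraction from commit f57de06eb5, using a list of rows in place of the GEMM output tensor. It assumes each group occupies `padded_rows` consecutive rows of the flat buffer; `extract_group` and `n_valid` are hypothetical names introduced for illustration.

```python
def extract_group(flat_out, group, padded_rows, n_valid):
    """Return the valid rows of one group from a vertically stacked buffer."""
    if n_valid > padded_rows:
        raise ValueError("n_valid cannot exceed padded_rows")
    # Group g starts at row g * padded_rows; padding rows follow the valid ones.
    start = group * padded_rows
    return flat_out[start:start + n_valid]
```

Reading the same buffer as if groups were laid out horizontally would slice columns instead of rows and mix outputs from different groups.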
92225b07e7 CUDA graph: Simplify to single-graph-per-layer capture (revert A/B split)
The A/B split approach was too complex: it required splitting forward_layer,
handling the eager FMHA section, and fixing per-GPU buffer issues. The
simpler approach captures the entire forward_layer as one graph per layer,
just like the detector test did for L0.

This works because:
- FMHA pads KV to 128 → fixed shape for graph capture
- Compressor returns None on non-boundary steps → graph captures the path
  taken during warmup (typically the None path for HCA r=128)
- All sync violations were already fixed in previous commits

The capture still uses dec_pos_buf/dec_tid32_buf on cuda:0 (forward_layer
handles device transfer internally).
2026-06-03 22:04:18 +00:00
b32713c302 grouped_linear: Pre-allocate output buffer for grouped GEMM (CUDA graph capture)
Add _output_buf_padded for the flat GEMM output, pass as out= parameter
to run_nvfp4_grouped_gemm to avoid per-step torch.zeros() allocation.
2026-06-03 22:02:01 +00:00
676fad064f Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature 2026-06-03 21:45:15 +00:00
188ecae47f CUDA graph: Eliminate per-step allocations in graph-captured code paths
- gemm_runner.py: Add out= parameter to run_nvfp4_grouped_gemm and
  run_fused_swiglu_grouped_gemm to accept pre-allocated output buffers
- quantize.py: Replace torch.zeros_like/torch.zeros with scalar 0.0 in
  torch.where() calls (graph-capturable, no memory allocation)
- Both fixes prevent 'Disallowed operation during CUDA stream capture'
  errors during graph capture
2026-06-03 21:30:24 +00:00
91c370360a Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3)
Patch torch.cuda.current_device to return the tensor's device index
during from_dlpack calls inside CUDA graph capture. This bypasses the
device check in __dlpack__ without changing the CUDA stream (which
caused 'Capture must end on the same stream' in v1) and without
triggering a cross-device copy (which caused 'Cannot copy between
CPU and CUDA tensors' in v2).
2026-06-03 21:09:12 +00:00
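
The temporary patch described in commit 91c370360a might look roughly like the context manager below. This is a sketch under assumptions: `patched_current_device` is a hypothetical name, the restore-on-exit behaviour is assumed rather than taken from the commit, and a `SimpleNamespace` stands in for `torch.cuda` so the example runs without a GPU.

```python
import contextlib
import types


@contextlib.contextmanager
def patched_current_device(cuda_module, device_index):
    """Temporarily make cuda_module.current_device() return device_index."""
    original = cuda_module.current_device
    cuda_module.current_device = lambda: device_index
    try:
        yield
    finally:
        # Always restore the original, even if the wrapped call raises.
        cuda_module.current_device = original


# Stand-in for torch.cuda (assumption: only current_device matters here).
fake_cuda = types.SimpleNamespace(current_device=lambda: 0)
```

Keeping the patch scoped to the from_dlpack call limits the window in which other code could observe the overridden device index.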
5c94dbbc37 Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v2)
Previous fix (set_device) caused 'Capture must end on the same stream'.
New fix: wrap tensor in _DLPatchTensor during graph capture, which forces
dl_device in __dlpack__ to bypass the device check without changing the stream.

This enables CUDA graph capture on all 8 GPUs, not just cuda:0.
2026-06-03 20:54:18 +00:00