- q_a is needed by the indexer in CSA layers
- When q_heads/kv_3d are provided (graph replay), the projection code is
skipped so q_a is never computed
- Fix: add q_a_bufs to CUDAGraphDecoder, write q_a during Graph A capture,
pass q_a as kwarg to forward_attention during graph replay
- Also: forward_attention now accepts q_a kwarg (default None)
The forward_attention() signature has no token_id parameter, but the
graph replay path was passing dec_tid32_per_gpu[gpu] between positions
and compressor — causing the int tensor to be interpreted as compressor
and triggering AttributeError: 'int' object has no attribute 'ratio'
- F.linear(x, W) computes x @ W.T which caused shape mismatch when
W_gate was pre-transposed to [E, H]
- Use torch.matmul(x, W_gate) instead — computes x @ W directly, no
transpose needed, no FP32 conversion, fully graph-capturable
- W_gate stays as [H, E] (original checkpoint shape)
- Run GEMM in BF16 (not FP32) during graph capture — Blackwell tensor cores
handle BF16 natively; FP32 GEMM triggers cudaErrorStreamCaptureUnsupported
- Pre-transpose W_gate to [E, H] at load time — avoids .T view during capture
- Convert only logits output to FP32 for sqrt(softplus) numerical stability
- This fixes the graph capture failure at layer 0 Graph B
hidden_states.float() and gate_bf16.T.float() create new FP32 tensors
during CUDA graph capture, which is not graph-capturable.
Fix: run the linear in BF16 (Blackwell tensor cores handle BF16 natively),
then convert only the output logits to FP32 for numerical stability
in sqrt(softplus). The single logits.float() is graph-capturable
because it's a unary op with a pre-existing output buffer.
Python view operations (reshape, transpose, permute) are not
graph-capturable — they cause cudaErrorStreamCaptureUnsupported.
Added:
- dsv4/kernels/cuda/blackwell_swizzle.cu: custom CUDA kernel for 32_4_4 swizzle
- to_blocked(): detects graph capture, uses CUDA kernel instead of Python views
- MoE _assemble_scales_cudagraph_safe: same treatment
- Shared expert _assemble_scales_single_group: same treatment
- Linear _assemble_scales_single_group: same treatment
- Pre-allocated swizzled output buffers for all layers (avoids torch.empty_like)
The CUDA kernel writes to a pre-allocated buffer — no per-step allocations.
Eager path unchanged (still uses fast Python view operations).
Every run_nvfp4_grouped_gemm call must pass out= with a pre-allocated
buffer. During CUDA graph capture, torch.zeros() allocations are
forbidden — they cause 'cudaErrorStreamCaptureUnsupported' errors.
Added:
- shared_expert: _l2_out_buf for L2 GEMM
- shared_expert: pass out= for both L1 and L2 GEMM calls
- moe: _l2_out_buf for L2 GEMM
- moe: pass out= for unfused L1 GEMM (fused L1 already had it)
- moe: pass out= for L2 GEMM
- linear: _gemm_out_buf for all GEMM calls
- linear: pass out= for both run() and run_from_quantized() paths
grouped_linear already had _output_buf_padded — no changes needed.
The L1 GEMM produces gate+up combined output with 2*intermediate_size
BF16 columns, but _l1_out_buf was only allocated with intermediate_size
columns. The GEMM wrote past the buffer boundary, corrupting GPU memory
and causing cudaErrorInvalidValue on subsequent operations.
This was the root cause of ALL the cudaErrorInvalidValue errors in the
shared expert and MoE L2 paths — the corrupted memory from the L1 buffer
overflow propagated downstream.
Fix: _l1_out_buf shape (max_rows, 2*intermediate_size) instead of
(max_rows, intermediate_size). Applied to both shared_expert.py and moe.py.
Also removed all DEBUG sync/print statements from quantize.py and
shared_expert.py — the bug was not in the quantize kernels, it was
the buffer overflow.
The pattern causes
cudaErrorInvalidValue when gsa_gpu is a non-contiguous expanded view
(e.g., shape (9,) from quantize_nvfp4_gpu_fused during prefill with M>1).
Root cause: copy_() from an expanded/reshaped view can fail when the
source tensor has non-standard strides. The expand() operation creates
a view with stride-0 dimensions that copy_() may not handle correctly
on all CUDA versions.
Fix: Replace all gsa copy_ patterns with scalar assignment:
self._gsa_buf[0] = gsa_gpu[0] # scalar GPU→GPU, graph-capturable
This is simpler, avoids view issues, and is CUDA-graph-compatible.
Applied to: shared_expert.py, moe.py, linear.py, grouped_linear.py
CUDAGraphDecoder now splits each layer into two graph-captured regions
with eager attention in between:
Graph A (pre-attention): mHC pre_block + fused RMSNorm + quantize
+ q_a/q_b/kv projections
→ writes intermediates to pre-allocated buffers
Eager (attention): Compressor → Indexer → FMHA → o_proj
→ dynamic shapes, data-dependent control flow
Graph B (post-attention): mHC post_block + FFN + Router + MoE + SE
→ writes X_next to pre-allocated output buffer
The attention path has dynamic shapes (FMHA seq_len grows, compressor
returns None) and cannot be captured. The compute path has fixed shapes
for T=1 decode and CAN be captured.
Changes:
- CUDAGraphDecoder: 2 graphs per layer (A/B) + lm_head graph
- Pre-allocated intermediate buffers for graph A → eager → graph B boundary
- forward_attention: accepts optional q_heads/kv_3d to skip projections
- Replay loop: graph A → eager attention → graph B per layer
This replaces the single-graph-per-layer approach which failed at L1+
because the attention path contains data-dependent control flow and
dynamic shapes that cannot be captured.
Cross-GPU .to() calls inside graph capture cause 'dependency on uncaptured
work in another stream'. Fix: pass dec_pos_per_gpu/dec_tid32_per_gpu to
capture() so each layer's graph uses buffers on its own GPU.
- _output_buf_padded: (max_tokens * n_groups, o_lora_rank) — matches GEMM output
- Extraction: groups are stacked vertically, not horizontally
- Each group's output is (padded_rows, o_lora_rank) with o_lora_rank columns
The A/B split approach was too complex: it required splitting forward_layer,
handling the eager FMHA section, and fixing per-GPU buffer issues. The
simpler approach captures the entire forward_layer as one graph per layer,
just like the detector test did for L0.
This works because:
- FMHA pads KV to 128 → fixed shape for graph capture
- Compressor returns None on non-boundary steps → graph captures the path
taken during warmup (typically the None path for HCA r=128)
- All sync violations were already fixed in previous commits
The capture still uses dec_pos_buf/dec_tid32_buf on cuda:0 (forward_layer
handles device transfer internally).
- gemm_runner.py: Add out= parameter to run_nvfp4_grouped_gemm and
run_fused_swiglu_grouped_gemm to accept pre-allocated output buffers
- quantize.py: Replace torch.zeros_like/torch.zeros with scalar 0.0 in
torch.where() calls (graph-capturable, no memory allocation)
- Both fixes prevent 'Disallowed operation during CUDA stream capture'
errors during graph capture
Patch torch.cuda.current_device to return the tensor's device index
during from_dlpack calls inside CUDA graph capture. This bypasses the
device check in __dlpack__ without changing the CUDA stream (which
caused 'Capture must end on the same stream' in v1) and without
triggering a cross-device copy (which caused 'Cannot copy between
CPU and CUDA tensors' in v2).
Previous fix (set_device) caused 'Capture must end on the same stream'.
New fix: wrap tensor in _DLPatchTensor during graph capture, which forces
dl_device in __dlpack__ to bypass the device check without changing the stream.
This enables CUDA graph capture on all 8 GPUs, not just cuda:0.