- F.linear(x, W) computes x @ W.T which caused shape mismatch when
  W_gate was pre-transposed to [E, H]
- Use torch.matmul(x, W_gate) instead: computes x @ W directly, no
  transpose needed, no FP32 conversion, fully graph-capturable
- W_gate stays as [H, E] (original checkpoint shape)
- Run GEMM in BF16 (not FP32) during graph capture: Blackwell tensor cores
  handle BF16 natively; FP32 GEMM triggers cudaErrorStreamCaptureUnsupported
- Pre-transpose W_gate to [E, H] at load time; avoids .T view during capture
- Convert only logits output to FP32 for sqrt(softplus) numerical stability
- This fixes the graph capture failure at layer 0 Graph B
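For reference, a minimal CPU sketch of the two weight layouts discussed above. F.linear expects the weight as [out_features, in_features], so which call is right depends on the layout W_gate actually has at the call site. The sizes and the names w_he / w_eh are illustrative assumptions, not taken from the router code.

```python
import torch
import torch.nn.functional as F

T, H, E = 4, 16, 8                      # tokens, hidden size, experts (illustrative)
torch.manual_seed(0)
x = torch.randn(T, H)
w_he = torch.randn(H, E)                # weight stored as [H, E]
w_eh = w_he.t().contiguous()            # weight stored as [E, H], materialized once

# Both calls produce the same [T, E] logits:
logits_matmul = torch.matmul(x, w_he)   # x @ W     with W in [H, E]
logits_linear = F.linear(x, w_eh)       # x @ W.T   with W in [E, H]

# GEMM in BF16, then upcast only the logits to FP32 before sqrt(softplus).
logits_bf16 = torch.matmul(x.bfloat16(), w_he.bfloat16())
scores = torch.sqrt(F.softplus(logits_bf16.float()))
```

Passing an [H, E] weight to F.linear (or an [E, H] weight to torch.matmul) fails with a shape error whenever H != E.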
Python view operations (reshape, transpose, permute) are not
graph-capturable — they cause cudaErrorStreamCaptureUnsupported.
Added:
- dsv4/kernels/cuda/blackwell_swizzle.cu: custom CUDA kernel for 32_4_4 swizzle
- to_blocked(): detects graph capture, uses CUDA kernel instead of Python views
- MoE _assemble_scales_cudagraph_safe: same treatment
- Shared expert _assemble_scales_single_group: same treatment
- Linear _assemble_scales_single_group: same treatment
- Pre-allocated swizzled output buffers for all layers (avoids torch.empty_like)
The CUDA kernel writes to a pre-allocated buffer — no per-step allocations.
Eager path unchanged (still uses fast Python view operations).
Every run_nvfp4_grouped_gemm call must pass out= with a pre-allocated
buffer. During CUDA graph capture, torch.zeros() allocations are
forbidden — they cause 'cudaErrorStreamCaptureUnsupported' errors.
Added:
- shared_expert: _l2_out_buf for L2 GEMM
- shared_expert: pass out= for both L1 and L2 GEMM calls
- moe: _l2_out_buf for L2 GEMM
- moe: pass out= for unfused L1 GEMM (fused L1 already had it)
- moe: pass out= for L2 GEMM
- linear: _gemm_out_buf for all GEMM calls
- linear: pass out= for both run() and run_from_quantized() paths
grouped_linear already had _output_buf_padded — no changes needed.
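A rough sketch of the out= pattern, with torch.mm standing in for run_nvfp4_grouped_gemm (whose real signature is not shown here). The helper name gemm_into and the sizes are assumptions for illustration.

```python
import torch

max_rows, K, N = 8, 16, 32
out_buf = torch.empty(max_rows, N)   # allocated once, outside the hot path
w = torch.randn(K, N)

def gemm_into(x, w, out_buf):
    """Write x @ w into the leading rows of out_buf; no new output tensor."""
    rows = x.shape[0]
    out = out_buf[:rows]             # view into the pre-allocated buffer
    torch.mm(x, w, out=out)
    return out

x = torch.randn(3, K)
y = gemm_into(x, w, out_buf)
```

The returned tensor shares storage with out_buf, so the caller must consume it before the next GEMM overwrites the buffer.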
The L1 GEMM produces gate+up combined output with 2*intermediate_size
BF16 columns, but _l1_out_buf was only allocated with intermediate_size
columns. The GEMM wrote past the buffer boundary, corrupting GPU memory
and causing cudaErrorInvalidValue on subsequent operations.
This was the root cause of ALL the cudaErrorInvalidValue errors in the
shared expert and MoE L2 paths — the corrupted memory from the L1 buffer
overflow propagated downstream.
Fix: _l1_out_buf shape (max_rows, 2*intermediate_size) instead of
(max_rows, intermediate_size). Applied to both shared_expert.py and moe.py.
Also removed all DEBUG sync/print statements from quantize.py and
shared_expert.py — the bug was not in the quantize kernels, it was
the buffer overflow.
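A small sketch of the sizing rule, plus a guard that would have flagged the undersized buffer before the GEMM ran. check_out_buf is a hypothetical helper, torch.mm stands in for the real L1 GEMM, and the sizes are illustrative.

```python
import torch

def check_out_buf(buf, rows, cols):
    """Raise if a (rows, cols) GEMM output would not fit in buf."""
    if rows > buf.shape[0] or cols > buf.shape[1]:
        raise ValueError(
            f"output ({rows}, {cols}) does not fit buffer {tuple(buf.shape)}"
        )

max_rows, hidden, intermediate_size = 4, 8, 6
w_gate_up = torch.randn(hidden, 2 * intermediate_size)   # gate+up combined

# Correct: 2 * intermediate_size columns.
l1_out_buf = torch.empty(max_rows, 2 * intermediate_size)
# The old, undersized allocation, kept here only to show the check firing.
old_buf = torch.empty(max_rows, intermediate_size)

x = torch.randn(3, hidden)
check_out_buf(l1_out_buf, x.shape[0], w_gate_up.shape[1])
out = l1_out_buf[: x.shape[0]]
torch.mm(x, w_gate_up, out=out)
```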
The gsa copy_() pattern causes
cudaErrorInvalidValue when gsa_gpu is a non-contiguous expanded view
(e.g., shape (9,) from quantize_nvfp4_gpu_fused during prefill with M>1).
Root cause: copy_() from an expanded/reshaped view can fail when the
source tensor has non-standard strides. The expand() operation creates
a view with stride-0 dimensions that copy_() may not handle correctly
on all CUDA versions.
Fix: Replace all gsa copy_ patterns with scalar assignment:
self._gsa_buf[0] = gsa_gpu[0] # scalar GPU→GPU, graph-capturable
This is simpler, avoids view issues, and is CUDA-graph-compatible.
Applied to: shared_expert.py, moe.py, linear.py, grouped_linear.py
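A CPU sketch of the scalar assignment against an expanded view. The tensors here are stand-ins; only the indexing pattern mirrors the fix.

```python
import torch

# expand() yields a stride-0 view: nine logical elements, one stored value.
gsa = torch.tensor([0.25]).expand(9)
gsa_buf = torch.zeros(1)             # pre-allocated one-element buffer

# Element-to-element assignment; the source strides no longer matter.
gsa_buf[0] = gsa[0]
```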
- _output_buf_padded: (max_tokens * n_groups, o_lora_rank) — matches GEMM output
- Extraction: groups are stacked vertically, not horizontally
- Each group's output is (padded_rows, o_lora_rank) with o_lora_rank columns
1. scatter_add_ requires int64 indices — ensure sorted_ids is .long()
2. Fixed the SECOND torch.bincount call (line 590) — same scatter_add_ pattern
3. Both code paths now use pre-allocated _tokens_per_expert_buf
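A sketch of counting tokens per expert with scatter_add_ into a pre-allocated buffer instead of torch.bincount. The ones_buf source tensor and the function name count_tokens are assumptions for illustration; the real code may build the source differently.

```python
import torch

num_experts, max_tokens = 4, 16
tokens_per_expert_buf = torch.zeros(num_experts, dtype=torch.int64)
ones_buf = torch.ones(max_tokens, dtype=torch.int64)   # hypothetical, sized for max tokens

def count_tokens(sorted_ids, out_buf, ones_buf):
    """Histogram of expert ids, written into out_buf in place."""
    out_buf.zero_()
    idx = sorted_ids.long()          # scatter_add_ needs int64 indices
    out_buf.scatter_add_(0, idx, ones_buf[: idx.numel()])
    return out_buf

sorted_ids = torch.tensor([0, 0, 1, 3, 3, 3], dtype=torch.int32)
counts = count_tokens(sorted_ids, tokens_per_expert_buf, ones_buf)
```

Note that .long() returns the same tensor only when the ids are already int64; storing them as int64 up front avoids the conversion on the hot path.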
1. grouped_linear.py: Pre-allocate _scale_a_buf for swizzle
   - Same fix as linear.py: avoids torch.zeros per call
   - Uses correctly-sized view for pad_and_swizzle_single
2. quantize.py: Replace torch.zeros_like with scalar 0.0
   - torch.zeros_like allocates a full tensor every call
   - torch.where(cond, 0.0, x) broadcasts the scalar; no allocation
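A quick check that the scalar form matches the zeros_like form. The condition used here (non-finite values) is made up for the example, and passing a Python scalar to torch.where assumes a reasonably recent PyTorch (2.x).

```python
import torch

x = torch.tensor([1.0, float("nan"), -2.0, float("inf")])
cond = ~torch.isfinite(x)            # hypothetical condition

with_tensor = torch.where(cond, torch.zeros_like(x), x)   # old: extra zeros tensor
with_scalar = torch.where(cond, 0.0, x)                   # new: scalar broadcast
```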
The pre-allocated buffer is max-sized, but pad_and_swizzle_single
operates on the full buffer dimensions. Fix: pass a correctly-sized
view (buf[:padded_rows, :padded_cols]) so the swizzle produces the
right output size.
Same fix applied to both linear.py and shared_expert.py.
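A sketch of taking a correctly-sized view of a max-sized buffer. The padding multiples (128 rows, 4 columns) and the uint8 dtype are assumptions chosen for the example; the real values come from the swizzle layout.

```python
import torch

def round_up(n, multiple):
    return (n + multiple - 1) // multiple * multiple

max_rows, max_cols = 256, 16
scale_buf = torch.zeros(max_rows, max_cols, dtype=torch.uint8)   # max-sized, allocated once

rows, cols = 130, 6
padded_rows = round_up(rows, 128)    # assumed row tile
padded_cols = round_up(cols, 4)      # assumed column tile

# The view has the padded shape but shares storage with scale_buf.
view = scale_buf[:padded_rows, :padded_cols]
view[0, 0] = 7
```

When padded_cols is smaller than max_cols the view is non-contiguous, so any consumer has to honor its strides.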
Fixes from running Section A detector on B200:
1. single_shot_inference.py: Use pinned CPU buffers for token/position transfer
   - dec_tid_buf[0] = python_int causes CPU→GPU sync
   - Fixed: write to pinned CPU buffer, then copy_ (async, graph-capturable)
2. grouped_linear.py: Fix expert_offsets Python loop
   - expert_offsets[g] = python_int * padded_rows → CPU→GPU sync per iteration
   - Fixed: element-wise multiply with pre-allocated range tensor (GPU-only)
3. grouped_linear.py: Vectorized output extraction for T=1 decode
   - Python loop z[:, g, :] = out[...] → CPU sync for each slice
   - Fixed: GPU gather with pre-computed indices for T=1
4. grouped_linear.py: Pre-allocate output buffer
   - torch.empty() per call → allocation inside graph
   - Fixed: use self._output_buf (pre-allocated at max size)
5. grouped_linear.py: Pre-allocate expert_offsets_range_buf
   - torch.arange() per call → allocation inside graph
   - Fixed: compute once at init, reuse via element-wise multiply
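Fixes 1, 2 and 5 above can be sketched roughly as follows. Buffer names are placeholders, and the snippet falls back to CPU (with unpinned memory) when no GPU is present so it stays runnable anywhere.

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

# Fix 1: stage the Python int in a (pinned) host buffer, then copy_ to the device.
host_tid = torch.zeros(1, dtype=torch.int64, pin_memory=use_cuda)
dev_tid = torch.zeros(1, dtype=torch.int64, device=device)
host_tid[0] = 42                               # CPU-only write
dev_tid.copy_(host_tid, non_blocking=True)     # async host-to-device copy

# Fixes 2 and 5: build the range once, then scale it in place per call.
n_groups = 4
range_buf = torch.arange(n_groups, dtype=torch.int32, device=device)    # init time
offsets_buf = torch.empty(n_groups, dtype=torch.int32, device=device)   # init time

def update_offsets(padded_rows):
    """offsets[g] = g * padded_rows, with no Python loop and no new tensor."""
    torch.mul(range_buf, padded_rows, out=offsets_buf)
    return offsets_buf

offsets = update_offsets(128)
```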
1. mhc.py: Remove .item() from post_block (122 syncs/step eliminated)
   - The X_next.abs().max().item() was syncing EVERY layer's post_block
   - Diagnostics moved to caller (outside graph region)
2. linear.py: Pre-allocate _scale_a_buf in _ensure_buffer_size
   - _assemble_scales_single_group now uses pre-allocated buffer
   - Eliminates per-call torch.zeros() allocation (graph capture killer)
3. shared_expert.py: Same fix, use pre-allocated padded_x_sf_buf
   - _assemble_scales_single_group no longer allocates
4. quantize.py: Remove .contiguous() from gsa expand
   - expand() creates stride-0 view, CUDA kernel reads correctly
   - No allocation on the hot path
5. Add CUDA_GRAPH_SYNC_INVENTORY.md with full violation catalog
The intermediate tensor from fused SwiGLU deinterleave is a column slice
(non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and
the CUDA kernel crashes with 'input must be contiguous'.
Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused
and in SharedExpert._run_l2. This addresses the root cause rather than
working around it: CUDA kernels legitimately require contiguous memory.
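A minimal sketch of the guard on a column slice. ensure_contiguous is a hypothetical helper name; the real check lives inside quantize_nvfp4_gpu_fused and SharedExpert._run_l2.

```python
import torch

def ensure_contiguous(t):
    """Return t unchanged if it is already contiguous, else a packed copy."""
    return t if t.is_contiguous() else t.contiguous()

fused = torch.arange(24, dtype=torch.float32).reshape(3, 8)
half = fused[:, :4]                  # column slice: row stride is still 8

packed = ensure_contiguous(half)     # copies into a dense (3, 4) tensor
same = ensure_contiguous(fused)      # already dense, returned as-is
```

Tensor.contiguous() already returns the tensor itself when no copy is needed, so the explicit check mainly documents intent.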
Root cause: float row_max[n] is a VLA — not allowed in CUDA device code.
Fix: use shared memory with MHC_MAX_N=16 fixed-size slots.
Also: REMOVED the Python fallback in sinkhorn_knopp().
If the CUDA kernel fails, the pipeline DIES. No soft landing.
This is the correct behavior — silent fallback to broken precision
is worse than a loud crash.
The residual growth |X|→500-700 at L60 was likely caused by the Python
fallback running a DIFFERENT numerical path (BF16 accumulation in torch
ops vs FP32 in the CUDA kernel). With the fixed kernel, Sinkhorn should
produce properly doubly-stochastic B_l, bounding the residual.
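For checking kernel output offline (not as a runtime fallback), a doubly-stochastic test plus a plain FP32 Sinkhorn reference might look like this. Both functions are illustrative assumptions and do not reproduce the CUDA kernel's exact iteration scheme.

```python
import torch

def is_doubly_stochastic(B, tol=1e-4):
    """True if B is non-negative and every row and column sums to 1."""
    B = B.float()
    ones = torch.ones(B.shape[-1])
    rows_ok = torch.allclose(B.sum(dim=-1), ones, atol=tol)
    cols_ok = torch.allclose(B.sum(dim=-2), ones, atol=tol)
    return bool((B >= 0).all()) and rows_ok and cols_ok

def sinkhorn_reference(A, iters=50):
    """Alternate row and column normalization in FP32 (square, positive A)."""
    B = A.float().clone()
    for _ in range(iters):
        B = B / B.sum(dim=-1, keepdim=True)
        B = B / B.sum(dim=-2, keepdim=True)
    return B

torch.manual_seed(0)
A = torch.rand(4, 4) + 0.1           # strictly positive input
B = sinkhorn_reference(A)
```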
The fill_() is a CPU→GPU scalar write (tiny cost). The optimization
was marginal and the output quality regression (CJK tokens) needs
investigation separately. P2 can re-land after the regression is
confirmed to be sampling-related (not gsa-related).
P0/P1 (fused SwiGLU) still disabled — kernel arg-binding bug unfixed.
P0/P1: The fused SwiGLU kernel's warmup_fused_swiglu_compilation() triggers
'TypeError: too many positional arguments' during cute.compile(). The kernel
signature doesn't match the positional args being passed. This is a kernel-side
fix, not a single_shot fix. Disabled until the fused kernel is debugged.
P2: Landed — Nvfp4Linear skips redundant _gsa_buf.fill_() after warmup.
SE fused SwiGLU infrastructure (set_fused_swiglu, _run_l1_fused, interleaved
weight path) is wired but disabled. Will activate once kernel fix lands.
- grouped_linear.py: Replace .item() gsa + Python quantize with
quantize_nvfp4_gpu_fused (zero CPU syncs). Flatten all groups
into (G*T, D), single fused kernel launch, GPU-only gsa copy.
- single_shot_inference.py: Reduce torch.cuda.synchronize() to
every 20 steps instead of every step. Gate per-layer diagnostics
to li<3 or li>=58 (avoid 61 .item() calls per decode step).
Fused kernels (zero CPU sync, single kernel launch per projection):
- fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step
compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync).
- fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path.
Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu
+ deinterleave_quantize_nvfp4_cuda (had .item() sync).
All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache).
Previously the kernels were JIT-compiled on every call via
torch.utils.cpp_extension.load (~100ms/call, ~500 calls/token). The loader
now compiles once and reuses the cached module.
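The compile-once idea can be sketched with a memoized loader. This is not the actual loader.py API: _compile is a stand-in for the expensive build step and get_module is a hypothetical name.

```python
import functools

compile_calls = []                   # records how often the slow path runs

def _compile(name):
    """Stand-in for an expensive JIT build of the kernel called `name`."""
    compile_calls.append(name)
    return {"name": name}

@functools.lru_cache(maxsize=None)
def get_module(name):
    """Build on first request, then return the cached module object."""
    return _compile(name)
```

Every call site asks get_module for its kernel; only the first request per name pays the build cost.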
Updated layers:
- linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer
- moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and
non-fused paths)
- shared_expert.py: fused for L1 and L2
- quantize.py: All functions use module loader cache
- sampler.py: Uses module loader cache
- indexer/score_topk.py: Uses module loader cache
P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop.
2 kernel launches instead of 2T. No .item() in comp_pos either.
P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat.
max_comp=32768 per layer (32MB). No more quadratic memory growth.
~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).
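The P2 and P3 changes above can be sketched as follows. The class and function names are placeholders, and the slot positions are assumed to be computed by the caller; KVCache itself is not reproduced here.

```python
import torch

def append_window(cache, positions, new_rows):
    """Write T rows into their slots with one index_copy_ (no Python loop)."""
    cache.index_copy_(0, positions, new_rows)    # positions: int64, shape (T,)

class PreallocatedBuffer:
    """Fixed-capacity buffer: append copies into free rows instead of torch.cat."""

    def __init__(self, max_rows, dim):
        self.buf = torch.zeros(max_rows, dim)
        self.length = 0

    def append(self, rows):
        n = rows.shape[0]
        if self.length + n > self.buf.shape[0]:
            raise RuntimeError("buffer capacity exceeded")
        self.buf[self.length:self.length + n].copy_(rows)
        self.length += n

    def view(self):
        return self.buf[:self.length]            # valid prefix, no copy

cache = torch.zeros(4, 2)
append_window(cache, torch.tensor([3, 0]), torch.tensor([[1.0, 1.0], [2.0, 2.0]]))

comp = PreallocatedBuffer(max_rows=8, dim=2)
comp.append(torch.ones(2, 2))
comp.append(torch.full((1, 2), 5.0))
```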
1. o_a_proj (Nvfp4GroupedLinear): Added load_nvfp4_weight() method
that loads checkpoint NVFP4 weights directly — no more dequant→BF16→requant.
Each group's weight is transposed from (N, K_packed) checkpoint layout
to (K_packed, N) layout expected by the grouped GEMM.
2. lm_head: Quantize BF16 weight to NVFP4 at load time, use production
Nvfp4Linear GEMM instead of F.linear. Runtime gsa for activation.
Frees the 1.8GB BF16 weight after quantization.
3. Hash router (L0-2): Already optimal — tid2eid is an int32 lookup,
no GEMM to accelerate.