The L1 GEMM produces gate+up combined output with 2*intermediate_size
BF16 columns, but _l1_out_buf was only allocated with intermediate_size
columns. The GEMM wrote past the buffer boundary, corrupting GPU memory
and causing cudaErrorInvalidValue on subsequent operations.
This was the root cause of ALL the cudaErrorInvalidValue errors in the
shared expert and MoE L2 paths — the corrupted memory from the L1 buffer
overflow propagated downstream.
Fix: _l1_out_buf shape (max_rows, 2*intermediate_size) instead of
(max_rows, intermediate_size). Applied to both shared_expert.py and moe.py.
Also removed all DEBUG sync/print statements from quantize.py and
shared_expert.py — the bug was not in the quantize kernels, it was
the buffer overflow.
- gemm_runner.py: Add out= parameter to run_nvfp4_grouped_gemm and
run_fused_swiglu_grouped_gemm to accept pre-allocated output buffers
- quantize.py: Replace torch.zeros_like/torch.zeros with scalar 0.0 in
torch.where() calls (graph-capturable, no memory allocation)
- Both fixes prevent 'Disallowed operation during CUDA stream capture'
errors during graph capture
Patch torch.cuda.current_device to return the tensor's device index
during from_dlpack calls inside CUDA graph capture. This bypasses the
device check in __dlpack__ without changing the CUDA stream (which
caused 'Capture must end on the same stream' in v1) and without
triggering a cross-device copy (which caused 'Cannot copy between
CPU and CUDA tensors' in v2).
Previous fix (set_device) caused 'Capture must end on the same stream'.
New fix: wrap tensor in _DLPatchTensor during graph capture, which forces
dl_device in __dlpack__ to bypass the device check without changing the stream.
This enables CUDA graph capture on all 8 GPUs, not just cuda:0.
When capturing CUDA graphs on non-default GPUs, torch.cuda.current_device()
may not match the tensor's device. from_dlpack() checks this and fails.
Fix: set the current device to match the tensor's device before from_dlpack.
This enables graph capture on all 8 GPUs, not just cuda:0.
The stride-0 expand view for gsa_gpu caused illegal memory access
in quantize_nvfp4_from_buffer kernel. The CUDA kernel may not handle
stride-0 tensors correctly.
Fix:
- M=1 decode (graph-captured): just reshape scalar to (1,) — no alloc
- M>1 prefill (not graph-captured): expand + contiguous — allocation OK
1. grouped_linear.py: Pre-allocate _scale_a_buf for swizzle
- Same fix as linear.py — avoids torch.zeros per call
- Uses correctly-sized view for pad_and_swizzle_single
2. quantize.py: Replace torch.zeros_like with scalar 0.0
- torch.zeros_like allocates a full tensor every call
- torch.where(cond, 0.0, x) broadcasts scalar — no allocation
1. mhc.py: Remove .item() from post_block (122 syncs/step eliminated)
- The X_next.abs().max().item() was syncing EVERY layer's post_block
- Diagnostics moved to caller (outside graph region)
2. linear.py: Pre-allocate _scale_a_buf in _ensure_buffer_size
- _assemble_scales_single_group now uses pre-allocated buffer
- Eliminates per-call torch.zeros() allocation (graph capture killer)
3. shared_expert.py: Same fix — use pre-allocated padded_x_sf_buf
- _assemble_scales_single_group no longer allocates
4. quantize.py: Remove .contiguous() from gsa expand
- expand() creates stride-0 view, CUDA kernel reads correctly
- No allocation on the hot path
5. Add CUDA_GRAPH_SYNC_INVENTORY.md with full violation catalog
The intermediate tensor from fused SwiGLU deinterleave is a column slice
(non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and
the CUDA kernel crashes with 'input must be contiguous'.
Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused
and in SharedExpert._run_l2. This is the root cause, not a workaround —
CUDA kernels legitimately require contiguous memory.
The single-kernel approach used __syncthreads() for cross-CTA amax
reduction, but __syncthreads() only syncs within a CTA (same blockIdx).
CTA 0 reading s_amax[1] before CTA 1 writes = race condition = garbage gsa.
Result: residual |X| exploded to 10^37 by L0. F_attn and F_ffn were 0.0.
Fix: Two-kernel approach (correct, zero CPU syncs):
Kernel 1: amax_gsa.cu — computes gsa on GPU, returns GPU tensor
Kernel 2: quantize_nvfp4_from_buffer — reads gsa from GPU buffer
The fused_amax_quantize.cu now exports quantize_nvfp4_from_buffer and
deinterleave_quantize_from_buffer (gsa from GPU buffer, not kernel param).
Same P0 win: zero .item() syncs. Two kernel launches instead of one,
but correctness > shaving one launch.
Fused kernels (zero CPU sync, single kernel launch per projection):
- fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step
compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync).
- fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path.
Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu
+ deinterleave_quantize_nvfp4_cuda (had .item() sync).
All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache).
Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call,
~500 calls/token). Now compiles once and reuses the cached module.
Updated layers:
- linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer
- moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and
non-fused paths)
- shared_expert.py: fused for L1 and L2
- quantize.py: All functions use module loader cache
- sampler.py: Uses module loader cache
- indexer/score_topk.py: Uses module loader cache
P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop.
2 kernel launches instead of 2T. No .item() in comp_pos either.
P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat.
max_comp=32768 per layer (32MB). No more quadratic memory growth.
~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).
- run_nvfp4_grouped_gemm: use_2cta = tokens_sum >= 256 and cluster_m even
- run_fused_swiglu_grouped_gemm: same conditional
- Auto-warms up on first use via lazy compilation cache
- 1.7-1.9× throughput at prefill shapes (M>=256)
- Decode (M<256) stays 1-CTA (correct, no waste)