21 Commits

Author SHA1 Message Date
676a0448c0 CRITICAL FIX: _l1_out_buf was 2x too narrow — caused GPU memory corruption
The L1 GEMM produces gate+up combined output with 2*intermediate_size
BF16 columns, but _l1_out_buf was only allocated with intermediate_size
columns. The GEMM wrote past the buffer boundary, corrupting GPU memory
and causing cudaErrorInvalidValue on subsequent operations.

This was the root cause of ALL the cudaErrorInvalidValue errors in the
shared expert and MoE L2 paths — the corrupted memory from the L1 buffer
overflow propagated downstream.

Fix: _l1_out_buf shape (max_rows, 2*intermediate_size) instead of
(max_rows, intermediate_size). Applied to both shared_expert.py and moe.py.

Also removed all DEBUG sync/print statements from quantize.py and
shared_expert.py — the bug was not in the quantize kernels, it was
the buffer overflow.
2026-06-04 02:06:18 +00:00
e77455c3ba DEBUG: add sync inside quantize_nvfp4_gpu_fused to catch async errors 2026-06-04 01:05:47 +00:00
188ecae47f CUDA graph: Eliminate per-step allocations in graph-captured code paths
- gemm_runner.py: Add out= parameter to run_nvfp4_grouped_gemm and
  run_fused_swiglu_grouped_gemm to accept pre-allocated output buffers
- quantize.py: Replace torch.zeros_like/torch.zeros with scalar 0.0 in
  torch.where() calls (graph-capturable, no memory allocation)
- Both fixes prevent 'Disallowed operation during CUDA stream capture'
  errors during graph capture
2026-06-03 21:30:24 +00:00
80bb27f5bf CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode
The stride-0 expand view for gsa_gpu caused illegal memory access
in quantize_nvfp4_from_buffer kernel. The CUDA kernel may not handle
stride-0 tensors correctly.

Fix:
- M=1 decode (graph-captured): just reshape scalar to (1,) — no alloc
- M>1 prefill (not graph-captured): expand + contiguous — allocation OK
2026-06-03 18:08:18 +00:00
f13a81d48b CUDA graph: Fix per-call allocations in grouped_linear and quantize
1. grouped_linear.py: Pre-allocate _scale_a_buf for swizzle
   - Same fix as linear.py — avoids torch.zeros per call
   - Uses correctly-sized view for pad_and_swizzle_single

2. quantize.py: Replace torch.zeros_like with scalar 0.0
   - torch.zeros_like allocates a full tensor every call
   - torch.where(cond, 0.0, x) broadcasts scalar — no allocation
2026-06-03 17:39:20 +00:00
a9ea30353c CUDA graph: Fix sync violations (Category 1-2)
1. mhc.py: Remove .item() from post_block (122 syncs/step eliminated)
   - The X_next.abs().max().item() was syncing EVERY layer's post_block
   - Diagnostics moved to caller (outside graph region)

2. linear.py: Pre-allocate _scale_a_buf in _ensure_buffer_size
   - _assemble_scales_single_group now uses pre-allocated buffer
   - Eliminates per-call torch.zeros() allocation (graph capture killer)

3. shared_expert.py: Same fix — use pre-allocated padded_x_sf_buf
   - _assemble_scales_single_group no longer allocates

4. quantize.py: Remove .contiguous() from gsa expand
   - expand() creates stride-0 view, CUDA kernel reads correctly
   - No allocation on the hot path

5. Add CUDA_GRAPH_SYNC_INVENTORY.md with full violation catalog
2026-06-03 16:37:20 +00:00
5e09be08af Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill)
The intermediate tensor from fused SwiGLU deinterleave is a column slice
(non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and
the CUDA kernel crashes with 'input must be contiguous'.

Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused
and in SharedExpert._run_l2. This is the root cause, not a workaround —
CUDA kernels legitimately require contiguous memory.
2026-06-03 07:56:19 +00:00
c926c4a597 P5: Fix mhc_rmsnorm_quantize_nvfp4 — add proper function definition 2026-06-02 17:57:33 +00:00
bdf0b15d45 P4: Fix rmsnorm_quantize_nvfp4 returns QuantizedActivation not tuple 2026-06-02 17:43:21 +00:00
454dbdad52 P5: Fused mHC pre_block + RMSNorm + NVFP4 quantize kernel
- fused_mhc_rmsnorm_quantize.cu: 2-kernel approach
  Kernel 1: mhc_rmsnorm_amax_gsa — bmm + RMS + amax → gsa
  Kernel 2: mhc_rmsnorm_quantize_nvfp4 — bmm + normalize + quantize
- Python bridge: mhc_rmsnorm_quantize_nvfp4() in ops/quantize.py
- Unit test: test_fused_mhc_rmsnorm_quantize.py (production shapes)
- Eliminates ~610 kernel launches per token (122 sites × 5 launches saved)
2026-06-02 16:39:42 +00:00
0d1cd1e216 P4: Add QuantizedActivation + Nvfp4Linear.run_from_quantized
- QuantizedActivation: carries (x_fp4, x_sf, gsa) for skip-quantize path
- Nvfp4Linear.run_from_quantized(): runs GEMM with pre-quantized input
- Enables fused RMSNorm+quantize to feed directly into all downstream
  linears (q_a, kv, o_proj, etc.) without re-quantizing
2026-06-02 16:37:38 +00:00
57ab4b9d4c P4: Fix dequantize_nvfp4 bridge — handle float8_e4m3fn dtype 2026-06-02 16:31:56 +00:00
794ebaf7e5 P4: Fused RMSNorm + NVFP4 quantize kernel (2 launches vs 6+)
- fused_rmsnorm_quantize.cu: two-kernel approach
  Kernel 1: rmsnorm_amax_gsa — compute RMS + amax of normalized output → gsa per row
  Kernel 2: rmsnorm_quantize_nvfp4 — normalize + quantize using GPU-computed gsa
- Python bridge: rmsnorm_quantize_nvfp4() in ops/quantize.py
- Python bridge: dequantize_nvfp4() in ops/quantize.py
- Unit test: test_fused_rmsnorm_quantize.py (production shapes: 7168 hidden)
- Eliminates ~488 kernel launches per token (122 sites × 4 launches saved)
2026-06-02 16:26:24 +00:00
9fec7d609e Fix gsa_buffer shape mismatch for MoE (M>1 rows)
compute_amax_gsa returns a scalar, but quantize_from_buffer expects (M,).
Broadcast the scalar gsa to (M,) — all rows use the same gsa (global max).
2026-06-01 21:33:59 +00:00
cacf64232e CRITICAL FIX: fused_amax_quantize cross-CTA race condition
The single-kernel approach used __syncthreads() for cross-CTA amax
reduction, but __syncthreads() only syncs within a CTA (same blockIdx).
CTA 0 reading s_amax[1] before CTA 1 writes = race condition = garbage gsa.

Result: residual |X| exploded to 10^37 by L0. F_attn and F_ffn were 0.0.

Fix: Two-kernel approach (correct, zero CPU syncs):
  Kernel 1: amax_gsa.cu — computes gsa on GPU, returns GPU tensor
  Kernel 2: quantize_nvfp4_from_buffer — reads gsa from GPU buffer

The fused_amax_quantize.cu now exports quantize_nvfp4_from_buffer and
deinterleave_quantize_from_buffer (gsa from GPU buffer, not kernel param).

Same P0 win: zero .item() syncs. Two kernel launches instead of one,
but correctness > shaving one launch.
2026-06-01 21:26:51 +00:00
c8faf20a99 P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path
Fused kernels (zero CPU sync, single kernel launch per projection):
- fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step
  compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync).
- fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path.
  Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu
  + deinterleave_quantize_nvfp4_cuda (had .item() sync).

All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache).
Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call,
~500 calls/token). Now compiles once and reuses the cached module.

Updated layers:
- linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer
- moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and
  non-fused paths)
- shared_expert.py: fused for L1 and L2
- quantize.py: All functions use module loader cache
- sampler.py: Uses module loader cache
- indexer/score_topk.py: Uses module loader cache

P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop.
2 kernel launches instead of 2T. No .item() in comp_pos either.

P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat.
max_comp=32768 per layer (32MB). No more quadratic memory growth.

~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).
2026-06-01 21:05:03 +00:00
360f76b970 Performance audit fixes: eliminate CPU-GPU syncs
PERFORMANCE_AUDIT.md validation results:
  1. Nvfp4Linear .item() sync (610/step) → FIXED: compute_amax_gsa_gpu kernel
  2. MoE .item() sync (183/step) → FIXED: same kernel
  3. SharedExpert .item() sync (122/step) → FIXED: same kernel
  4. FMHA V clone → FIXED: V=K, transpose creates copy implicitly
  5. torch.cuda.synchronize in moe_forward → FIXED: conditional on VERBOSE
  6. RoPE 8x duplication → INVALIDATED: necessary for per-GPU HBM access
  7. mHC BF16 bmm → INVALIDATED: 28K FLOPs, not a bottleneck
  8. Router .float() cast → INVALIDATED: needed for FP32 topk, ~1μs

New files:
  - dsv4/kernels/cuda/amax_gsa.cu: GPU-only amax→gsa kernel
  - dsv4/ops/quantize.py: compute_amax_gsa_gpu() wrapper

Net effect: ~915 fewer CPU-GPU syncs per decode step
Remaining syncs: ~10 per layer (quantize kernel parameter) + diagnostics
2026-06-01 20:40:19 +00:00
6e53e3007c fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow 2026-06-01 04:59:06 +00:00
c2e3d15633 NVFP4-1.1 integration: GPU-only quantize kernel + MoE pipeline wiring
- Add quantize_nvfp4.cu: BF16→FP4 GPU kernel (no CPU sync, warp shuffle amax)
- Add quantize_nvfp4_gpu() bridge in ops/quantize.py
- Fix deinterleave_quantize kernel path (dsv4/ops/kernels → dsv4/kernels/cuda)
- Wire GPU quantize into Nvfp4MoE._run_impl():
  - L1 input: quantize_nvfp4_gpu (replaces quantize_activation_nvfp4)
  - Fused SwiGLU L2: deinterleave_quantize_nvfp4_cuda (single kernel)
  - Non-fused L2: quantize_nvfp4_gpu
- Add test_nvfp4_gpu_quantize.py for both kernels
2026-05-25 16:19:07 +00:00
401e24768a fix: import ceil_div in quantize.py (was NameError at runtime) 2026-05-23 08:40:24 +00:00
3fb3c925af Restructure: cutedsl/ -> dsv4/ with proper layering
- Split bridge.py -> ops/quantize.py, ops/layouts.py, ops/gemm_runner.py
- Renamed classes: CuTeDSLNvfp4Linear -> Nvfp4Linear, etc.
- Moved kernel code to dsv4/kernels/ (gemm, attention, compressor, decode, cuda)
- Moved PyTorch bridges to dsv4/ops/
- Moved nn.Module layers to dsv4layers/
- Moved reference implementations to dsv4/reference/
- Moved vendored CUTLASS code to vendored/
- Archived ~190 debug tests to tests/archive/
- Kept ~15 canonical tests in tests/unit/
- Updated all import paths
- Added stubs for future components (model/, cache/, loader/)
- Updated pyproject.toml: dsv4-inference package name
2026-05-21 17:30:44 +00:00