cute.arch.fmin/fmax take scalar Float32, not TensorSSA.
Replace with cute.where() and arithmetic for TensorSSA compatibility.
Also changed subtile loop to unroll=1 for cute.where() compatibility.
The __call__ method passed these 3 Optional params to self.kernel(),
but kernel() did not accept them, causing 'TypeError: too many positional
arguments' during cute.compile(). This was the CuTeDSL 'arg-binding bug'
blocking P0/P1.
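A plain-Python analogue of the failure mode, as a rough sketch only. It does not use CuTeDSL; kernel() and try_bind() here are hypothetical stand-ins. It shows that forwarding extra positional args to a callable whose signature lacks them fails at bind time with the same message that inspect.Signature.bind raises.

```python
import inspect


def kernel(a, b):
    """Hypothetical kernel entry point that accepts only two positional args."""
    return a + b


def try_bind(fn, *args):
    """Return None if args fit fn's signature, else the TypeError message."""
    try:
        inspect.signature(fn).bind(*args)
    except TypeError as exc:
        return str(exc)
    return None


# Two base args plus three optional extras, as a caller might forward them.
error = try_bind(kernel, 1, 2, None, None, None)
print(error)
```

A bind check like this against the kernel signature before compilation would surface the mismatch without touching the GPU.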
The fill_() is a CPU→GPU scalar write (tiny cost). The optimization
was marginal, and the output quality regression (CJK tokens) needs
separate investigation. P2 can re-land once the regression is
confirmed to be sampling-related (not gsa-related).
P0/P1 (fused SwiGLU) still disabled — kernel arg-binding bug unfixed.
P0/P1: The fused SwiGLU kernel's warmup_fused_swiglu_compilation() triggers
'TypeError: too many positional arguments' during cute.compile(). The kernel
signature doesn't match the positional args being passed. This is a kernel-side
fix, not a single_shot fix. Disabled until the fused kernel is debugged.
P2: Landed — Nvfp4Linear skips redundant _gsa_buf.fill_() after warmup.
SE fused SwiGLU infrastructure (set_fused_swiglu, _run_l1_fused, interleaved
weight path) is wired but disabled. Will activate once kernel fix lands.
C1: --max-context CLI flag (default 8192). KVCache.max_comp is computed per
layer type as (max_context + compress_ratio - 1) // compress_ratio.
CSA at 8192 context → 2048 entries. HCA at 8192 → 64 entries.
No more hardcoded 65536 that wastes memory on HCA layers.
C2: Pre-allocated gather_buf (indexer_top_k + window_size, hd) in KVCache.
Gather writes compressed+SWA into this buffer via slice assignment.
Zero torch.cat allocations on the hot decode path.
C3: get_swa returns views (no .clone()). Ring-buffer wrap returns indexed
views. Caller copies into gather_buf so no aliasing risk.
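The C1 sizing rule is a ceiling division. A minimal sketch follows; the ratios 4 (CSA) and 128 (HCA) are not stated in the message and are inferred here from the quoted figures (8192 → 2048 and 8192 → 64), and max_comp_entries is a hypothetical helper name.

```python
def max_comp_entries(max_context: int, compress_ratio: int) -> int:
    """Ceiling division: compressed entries needed to cover max_context tokens."""
    return (max_context + compress_ratio - 1) // compress_ratio


# Assumed ratios, inferred from the figures quoted in the message.
CSA_RATIO = 4
HCA_RATIO = 128

csa_entries = max_comp_entries(8192, CSA_RATIO)
hca_entries = max_comp_entries(8192, HCA_RATIO)
print(csa_entries, hca_entries)
```

Under these assumed ratios, an HCA layer needs a buffer far smaller than the old hardcoded 65536 entries, which is where the memory saving comes from.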
The indexer silently returning None caused CSA layers to attend over only the
SWA window (128 tokens), not the compressed sparse KV. This went undetected
because the model still produced plausible output at short context. The assert
makes any future indexer regression immediately visible.
1. Indexer.load: weights at *.indexer.kv_proj not *.indexer.compressor.kv_proj
2. KVCache.comp_idx_buf: width=ihd (128) not head_dim (512); parametric via indexer_key_dim
3. Indexer.forward: stored keys are (n_comp, ihd) not (n_comp, n_ih, ihd);
einsum changed from 'tnd,cnd->tnc' to 'tnd,cd->tnc' — key shared across indexer heads
(paper's c_I = ihd = 128, one vector per compressed block)
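What the new einsum 'tnd,cd->tnc' computes, spelled out in pure Python as an illustrative sketch (the real code operates on tensors; scores_shared_key is a hypothetical name). Each compressed block c holds one key vector of width d, and every indexer head n scores against that same vector.

```python
def scores_shared_key(q, k):
    """q[t][n][d] and k[c][d] -> out[t][n][c], i.e. 'tnd,cd->tnc'.

    The key k[c] has no head axis, so it is reused for every head n.
    """
    return [
        [
            [sum(qd * kd for qd, kd in zip(q_tn, k_c)) for k_c in k]
            for q_tn in q_t
        ]
        for q_t in q
    ]


# t=1 token, n=2 heads, d=2 dims; c=3 compressed blocks.
q = [[[1, 2], [3, 4]]]
k = [[1, 0], [0, 1], [1, 1]]
out = scores_shared_key(q, k)
print(out)
```

The old form 'tnd,cnd->tnc' would instead require a separate key per head, which does not match the stored (n_comp, ihd) layout.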
Also removed probe diagnostics (COMPRESSOR BUFFERING, COMPRESSOR OUT, INDEXER SKIP,
RESHAPE FAILURE, indexer load state) — served their purpose.
- grouped_linear.py: Replace .item() gsa + Python quantize with
quantize_nvfp4_gpu_fused (zero CPU syncs). Flatten all groups
into (G*T, D), single fused kernel launch, GPU-only gsa copy.
- single_shot_inference.py: Reduce torch.cuda.synchronize() to
every 20 steps instead of every step. Gate per-layer diagnostics
to li<3 or li>=58 (avoid 61 .item() calls per decode step).
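A small sketch of the two gates in single_shot_inference.py. The layer count of 61 is taken from the "61 .item() calls" figure; whether the sync fires on step 0 or on the 20th step is not stated, so the modulo test below is an assumption, as are the helper names.

```python
NUM_LAYERS = 61   # implied by "61 .item() calls per decode step"
SYNC_EVERY = 20   # from the commit message


def should_sync(step: int) -> bool:
    """Assumed cadence: synchronize on steps 0, 20, 40, ..."""
    return step % SYNC_EVERY == 0


def should_log_layer(li: int) -> bool:
    """Diagnostics only for the first three and last three layers."""
    return li < 3 or li >= 58


logged_layers = [li for li in range(NUM_LAYERS) if should_log_layer(li)]
sync_steps = [s for s in range(100) if should_sync(s)]
print(logged_layers, sync_steps)
```

With this gate, six layers emit diagnostics per decode step instead of all 61.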
NVFP4 quantize_from_buffer raises a CUDA error on large-magnitude
inputs (|X|>500 at L60 output). The BF16 lm_head is correct and runs
only once per decode step, so it is not a bottleneck.
TODO: debug the NVFP4 path for large activations and re-enable.
The single-kernel approach used __syncthreads() for cross-CTA amax
reduction, but __syncthreads() only syncs within a CTA (same blockIdx).
CTA 0 reading s_amax[1] before CTA 1 writes = race condition = garbage gsa.
Result: residual |X| exploded to 10^37 by L0. F_attn and F_ffn were 0.0.
Fix: Two-kernel approach (correct, zero CPU syncs):
Kernel 1: amax_gsa.cu — computes gsa on GPU, returns GPU tensor
Kernel 2: quantize_nvfp4_from_buffer — reads gsa from GPU buffer
The fused_amax_quantize.cu now exports quantize_nvfp4_from_buffer and
deinterleave_quantize_from_buffer (gsa from GPU buffer, not kernel param).
Same P0 win: zero .item() syncs. Two kernel launches instead of one,
but correctness > shaving one launch.
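A sequential Python model of the race and of the two-pass fix, as a sketch only. It replays one bad interleaving rather than running real CTAs, and the unwritten slot holds 0.0 here where the GPU would hold arbitrary garbage. The data and names are illustrative.

```python
def partial_amax(block):
    """Per-CTA reduction: largest absolute value in this block's slice."""
    return max(abs(x) for x in block)


blocks = [[0.5, -1.0], [3.0, -8.0]]  # two CTAs' worth of data

# Racy single-kernel ordering: CTA 0 reads slot 1 before CTA 1 writes it.
s_amax = [0.0, 0.0]
s_amax[0] = partial_amax(blocks[0])
racy_amax = max(s_amax)              # slot 1 is still unwritten
s_amax[1] = partial_amax(blocks[1])  # arrives too late for CTA 0

# Two-pass ordering: every partial is written before anything reads them.
# On the GPU, the kernel boundary provides this ordering.
partials = [partial_amax(b) for b in blocks]
correct_amax = max(partials)
print(racy_amax, correct_amax)
```

The second kernel launch acts as the grid-wide barrier that __syncthreads() cannot provide.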
Eliminates 183 kernel launches per decoded token from pointless memcpy.
Operates on rope dims in-place via views instead of cloning the full tensor
and allocating an empty_like buffer.
The single_shot is a reference for vLLM/SGLang integration. The layer-pipeline
sharding (gpu = li % NUM_GPUS) is the right pattern for this reference.
EP/TP sharding belongs in the actual vLLM integration, not here.
Fused kernels (zero CPU sync, single kernel launch per projection):
- fused_amax_quantize.cu: amax→gsa→quantize in one pass. Replaces two-step
compute_amax_gsa_gpu + quantize_nvfp4_gpu (had .item() sync).
- fused_deinterleave_amax_quantize.cu: Same for MoE fused_swiglu L2 path.
Deinterleave + amax + quantize in one pass. Replaces compute_amax_gsa_gpu
+ deinterleave_quantize_nvfp4_cuda (had .item() sync).
All kernel loaders use dsv4/kernels/cuda/loader.py (compile-once cache).
Was JIT-compiling on every call via torch.utils.cpp_extension.load (~100ms/call,
~500 calls/token). Now compiles once and reuses the cached module.
Updated layers:
- linear.py Nvfp4Linear._run_impl: fused kernel, gsa via GPU buffer
- moe.py Nvfp4MoE._run_impl: fused for L1 and L2 (both fused_swiglu and
non-fused paths)
- shared_expert.py: fused for L1 and L2
- quantize.py: All functions use module loader cache
- sampler.py: Uses module loader cache
- indexer/score_topk.py: Uses module loader cache
P2: Vectorized KVCache.append_swa — index_copy_ instead of Python loop.
2 kernel launches instead of 2T. No .item() in comp_pos either.
P3: Pre-allocated comp_kv buffers — O(1) append instead of O(N) torch.cat.
max_comp=32768 per layer (32MB). No more quadratic memory growth.
~486 .item() syncs per decoded token → ~0 (only argmax + token decode remain).
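A sketch of the P3 idea plus a check of the 32MB figure. The size works out if each entry is 512 elements wide (the head_dim quoted elsewhere in this log) and stored at 2 bytes per element (BF16); both are assumptions about this buffer. PreallocatedRows is a hypothetical pure-Python stand-in for a pre-allocated tensor.

```python
MAX_COMP = 32768     # from the commit message
HEAD_DIM = 512       # assumption: head_dim quoted elsewhere in this log
BYTES_PER_ELEM = 2   # assumption: BF16 storage

buffer_bytes = MAX_COMP * HEAD_DIM * BYTES_PER_ELEM
buffer_mib = buffer_bytes / (1024 * 1024)


class PreallocatedRows:
    """Fixed-capacity row store: append writes one slot, never reallocates."""

    def __init__(self, capacity: int):
        self.rows = [None] * capacity
        self.length = 0

    def append(self, row) -> None:
        if self.length >= len(self.rows):
            raise IndexError("buffer full")
        self.rows[self.length] = row  # O(1), no copy of earlier rows
        self.length += 1

    def filled(self):
        """Rows written so far (a list copy here; a tensor slice would be a view)."""
        return self.rows[: self.length]


buf = PreallocatedRows(capacity=4)
buf.append([1.0, 2.0])
buf.append([3.0, 4.0])
print(buffer_mib, buf.filled())
```

Repeated torch.cat, by contrast, copies every existing row on each append, which is the O(N) cost the commit removes.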