block_reduce_sum/max write to smem[0..n_warps-1] but we passed &s_amax
(single float). For 128 threads / 4 warps, this wrote 4 floats starting
at &s_amax, corrupting adjacent shared variables (s_inv_rms, s_vals).
Fix: use s_scratch[8] array (4 for sum, 4 for max) with proper sizing.
CRITICAL: quantize must use the FP8-round-tripped block scale, not the raw
pre-FP8 value. The dequant reads the FP8 bytes back, so the quantize must
match exactly. Same pattern as quantize_nvfp4.cu. This was the root cause
of cos=0.925 (should be ~0.995).
Previous version used __shfl_down_sync for group-level amax reduction,
but shuffles operate at warp level and crossed group boundaries.
Fix: each thread independently quantizes its assigned 16-element blocks
from shared memory. Simpler and correct.
_apply_rope now uses dsv4.ops.rope_cuda (1 CUDA kernel per call)
instead of PyTorch ops (5-6 kernels per call).
Total: 183 RoPE calls × (5-1) = 732 launches saved per token.
With fallback to PyTorch if CUDA kernel fails.
cute.arch.fmin/fmax take scalar Float32, not TensorSSA.
Replace with cute.where() and arithmetic for TensorSSA compatibility.
Also changed subtile loop to unroll=1 for cute.where() compatibility.
The __call__ method passes these 3 Optional params to self.kernel(),
but kernel() didn't accept them, causing TypeError: too many positional
arguments during cute.compile(). This was the CuTeDSL 'arg-binding bug'
blocking P0/P1.
The fill_() is a CPU→GPU scalar write (tiny cost). The optimization
was marginal and the output quality regression (CJK tokens) needs
investigation separately. P2 can re-land after the regression is
confirmed to be sampling-related (not gsa-related).
P0/P1 (fused SwiGLU) still disabled — kernel arg-binding bug unfixed.
P0/P1: The fused SwiGLU kernel's warmup_fused_swiglu_compilation() triggers
'TypeError: too many positional arguments' during cute.compile(). The kernel
signature doesn't match the positional args being passed. This is a kernel-side
fix, not a single_shot fix. Disabled until the fused kernel is debugged.
P2: Landed — Nvfp4Linear skips redundant _gsa_buf.fill_() after warmup.
SE fused SwiGLU infrastructure (set_fused_swiglu, _run_l1_fused, interleaved
weight path) is wired but disabled. Will activate once kernel fix lands.
C1: --max-context CLI flag (default 8192). KVCache.max_comp computed from
(max_context + compress_ratio - 1) // ratio per layer type.
CSA at 8192 context → 2048 entries. HCA at 8192 → 64 entries.
No more hardcoded 65536 that wastes memory on HCA layers.
C2: Pre-allocated gather_buf (indexer_top_k + window_size, hd) in KVCache.
Gather writes compressed+SWA into this buffer via slice assignment.
Zero torch.cat allocations on the hot decode path.
C3: get_swa returns views (no .clone()). Ring-buffer wrap returns indexed
views. Caller copies into gather_buf so no aliasing risk.
The indexer silently returning None caused CSA layers to attend over only the
SWA window (128 tokens), not the compressed sparse KV. This went undetected
because the model still produced plausible output at short context. The assert
makes any future indexer regression immediately visible.
1. Indexer.load: weights at *.indexer.kv_proj not *.indexer.compressor.kv_proj
2. KVCache.comp_idx_buf: width=ihd (128) not head_dim (512); parametric via indexer_key_dim
3. Indexer.forward: stored keys are (n_comp, ihd) not (n_comp, n_ih, ihd);
einsum changed from 'tnd,cnd->tnc' to 'tnd,cd->tnc' — key shared across indexer heads
(paper's c_I = ihd = 128, one vector per compressed block)
Also removed probe diagnostics (COMPRESSOR BUFFERING, COMPRESSOR OUT, INDEXER SKIP,
RESHAPE FAILURE, indexer load state) — served their purpose.