Single-kernel NVFP4 block-scaled GEMM + fused sqrt(softplus) + top-k
epilogue. Avoids materializing intermediate FP32 logits to GMEM.
Architecture: 6-warp specialization
- Warp 5 (TMA): Load A, B, SFA, SFB from GMEM → SMEM
- Warp 4 (MMA): NVFP4 block-scaled GEMM → FP32 accumulator in TMEM
- Warps 0-3 (EPI): TMEM → registers → sqrt(softplus) + bias + top-k → GMEM
The epilogue maintains a per-thread min-heap across N subtiles, then
merges all 128 threads' heaps in SMEM for the final top-k selection.
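For reference, the epilogue logic can be sketched on the host in plain Python. This is a minimal sketch under assumptions, not the kernel code: the function names, the strided column partition, and the small num_threads default are all hypothetical, and the bias is assumed to be added after the activation and before selection, following the "sqrt(softplus) + bias + top-k" order above. It does not model the TMEM layout or the SMEM merge.

```python
import heapq
import math


def sqrt_softplus(x: float) -> float:
    # Numerically stable softplus: log(1 + e^x) = max(x, 0) + log1p(e^-|x|)
    return math.sqrt(max(x, 0.0) + math.log1p(math.exp(-abs(x))))


def thread_topk(logits, bias, col_ids, k):
    """Per-thread pass: keep the k best (score, column) pairs in a min-heap."""
    heap = []
    for col in col_ids:
        score = sqrt_softplus(logits[col]) + bias[col]
        if len(heap) < k:
            heapq.heappush(heap, (score, col))
        elif score > heap[0][0]:
            # New score beats the current k-th best: evict the heap minimum.
            heapq.heapreplace(heap, (score, col))
    return heap


def merge_topk(heaps, k):
    """Merge step: pool every thread's candidates and keep the global top k."""
    candidates = [item for heap in heaps for item in heap]
    return heapq.nlargest(k, candidates)


def fused_topk(logits, bias, k, num_threads=4):
    # Hypothetical partition: thread t owns columns t, t + num_threads, ...
    heaps = [
        thread_topk(logits, bias, range(t, len(logits), num_threads), k)
        for t in range(num_threads)
    ]
    return merge_topk(heaps, k)
```

Because every thread keeps its own k best candidates, the merged pool always contains the global top k, which is why the final selection only needs to look at the per-thread heaps.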
Mirrors Sm100BlockScaledPersistentDenseGemmKernel structure for
TMA/MMA/SFA/SFB handling, with custom top-k epilogue replacing
the standard SwiGLU + TMA store path.
NOTE: This is WIP and still needs compilation testing on B200. Several
API details (tiled_mma_sfb, cluster_layout_sfb_vmnk) still need to
be passed through the kernel parameters properly.
The first half of the attention output projection (wo_a) previously
used a BF16 grouped BMM (torch.bmm). It now uses the production
Nvfp4GroupedLinear, which performs the same grouped GEMM with NVFP4
tensor-core acceleration on Blackwell.
The weight is loaded from the NVFP4 checkpoint if available; otherwise
it is quantized from BF16 via set_bf16_weight().
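To illustrate what quantizing to NVFP4 involves, here is a minimal sketch of block-scaled FP4 quantization in plain Python. It is not the body of set_bf16_weight(), whose implementation is not shown here. Assumptions: the helper names are hypothetical, the per-block scale is kept as a Python float instead of being rounded to FP8 E4M3, the second-level per-tensor FP32 scale is omitted, and ties round toward the smaller magnitude rather than to nearest-even.

```python
# Magnitudes representable in E2M1 (the 4-bit element format used by NVFP4).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # NVFP4 shares one scale factor per 16 elements


def quantize_block(block):
    """Return (scale, fp4_values) for one block of up to 16 floats."""
    amax = max(abs(v) for v in block)
    # Map the block maximum onto 6.0, the largest E2M1 magnitude.
    scale = amax / 6.0 if amax > 0.0 else 1.0
    quantized = []
    for v in block:
        # Snap the scaled magnitude to the nearest grid point.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        quantized.append(-mag if v < 0 else mag)
    return scale, quantized


def quantize_row(row):
    """Split a row into 16-element blocks and quantize each one."""
    blocks = [row[i:i + BLOCK] for i in range(0, len(row), BLOCK)]
    return [quantize_block(b) for b in blocks]


def dequantize_row(packed):
    """Reconstruct floats from (scale, fp4_values) pairs."""
    return [scale * q for scale, qs in packed for q in qs]
```

Values that already sit on the scaled grid round-trip exactly; everything else picks up a per-block rounding error that depends on the block's dynamic range.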
Also includes:
- NVFP4 gate projection for router (from previous commit)
- Compressor position_bias in CUDA kernel (from earlier fix)
The dense router now uses NVFP4 GEMM via Nvfp4Linear for the gate
projection when NVFP4 scales are available in the checkpoint. This
replaces the BF16 cuBLAS GEMM with Blackwell SM100 tensor-core
NVFP4 acceleration.
Changes:
- dsv4/layers/router.py: add gate_lin (Nvfp4Linear) alongside W_gate
  fallback. New load_nvfp4_gate() method.
- dsv4/kernels/router/dense_router_decode.py: add
  dense_router_dispatch_nvfp4() using Nvfp4Linear + activation_topk
- dsv4/kernels/router/__init__.py: export new function
- single_shot_inference.py: load NVFP4 gate weights when available,
  fall back to BF16 when not
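The load-time fallback can be sketched as follows. This is an illustration only: the checkpoint is modeled as a plain dict, and the key names (weight_packed, weight_scale, weight) and the pick_gate_path helper are hypothetical; the real checkpoint layout may differ.

```python
def pick_gate_path(checkpoint, prefix="gate"):
    """Return 'nvfp4' when packed weights and scales are both present, else 'bf16'."""
    # Hypothetical key names; the real checkpoint layout may differ.
    nvfp4_keys = (f"{prefix}.weight_packed", f"{prefix}.weight_scale")
    if all(key in checkpoint for key in nvfp4_keys):
        # Would call load_nvfp4_gate() and route through
        # dense_router_dispatch_nvfp4().
        return "nvfp4"
    if f"{prefix}.weight" in checkpoint:
        # Would keep W_gate and the BF16 GEMM path.
        return "bf16"
    raise KeyError(f"no gate weights found under prefix {prefix!r}")
```

Requiring both the packed weights and the scales before choosing the NVFP4 path avoids a half-loaded gate when a checkpoint carries only one of the two.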
The compressor_reduce.cu kernel now adds position_bias to BOTH kv and
gate values, matching the PyTorch reference. Previously the kernel only
added it to gate, and a Python workaround loop was adding it to both
before the kernel call (then passing None to the kernel).
Changes:
- compressor_reduce.cu: add position_bias to kv_val in pass 2 (CSA + HCA)
- single_shot_inference.py: remove Python position_bias loop, pass
  self.ape directly to csa/hca_compress_production
- production_compress.py: already supports position_bias passthrough
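The intended semantics of the fix can be sketched as a plain-Python reference for a single compressed entry (single stream, as in HCA). Assumptions: the softmax runs over the token axis independently per channel, position_bias has the same [m, d] shape as kv and gate, and compress_entry is a hypothetical name rather than a function in the repository. kv_norm is not applied here.

```python
import math


def compress_entry(kv, gate, position_bias):
    """Reduce m tokens to one entry.

    kv, gate, position_bias: lists of m rows, each a list of d floats.
    """
    m, d = len(kv), len(kv[0])
    out = []
    for c in range(d):
        # position_bias is added to BOTH the gate and the kv values.
        g = [gate[t][c] + position_bias[t][c] for t in range(m)]
        v = [kv[t][c] + position_bias[t][c] for t in range(m)]
        g_max = max(g)  # subtract the max for a numerically stable softmax
        w = [math.exp(x - g_max) for x in g]
        total = sum(w)
        # Softmax-weighted sum over the token axis.
        out.append(sum(wi * vi for wi, vi in zip(w, v)) / total)
    return out
```

With equal gates the result is the mean of kv + position_bias, which makes the old gate-only behavior easy to spot: it would return the mean of kv alone.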
- New compressor_reduce.cu: CSA/HCA token-level softmax + weighted sum + kv_norm
  One block per compressed entry, 128 threads, FP32 accumulation
  CSA: overlapping Ca/Cb streams (2m tokens per block)
  HCA: single stream (m tokens per block)
  Includes apply_kv_norm kernel (unweighted RMSNorm + weight)
- New production_compress.py: Python wrapper for the CUDA kernels
- single_shot_inference.py: Compressor/Indexer now use production Nvfp4Linear
  for the kv_proj, gate_proj, q_b_proj, and weights_proj projections,
  followed by the CUDA reduce kernel for softmax + weighted sum.
  The PyTorch reference nvfp4_linear_ref is no longer used in the compressor/indexer path.
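As a reference for the "unweighted RMSNorm + weight" step, here is a minimal plain-Python sketch. It reuses the kernel's name for readability but is not the CUDA implementation, and the eps default is an assumption.

```python
import math


def apply_kv_norm(x, weight, eps=1e-6):
    """RMS-normalize x, then scale elementwise by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps value is an assumption
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

Comparing the kernel output against a reference like this in FP32 is a quick way to check the accumulation path, since the normalized vector should have a mean square close to 1 before the weight is applied.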