- New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu
- Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4
- Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3
- TMEM read using 16x256b.x1 (4-warps parallel, proven from B1 FMHA)
- On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k
- No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores
- Per-thread register heap top-k (same algorithm as indexer_score_topk.cu)
- Modified: single_shot_inference.py
- Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16)
- Consumes FP8 indexer keys from cache without BF16 dequantization
- Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode)
- FP32 einsum fallback retained only for T>1 (prefill)
- Removed 'Intentional first-pass limits' section from B1 doc
(those limits ARE the correct production design, not shortcuts)