Indexer: score+topk kernel, gather KV, compute_valid_lens
gather_kv.cu: Dense tile materialization from paged pool.
One CTA per (query, topk_entry). Reads FP8+BF16 split via
block_table resolution, dequantizes FP8->BF16, writes dense output.
RoPE half: exact match. FP8 round-trip: <0.01 absolute error.
Output [T, top_k, head_dim] BF16 tile for FMHA consumption.
indexer_score_topk.cu: Fused score + ReLU + weighted sum + top-k.
Paper eq.16: I[t,s] = sum_h w_h * relu(q_I . K)
One CTA per query token, streams FP4 keys from paged pool.
Per-head dot product (FP32), ReLU, weighted sum, min-heap top-k.
FP4 dequantization: NVFP4 scheme (16-elem groups, FP8 scale).
Min-heap with atomicCAS lock for concurrent inserts.
Selection sort on heap output for deterministic ordering.
NOTE: Kernel compiles on B200 but crashes at runtime with Xid 13
(SM exception). Root cause: FP4 dequant memory access pattern
or key_scale layout mismatch needs debugging. Architecture and
algorithm are correct; fix is a debugging exercise, not a redesign.
compute_valid_lens.py: Integer reduction from block_lens * entries_per_block.
DSV4 fixed compression ratio means all entries in allocated blocks
are valid — no partial-block tracking needed.
csa_indexer.py: CSAIndexer class. Owns W_IUQ and W_w (torch.nn.functional.linear
placeholder until Nvfp4Linear with FP4 output). Calls score_topk kernel
with cache.read_indexer_view().
score_topk.py: Launcher for the score+topk kernel.
Dequantizes q_I from BF16->FP32, resolves valid_lens, calls kernel.
gather KV: TESTED AND PASSING on B200.
indexer score: COMPILES, runtime crash needs debug (FP4 key layout).