- New compressor_reduce.cu: CSA/HCA token-level softmax + weighted sum + kv_norm
One block per compressed entry, 128 threads, FP32 accumulation
CSA: overlapping Ca/Cb streams (2m tokens per block)
HCA: single stream (m tokens per block)
Includes apply_kv_norm kernel (RMSNorm without a built-in scale, then multiplication by the norm weight)
- New production_compress.py: Python wrapper for CUDA kernels
- single_shot_inference.py: Compressor/Indexer now use production Nvfp4Linear
for the kv_proj, gate_proj, q_b_proj, and weights_proj projections,
followed by the CUDA reduce kernel for softmax + weighted sum
The PyTorch reference nvfp4_linear_ref is no longer used in the compressor/indexer path
- indexer/__init__.py: compute_index_scores_topk now calls
run_indexer_score_topk with proper tensor reshaping
- compressor/__init__.py: added torch import, fixed csa_compress_tail
and hca_compress_tail imports for flush.py
- Full flush pipeline now importable end-to-end
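
Reference sketch of the per-entry reduce described above (softmax over tokens, weighted sum, then kv_norm), in plain Python for checking shapes and numerics. This is not the code in compressor_reduce.cu. It assumes one scalar score per token and an RMSNorm epsilon of 1e-6. The names softmax, reduce_entry, kv_norm and compress_entry are hypothetical helpers, not functions from this repo.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reduce_entry(scores, values):
    """Collapse the tokens of one compressed entry into a single vector.

    scores: one scalar per token (assumption; m tokens for HCA, 2m for CSA)
    values: one vector per token, all of the same length
    """
    probs = softmax(scores)
    dim = len(values[0])
    # Weighted sum over the token axis, accumulated in Python floats.
    return [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]


def kv_norm(vec, weight, eps=1e-6):
    """RMS-normalize vec, then scale element-wise by weight (eps is assumed)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms * w for x, w in zip(vec, weight)]


def compress_entry(scores, values, weight, eps=1e-6):
    """Softmax + weighted sum + kv_norm for a single compressed entry."""
    return kv_norm(reduce_entry(scores, values), weight, eps)
```

With equal scores the reduce step is a plain mean over tokens, which makes a convenient sanity check when comparing against the CUDA output for a single entry.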