CRITICAL BUG: The old kernel had __syncthreads() and a spinlock INSIDE
the strided loop over num_valid entries. When num_valid % n_threads != 0
(i.e. essentially always at production context lengths), threads that
exit the loop early deadlock on the barrier while others wait forever.
Fix: per-thread local top-k in registers (LOCAL_K=8), block-level merge
after the loop completes. No in-loop barriers, no spinlocks.
Architecture:
- Each thread maintains a private min-heap of LOCAL_K best scores
- After the strided loop (no __syncthreads inside), threads write their
local top-k to shared memory
- Thread 0 builds the final top-k from all n_threads*LOCAL_K candidates
- For top_k=1024, n_threads=128, LOCAL_K=8: 1024 candidates = exact merge
- SMEM budget: w_h + merge heap + per-thread staging = ~30KB (well under 232KB)
Also updated the copy in dsv4/kernels/cuda/ (the one actually loaded
by the Python bridge).
Future optimization (separate from this fix):
- The dot products are scalar FP32 per thread. At 1M context this is slow.
Production path should use FP4 tcgen05 MMA (Stage F).
- The block-level merge is single-threaded. Could use warp-reduce or
bitonic sort for top_k > 256.