Production fused CUDA sampler + decode loop optimizations
- Add dsv4/kernels/cuda/sampler.cu: fused temperature, repetition penalty,
  top-k, and top-p (nucleus) sampling in a single kernel launch with no
  CPU syncs
- Add dsv4/model/sampler.py: CUDASampler wrapper + PyTorch reference
- Update single_shot_inference.py:
  - Use CUDASampler for non-greedy decoding (temperature=0.6, top_k=50,
    top_p=0.95)
  - Pre-allocate decode buffers (no per-step torch.tensor allocation)
  - Track thinking tokens (128821/128822); for a reasoning model these are
    expected output, not garbage
  - Reduce diagnostic CPU syncs (top-5 every 5 steps, NaN check every
    20 steps)
  - Add --top-k and --top-p CLI args
  - Change defaults: temperature=0.6 (was 0.0, greedy), rep_penalty=1.1
    (was 1.2)
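
For reviewers, the sampling pipeline that the fused kernel collapses into one launch can be sketched in plain Python. This is an illustration only, not the code in dsv4/model/sampler.py: the function name sample_token, the order of the stages, and the CTRL-style repetition penalty (divide positive logits, multiply negative ones) are all assumptions.

```python
import math
import random


def sample_token(logits, prev_tokens, temperature=0.6, top_k=50, top_p=0.95,
                 rep_penalty=1.1, rng=random):
    """Pure-Python sketch of one sampling step; returns a token id."""
    scores = list(logits)

    # Repetition penalty (assumed CTRL-style): shrink positive logits and
    # push negative logits further down for tokens already generated.
    for tok in set(prev_tokens):
        if scores[tok] > 0:
            scores[tok] /= rep_penalty
        else:
            scores[tok] *= rep_penalty

    # Treat temperature <= 0 as greedy decoding (assumption).
    if temperature <= 0.0:
        return max(range(len(scores)), key=scores.__getitem__)
    scores = [s / temperature for s in scores]

    # Top-k: keep the k highest-scoring token ids, sorted descending.
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scores[order[0]]
    weights = [math.exp(scores[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    cumulative, cutoff = 0.0, len(order)
    for n, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= top_p:
            cutoff = n
            break
    order, probs = order[:cutoff], probs[:cutoff]

    # random.choices normalises the weights, so no explicit renormalisation.
    return rng.choices(order, weights=probs, k=1)[0]
```

Passing rng=random.Random(seed) makes a run reproducible, which is convenient when comparing a reference like this against kernel output on small vocabularies.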