Verifies that the CUDA kernel output matches the PyTorch reference, both with and without position_bias, on both the CSA (m=4) and HCA (m=128) paths.
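
The four cases this covers (two values of m, each with and without position_bias) can be laid out as a small test matrix. The sketch below is only an illustration of that shape under stated assumptions: `reference_scores`, `kernel_scores`, and `run_case` are hypothetical stand-ins written in plain Python, not the actual kernel or reference, and m is treated as a plain size parameter for the toy inputs, which may not match its meaning in the real code. In a real test, a call such as `torch.testing.assert_close` would likely take the place of the `math.isclose` check.

```python
import itertools
import math
import random


def reference_scores(q, k, bias=None):
    """Hypothetical reference: dot-product score per key, plus optional additive bias."""
    scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in k]
    if bias is not None:
        scores = [s + b for s, b in zip(scores, bias)]
    return scores


def kernel_scores(q, k, bias=None):
    """Hypothetical stand-in for the kernel under test.

    It computes the same quantity but accumulates in reverse order, so the
    results differ only by floating-point rounding.
    """
    scores = []
    for j, key in enumerate(k):
        acc = 0.0
        for qi, ki in zip(reversed(q), reversed(key)):
            acc += qi * ki
        scores.append(acc + (bias[j] if bias is not None else 0.0))
    return scores


def run_case(m, use_bias, dim=8, seed=0):
    """Build toy inputs of size m and compare the stand-in against the reference."""
    rng = random.Random(seed)
    q = [rng.uniform(-1, 1) for _ in range(dim)]
    k = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(m)]
    bias = [rng.uniform(-1, 1) for _ in range(m)] if use_bias else None
    got = kernel_scores(q, k, bias)
    want = reference_scores(q, k, bias)
    return all(
        math.isclose(g, w, rel_tol=1e-9, abs_tol=1e-9) for g, w in zip(got, want)
    )


# Test matrix: m=4 and m=128, each without and with a position bias.
CASES = list(itertools.product((4, 128), (False, True)))
results = {case: run_case(*case) for case in CASES}
```

Keeping the matrix as an explicit product of the two axes makes it easy to see that no combination is skipped, and to extend it if another path or option is added.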