E1-E4: gather kernels, handle wiring, rope, sync removal, e2e test
E1: LayerCacheHandle now exposes gather_compressed_kv,
gather_all_compressed_kv, gather_swa_kv, num_query_heads, head_dim.
Gather kernels in dsv4/kernels/cuda/gather_swa.cu + gather_kv.cu.
Python wrapper in dsv4/kernels/cache/gather.py.
E2: tests/e2e/test_one_layer.py — SWA path smoke test.
E3: Compressor/indexer __init__.py bridges (NotImplementedError stubs
for CSA/HCA compress_and_store, compute_index_scores_topk).
E4: Removed torch.cuda.synchronize() from fmha_multitile_op.py fast path.
Error checking via C API return code instead.
Also: forward_rope_partial in ops/rope.py (GPT-J interleaved, last 64 dims).