The SWA KV cache uses fp8_ds_mla packed layout (37376 bytes per slot, not 512). Our naive FP8 quant + write had a shape mismatch. Fix: skip the SWA cache write entirely. The compressor (Triton) handles the compressed cache. For full SDPA attention, we use the raw kv tensor directly — we don't need the paged cache at all during prefill.