Both segments (compressed+SWA with n_comp=96, and SWA-only with n_comp=0) pass individually at cos 0.999996. The Python KV merge produces the correct combined attention at cos 0.999996. Key: n_comp is compile-time, so separate kernel instances are needed for segments with different n_comp values. Production code would use a kernel cache keyed on (n_comp, apply_sink_bias, ...).