Key fixes:
1. GMEM tile coord: manual Int32 kv_coord (not kvh.count)
2. K GMEM slice: (None,None,0,0) keeps mode 1 free (GMEM iter)
3. V GMEM slice: (None,0,None,0) keeps mode 2 free (GMEM iter)
4. Add fence_view_async_tmem_load before O rescale for visibility
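
The two slice coordinates in fixes 2 and 3 differ only in which mode stays free for the GMEM iteration. The sketch below is a plain-Python model of that rule, not CuTe DSL code: a `None` entry in the coordinate keeps the mode, an integer fixes it. The helper names and the example shapes are hypothetical and chosen only for illustration.

```python
def free_modes(coord):
    """Return the indices of modes left free (None) by a slice coordinate."""
    return tuple(i for i, c in enumerate(coord) if c is None)


def sliced_shape(shape, coord):
    """Return the shape that remains after fixing every non-None mode."""
    if len(shape) != len(coord):
        raise ValueError("shape and coord must have the same rank")
    return tuple(extent for extent, c in zip(shape, coord) if c is None)


# Hypothetical partitioned shapes; the real extents depend on the kernel.
K_SHAPE = (128, 8, 4, 2)  # mode 1 assumed to be the GMEM iteration mode for K
V_SHAPE = (128, 4, 8, 2)  # mode 2 assumed to be the GMEM iteration mode for V

K_COORD = (None, None, 0, 0)
V_COORD = (None, 0, None, 0)

k_free = free_modes(K_COORD)            # modes 0 and 1 stay free
v_free = free_modes(V_COORD)            # modes 0 and 2 stay free
k_view = sliced_shape(K_SHAPE, K_COORD)
v_view = sliced_shape(V_SHAPE, V_COORD)
```

Under these assumed shapes both views end up rank 2, with the iteration extent as the trailing mode, so the same loop over `kv_coord` can index either one.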