TMA store uses cp.async.bulk.tensor.2d.global.shared::cta.tile.bulk_group (bulk_group completion), NOT mbarrier::complete_tx::bytes.

Completion is tracked via:
- cp.async.bulk.commit_group (after issuing stores)
- cp.async.bulk.wait_group.read 0 (wait until no committed group is still reading from SMEM)

Removed sMbarStore from SMEM allocations (no longer needed, since the store path does not use an mbarrier).
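
A minimal sketch of the store path described above, written as inline PTX wrappers. This is not the actual kernel from this change: the helper names (tma_store_fence, tma_store_2d, tma_store_commit, tma_store_wait_read_all), the kernel store_tile, and the 64x64 float tile are hypothetical. It assumes an sm_90 target and a CUtensorMap already built on the host (for example with cuTensorMapEncodeTiled) whose box matches the tile shape.

```cuda
#include <cuda.h>     // CUtensorMap
#include <cstdint>

// Make earlier generic-proxy writes to SMEM visible to the async proxy (TMA).
__device__ __forceinline__ void tma_store_fence() {
    asm volatile("fence.proxy.async.shared::cta;" ::: "memory");
}

// Issue one 2D TMA store: SMEM tile -> global tensor at coordinates (c0, c1).
// c0 is the innermost (fastest-varying) coordinate.
__device__ __forceinline__ void tma_store_2d(const CUtensorMap* tmap,
                                             const void* smem_src,
                                             int32_t c0, int32_t c1) {
    uint64_t desc = reinterpret_cast<uint64_t>(tmap);
    uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_src));
    asm volatile(
        "cp.async.bulk.tensor.2d.global.shared::cta.tile.bulk_group"
        " [%0, {%2, %3}], [%1];"
        :
        : "l"(desc), "r"(smem), "r"(c0), "r"(c1)
        : "memory");
}

// Close the current bulk async-group (all stores issued since the last commit).
__device__ __forceinline__ void tma_store_commit() {
    asm volatile("cp.async.bulk.commit_group;" ::: "memory");
}

// Block until no committed group is still reading its SMEM source.
__device__ __forceinline__ void tma_store_wait_read_all() {
    asm volatile("cp.async.bulk.wait_group.read 0;" ::: "memory");
}

// Hypothetical tile shape; must match the box dimensions of the tensor map.
constexpr int TILE_M = 64;
constexpr int TILE_N = 64;

__global__ void store_tile(const __grid_constant__ CUtensorMap tmap) {
    __shared__ alignas(128) float sTile[TILE_M][TILE_N];

    // Placeholder for the real epilogue: fill the tile with generic stores.
    for (int i = threadIdx.x; i < TILE_M * TILE_N; i += blockDim.x) {
        sTile[i / TILE_N][i % TILE_N] = static_cast<float>(i);
    }

    tma_store_fence();   // each writing thread fences its own SMEM writes
    __syncthreads();     // then the whole tile is complete

    if (threadIdx.x == 0) {   // a single thread issues the bulk store
        tma_store_2d(&tmap, sTile, blockIdx.x * TILE_N, blockIdx.y * TILE_M);
        tma_store_commit();
        tma_store_wait_read_all();   // sTile may be overwritten after this
    }
}
```

Note that the .read form only guarantees the SMEM source has been read, which is enough to reuse the buffer. If a later step needs the global writes themselves to have completed, cp.async.bulk.wait_group without .read is the stricter form.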