ROOT CAUSE of TMEM hang: tcgen05.fence.cta_group::1.sync.aligned is
NOT a valid PTX instruction. The correct TMEM ordering primitives are:
- tcgen05.wait::st.sync.aligned (wait for TMEM stores to complete)
- tcgen05.wait::ld.sync.aligned (wait for TMEM loads to complete)
Both appear in cutlass/arch/barrier.h, in fence_view_async_tmem_store
and fence_view_async_tmem_load respectively.
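
A minimal sketch of how the two wait instructions could be wrapped as
inline PTX. The wrapper names and the TMEM_SKETCH_HAS_TCGEN05 macro are
hypothetical (not CUTLASS or CUDA API). The macro is assumed to be
defined by the build only when compiling for a target that supports
tcgen05 (e.g. sm_100a).

```cuda
// Sketch only: tmem_wait_st, tmem_wait_ld and TMEM_SKETCH_HAS_TCGEN05
// are hypothetical names chosen for illustration.

// Block the calling thread until its previously issued TMEM stores
// (tcgen05.st) have completed.
__device__ __forceinline__ void tmem_wait_st() {
#if defined(TMEM_SKETCH_HAS_TCGEN05)
  asm volatile("tcgen05.wait::st.sync.aligned;" ::: "memory");
#endif
}

// Block the calling thread until its previously issued TMEM loads
// (tcgen05.ld) have completed.
__device__ __forceinline__ void tmem_wait_ld() {
#if defined(TMEM_SKETCH_HAS_TCGEN05)
  asm volatile("tcgen05.wait::ld.sync.aligned;" ::: "memory");
#endif
}
```

Because of the .aligned qualifier, every thread in the warp has to reach
the same wait instruction, so these wrappers should not be called from
divergent code.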