MMA must wait until softmax has written P to TMEM before it starts the PV matmul. Without this wait, MMA can read stale P data from TMEM, causing a deadlock. softmax_done_bar enforces the ordering: the softmax warps arrive on it after the P store, and MMA waits on it before issuing PV.
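The arrive/wait ordering described above can be modeled on the host with plain C++ threads. This is only an illustrative sketch, not the kernel code: ArriveWaitBarrier, run_tile, kSoftmaxWarps, the p_tmem array, and all data values are hypothetical stand-ins, and a mutex plus condition variable takes the place of the real hardware barrier.

```cpp
#include <array>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical host-side stand-in for an arrive/wait barrier.
// It is not a CUDA API; it only models "N producers arrive, one consumer waits".
class ArriveWaitBarrier {
public:
    explicit ArriveWaitBarrier(int expected) : pending_(expected) {
        assert(expected > 0);
    }
    // Called by each producer once its data is written.
    void arrive() {
        std::lock_guard<std::mutex> lock(m_);
        if (--pending_ == 0) cv_.notify_all();
    }
    // Blocks the consumer until every expected arrival has happened.
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return pending_ == 0; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int pending_;
};

constexpr int kSoftmaxWarps = 4;  // assumed count, for illustration only

// Models one tile: softmax "warps" store P, then the "MMA" thread computes PV.
float run_tile() {
    std::array<float, kSoftmaxWarps> p_tmem;
    p_tmem.fill(-1.0f);  // stale contents left over from a previous tile
    const std::array<float, kSoftmaxWarps> v = {1.0f, 2.0f, 3.0f, 4.0f};

    ArriveWaitBarrier softmax_done_bar(kSoftmaxWarps);
    float pv = 0.0f;

    // MMA side: wait on the barrier before reading P.
    std::thread mma([&] {
        softmax_done_bar.wait();
        for (int i = 0; i < kSoftmaxWarps; ++i) pv += p_tmem[i] * v[i];
    });

    // Softmax side: store P first, then arrive on the barrier.
    std::vector<std::thread> softmax;
    for (int w = 0; w < kSoftmaxWarps; ++w) {
        softmax.emplace_back([&, w] {
            p_tmem[w] = 0.25f;          // P store
            softmax_done_bar.arrive();  // signal after the store
        });
    }

    for (auto& t : softmax) t.join();
    mma.join();
    return pv;
}
```

In this model, calling run_tile() returns the dot product of the freshly written P values with V. Removing the wait() call would let the MMA thread read the -1.0f placeholders, which is the host-side analogue of reading stale P.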