Key changes from Mike:
1. Combined K+V TMA barrier: one acquire per kt, and both cute.copy calls
share kvh.barrier. kvh.count is then naturally == kt (no interleaving
problem). tx_count = K_bytes + V_bytes. Also fixes the sK[0]/sV[1] slot quirk.
2. final_o_bar NamedBarrier: MMA .arrive() after acc_pipe.producer_tail;
softmax .arrive_and_wait() before reading O for normalize. Prevents
softmax racing MMA's PV[N-1] on the final O read.
3. acc_pipe producer in MMA: producer_acquire before loop, commit+advance
after loop, producer_tail after. Consumer in epilogue as before.
4. O rescale re-enabled for kt>0: O is scaled by acc_scale before
softmax_done_bar.
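
For reference, a host-side sketch of why the rescale in item 4 matters. This
is plain Python, not the kernel code: it assumes the standard online-softmax
formulation where acc_scale = exp(m_old - m_new) (the kernel may well use an
exp2 variant), and the function names, the `tile` parameter, and the
`rescale` flag are hypothetical, introduced only for illustration.

```python
import math


def attend_row_tiled(scores, values, tile, rescale=True):
    """Online-softmax attention for one query row, one K/V tile at a time.

    Assumption: acc_scale = exp(m_old - m_new), the usual online-softmax
    correction. `rescale=False` mimics running with the O rescale disabled.
    """
    m = -math.inf  # running row max
    l = 0.0        # running softmax denominator
    o = 0.0        # unnormalized output accumulator ("O" in the notes)
    for kt, start in enumerate(range(0, len(scores), tile)):
        s = scores[start:start + tile]
        v = values[start:start + tile]
        m_new = max(m, max(s))
        # For kt == 0 the accumulators are still zero, so nothing to rescale.
        if rescale and kt > 0:
            acc_scale = math.exp(m - m_new)
            o *= acc_scale
            l *= acc_scale
        p = [math.exp(x - m_new) for x in s]
        l += sum(p)
        o += sum(pi * vi for pi, vi in zip(p, v))  # the PV[kt] contribution
        m = m_new
    return o / l  # final normalize, done once after the last tile


def attend_row_reference(scores, values):
    """Untiled softmax(scores) . values, for comparison."""
    m = max(scores)
    p = [math.exp(x - m) for x in scores]
    return sum(pi * vi for pi, vi in zip(p, values)) / sum(p)


if __name__ == "__main__":
    scores = [0.5, 2.0, -1.0, 3.5, 0.1, 1.2]
    values = [1.0, -2.0, 0.5, 4.0, 3.0, -1.0]
    print("reference :", attend_row_reference(scores, values))
    print("rescaled  :", attend_row_tiled(scores, values, tile=2))
    print("no rescale:", attend_row_tiled(scores, values, tile=2, rescale=False))
```

With rescale enabled the tiled result matches the untiled reference up to
rounding. With it disabled the result drifts as soon as a later tile raises
the row max, since earlier tiles were accumulated against a stale max.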