The C9 TMEM round-trip IS modifying O (confirmed by epilogue * 2.0 test).
But inv_row_sum is wrong: each thread computes row_sum via .reduce(MAX) and
packed f32x2 reduction, but the result appears to be the same for all threads.
Next: need to dump the QK C-fragment coordinate tensor to understand
which rows each thread actually owns in the TMEM load partition.