Removed all dead code left over from the first (broken) attention-loop approach. The pipeline is now clean: SMEM attention → TMEM write → TMEM read → normalize → GMEM. Also renamed sPvBuf to sO for clarity, matching the reference kernel.