- Fix LSE output: all 128 rows now write their LSE (mLSE[sfw_idx, 0, 0]);
previously only row 0 was written (mLSE[0])
- Each softmax thread (sfw_idx 0..127) independently writes its LSE
- This enables an accurate Python-side KV merge across multiple KV tiles
- New test: test_d5b_perrow_lse.py with LSE verification + KV merge
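
For reviewers, a minimal NumPy sketch of the kind of merge that per-row LSE makes possible. It is not the code in test_d5b_perrow_lse.py; the names `attend` and `merge_kv_tiles` are hypothetical, and it assumes a natural-log LSE, no softmax scale, and per-tile outputs that are already normalized within their own tile.

```python
import numpy as np

def attend(q, k, v):
    """Reference attention over one KV tile; returns (output, per-row LSE)."""
    s = q @ k.T                                  # (rows, kv) raw scores, no scale (assumption)
    m = s.max(axis=1, keepdims=True)             # per-row max for numerical stability
    p = np.exp(s - m)
    l = p.sum(axis=1, keepdims=True)
    return (p / l) @ v, (m + np.log(l))[:, 0]    # natural-log LSE (assumption)

def merge_kv_tiles(outs, lses):
    """Combine per-tile outputs, weighting each row by its own per-tile LSE."""
    lses = np.stack(lses)                        # (tiles, rows)
    lse = np.logaddexp.reduce(lses, axis=0)      # combined LSE per row, shape (rows,)
    w = np.exp(lses - lse)                       # per-tile, per-row weights summing to 1
    out = sum(wi[:, None] * oi for wi, oi in zip(w, outs))
    return out, lse
```

Because the weights are computed per row, the merge needs one LSE per query row. Broadcasting row 0's LSE to all 128 rows gives the wrong weights for every row whose LSE differs from row 0's.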