nvfp4-megamoe-kernel/dsv4
biondizzle 6c73069cb9 D5b: Per-row LSE output + Python KV merge test
- Fix LSE output: all 128 rows now write their LSE (mLSE[sfw_idx, 0, 0])
  instead of only row 0 writing it (mLSE[0])
- Each softmax thread (sfw_idx 0..127) independently writes its LSE
- This enables an accurate Python-side KV merge in the multi-KV-tile case
- New test: test_d5b_perrow_lse.py with LSE verification + KV merge
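The per-row LSE matters because each query row's merge weights depend only on that row's own LSE values. A single row-0 LSE cannot weight the partial outputs of the other 127 rows correctly. Below is a minimal pure-Python sketch of the standard LSE-based merge of partial attention outputs (the combine step used in split-KV attention). It assumes that each KV tile yields a tile-normalized output vector plus a natural-log LSE per row. The function names (tile_attention, merge_kv_tiles, max_merge_error), tile sizes and test data are hypothetical and are not taken from test_d5b_perrow_lse.py. The kernel's actual LSE base (natural log vs log2) and the mLSE layout are not stated here.

```python
import math


def tile_attention(q_row, keys, values, scale):
    """Softmax attention of one query row over one KV tile.

    Returns the tile-normalized output vector and this row's LSE for the tile
    (natural log; the kernel's convention is an assumption).
    """
    scores = [scale * sum(a * b for a, b in zip(q_row, k)) for k in keys]
    m = max(scores)                              # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(len(values[0]))]
    return out, m + math.log(denom)              # LSE = log(sum(exp(scores)))


def merge_kv_tiles(partials):
    """Merge (out, lse) pairs for ONE row across KV tiles using that row's own LSEs."""
    m = max(lse for _, lse in partials)
    total_lse = m + math.log(sum(math.exp(lse - m) for _, lse in partials))
    merged = [0.0] * len(partials[0][0])
    for out, lse in partials:
        w = math.exp(lse - total_lse)            # this tile's share of the row's softmax mass
        for d, x in enumerate(out):
            merged[d] += w * x
    return merged, total_lse


def max_merge_error(rows=4, n_kv=8, dim=4, tile=4):
    """Compare a per-row multi-tile merge against single-pass attention.

    Returns the largest absolute difference over all outputs and LSEs.
    """
    scale = 1.0 / math.sqrt(dim)
    q = [[math.sin(r + 0.3 * d) for d in range(dim)] for r in range(rows)]
    k = [[math.cos(0.7 * j + d) for d in range(dim)] for j in range(n_kv)]
    v = [[0.1 * (j + 1) - 0.05 * d for d in range(dim)] for j in range(n_kv)]
    err = 0.0
    for r in range(rows):
        full, full_lse = tile_attention(q[r], k, v, scale)
        # Each row keeps its own per-tile LSE, mirroring mLSE[sfw_idx, 0, 0].
        partials = [tile_attention(q[r], k[s:s + tile], v[s:s + tile], scale)
                    for s in range(0, n_kv, tile)]
        merged, lse = merge_kv_tiles(partials)
        err = max(err, abs(lse - full_lse),
                  *(abs(a - b) for a, b in zip(merged, full)))
    return err


print(max_merge_error())
```

The merge rescales each tile's output by exp(lse_tile - lse_total). That works because a tile-normalized output is sum_j exp(s_j - lse_tile) * v_j, so the rescaled tiles sum exactly to the single-pass result.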
2026-05-26 10:57:54 +00:00