nvfp4-megamoe-kernel/tests
biondizzle 73f38acf74 STAGE B BUG 4b FIXED: TMEM P/O overlap + FMHA V reconstruction
Root cause: the PV output O started at TMEM column 64 (the offset returned by
find_tmem_tensor_col_offset), overlapping P at columns [32,96). The PV MMA reads P
as its A operand while writing O, so writes into the shared columns corrupted the
A operand mid-computation.

For (128,128) PV, O started at 128 (no overlap) so it worked by accident.
For (128,64) PV, O started at 64, overlapping P [32,96) -> NaN/garbage.

Fix: Place O at column 128 (after both S [0,128) and P [32,96)).
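
The overlap can be checked with plain half-open interval arithmetic. This is a minimal sketch, not code from the kernel: the helper names are hypothetical, and it assumes O for a (128,64) PV occupies 64 TMEM columns and O for a (128,128) PV occupies 128, which the commit message does not state.

```python
def overlaps(a, b):
    """True if half-open column ranges [a0, a1) and [b0, b1) share a column."""
    return a[0] < b[1] and b[0] < a[1]


def o_range(start_col, width):
    """Column range occupied by O (hypothetical helper, width is an assumption)."""
    return (start_col, start_col + width)


S = (0, 128)   # S columns, from the commit message
P = (32, 96)   # P columns, from the commit message

# Before the fix: (128,64) PV, O placed at column 64 -> collides with P.
o_before = o_range(64, 64)
# After the fix: O placed at column 128 -> clear of both S and P.
o_after = o_range(128, 64)
# The (128,128) PV case already started at 128, so it never collided.
o_128 = o_range(128, 128)

for name, rng in [("before", o_before), ("after", o_after), ("128x128", o_128)]:
    print(name, rng, "overlaps P:", overlaps(rng, P), "overlaps S:", overlaps(rng, S))
```

A check like this could run as an assertion wherever TMEM offsets are assigned, so an overlapping placement fails loudly instead of producing NaN.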
Also added FMHA-style V reconstruction: V is rebuilt as a logical tensor of shape
(HEAD_DIM, s_k, 1) with stride (1, hd, hd*s_k) instead of passing the DLPack V
directly to TMA.
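
To illustrate what that shape and stride describe, here is a sketch in NumPy rather than the kernel's own tensor API. It assumes hd means HEAD_DIM, that strides are in elements, and that V is stored row-major as (s_k, HEAD_DIM); the sizes are arbitrary.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

S_K, HEAD_DIM = 8, 4  # small illustrative sizes, not taken from the test

# V as stored in memory: row-major (s_k, HEAD_DIM).
V = np.arange(S_K * HEAD_DIM, dtype=np.float32).reshape(S_K, HEAD_DIM)

# Logical view: shape (HEAD_DIM, s_k, 1), element strides (1, hd, hd*s_k).
# NumPy strides are in bytes, so scale by the item size.
item = V.itemsize
view = as_strided(
    V,
    shape=(HEAD_DIM, S_K, 1),
    strides=(1 * item, HEAD_DIM * item, HEAD_DIM * S_K * item),
    writeable=False,
)

# Element (d, k, 0) of the view reads V[k, d]: same memory, HEAD_DIM-major view.
assert np.array_equal(view[:, :, 0], V.T)
print(view.shape)
```

Under these assumptions the reconstruction copies nothing. It only re-describes the same buffer with HEAD_DIM as the fastest-varying mode and a trailing unit mode.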

test_fmha_v3.py: (128,64) PV with random V -> cosine 0.999999 PASS
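
A cosine check of this kind can be sketched as follows. This is not the content of test_fmha_v3.py: the s_k value, the 0.999 threshold, and the float32 matmul standing in for the kernel output are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two tensors, flattened to vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
M, S_K, HEAD_DIM = 128, 128, 64  # (128,64) PV; S_K is a hypothetical choice

# Row-normalised P (softmax-like) and random V.
P = rng.random((M, S_K))
P /= P.sum(axis=1, keepdims=True)
V = rng.standard_normal((S_K, HEAD_DIM))

reference = P @ V  # float64 reference
# Stand-in for the kernel output: the same product in lower precision.
candidate = P.astype(np.float32) @ V.astype(np.float32)

cos = cosine_similarity(candidate, reference)
assert cos > 0.999, f"cosine {cos} below threshold"
```

Cosine similarity tolerates small uniform scaling errors, so a check like this is often paired with a max-abs-error bound.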
2026-05-21 15:30:24 +00:00