Files
nvfp4-megamoe-kernel/dsv4
biondizzle 223c22488f Simplify prefill PV read: use decode kernel's exact pattern
Replace complex n_sub-iterating read with the same HD/8 iteration
pattern as the proven decode kernel. Extract from lane qr%32 instead
of always lane 0. For qr>=32, use warp 1; for qr>=64, add TMEM offset.

This should fix the row 1 accuracy issue (was cos=0.94 vs decode).
2026-06-03 03:22:49 +00:00
..