nvfp4-megamoe-kernel

Files

biondizzle bb3ad3d2ef BREAKTHROUGH: cosine 0.993 for n=128! PV-partitioned P row sum works.

C9 fix: instead of using QK-partitioned row_sum (which maps to wrong PV rows),
read P from TMEM using PV partition and sum via .reduce(ADD).

QK: thread N owns row N//4, PV: thread N owns row N.
Reading P via PV partition gives each thread its correct row P values.

n=128: cosine 0.993 (was 0.514)
n=256: cosine 0.725 (C6 still broken for multi-tile)
n=384: cosine 0.676 (same C6 issue)

Remaining: C6 O-rescale for multi-tile needs same PV-partitioned fix.
Small accuracy gap (0.993 vs 0.999) likely from BF16 P store/load round-trip.

2026-05-21 20:13:51 +00:00