nvfp4-megamoe-kernel

Files

biondizzle 223c22488f Simplify prefill PV read: use decode kernel's exact pattern

Replace complex n_sub-iterating read with the same HD/8 iteration
pattern as the proven decode kernel. Extract from lane qr%32 instead
of always lane 0. For qr>=32, use warp 1; for qr>=64, add TMEM offset.
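
The row-to-lane mapping described above can be sketched on the host side as follows. This is a literal reading of the commit message only, not the kernel's code: the names `ReadSource` and `pv_read_source` are hypothetical, the TMEM offset is represented only as a flag because its value is not stated here, and the real kernel may choose the warp differently for `qr >= 64`.

```python
from typing import NamedTuple

WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


class ReadSource(NamedTuple):
    """Hypothetical container; not a type from the repository."""
    warp: int                # warp that holds the row (0 or 1)
    lane: int                # lane within that warp to extract from
    needs_tmem_offset: bool  # True when an extra TMEM offset applies


def pv_read_source(qr: int) -> ReadSource:
    """Map query row `qr` to the read source named in the commit message."""
    if qr < 0:
        raise ValueError("qr must be non-negative")
    lane = qr % WARP_SIZE           # "extract from lane qr%32"
    warp = 1 if qr >= 32 else 0     # "for qr>=32, use warp 1"
    needs_tmem_offset = qr >= 64    # "for qr>=64, add TMEM offset"
    return ReadSource(warp, lane, needs_tmem_offset)
```

For example, under these assumptions `pv_read_source(1)` selects warp 0, lane 1 with no offset, which is the row the accuracy issue was reported on.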

This should fix the row 1 accuracy issue (previously cos=0.94 against the decode kernel output).
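
A check of this kind can be sketched as below, assuming that "cos" refers to the cosine similarity between a prefill output row and the matching decode output row. The function name and the plain-list inputs are illustrative assumptions, not the repository's test harness.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A value close to 1.0 means the two rows point in nearly the same direction; a value such as 0.94 would indicate a visible mismatch between the two kernels for that row.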

2026-06-03 03:22:49 +00:00

_archive

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

cache

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

kernels

Simplify prefill PV read: use decode kernel's exact pattern

2026-06-03 03:22:49 +00:00

layers

P5 integration + B3 q_a_norm fused + gsa scalar fix