nvfp4-megamoe-kernel

Files

biondizzle afb82b9c89 Fix B2 indexer: replace broken 16x256b TMEM read with proven 32x32b.x8

Root causes:
1. tcgen05.ld.16x256b.x1 was hanging: either the instruction is invalid or the access was unaligned
2. TMEM_COLS=128 was too small for 64-row MMA output (needs 256 for 2 row-groups)
3. TMEM row-group addressing was not accounted for: rows 32-63 are at offset SK_TILE (128) in TMEM

Fixes:
- Use tcgen05.ld.32x32b.x8 (proven in B1 FMHA) instead of 16x256b.x1
- Increase TMEM_COLS from 128 to 256
- Read both row-groups (0-31 and 32-63) per 8-column chunk
- Each lane handles head i (from row-group 0) and head 32+i (from row-group 1)
- Warp-level reduce sums contributions from all 64 heads per column
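
The read pattern in the fixes above can be modelled on the host as a rough sketch. This is not the kernel code: it assumes TMEM behaves like a 32-lane by TMEM_COLS array, that row-group 1 starts at column offset SK_TILE, and that the reduction is a plain sum over heads. The helper names (tmem_coords, fill_tmem, warp_sum, column_sums) are hypothetical and exist only for illustration.

```python
# Host-side model only; names and layout are assumptions, not the kernel source.
SK_TILE = 128      # columns per row-group, per the commit message
TMEM_COLS = 256    # two row-groups of SK_TILE columns each
LANES = 32         # lanes in one warp
HEADS = 64         # rows of the MMA output (one row per head)
CHUNK = 8          # columns handled per read, mirroring the .x8 chunking


def tmem_coords(head, col):
    """Map (head, col) of the 64 x SK_TILE output to an assumed (lane, tmem_col)."""
    row_group = head // LANES            # 0 for heads 0-31, 1 for heads 32-63
    return head % LANES, row_group * SK_TILE + col


def fill_tmem(scores):
    """Scatter a HEADS x SK_TILE matrix into a LANES x TMEM_COLS model of TMEM."""
    tmem = [[0] * TMEM_COLS for _ in range(LANES)]
    for head in range(HEADS):
        for col in range(SK_TILE):
            lane, tcol = tmem_coords(head, col)
            tmem[lane][tcol] = scores[head][col]
    return tmem


def warp_sum(values):
    """Butterfly (XOR) reduction across 32 lanes; every lane ends with the total."""
    vals = list(values)
    offset = LANES // 2
    while offset:
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(LANES)]
        offset //= 2
    return vals[0]


def column_sums(tmem):
    """Sum all 64 heads per column, reading both row-groups per 8-column chunk."""
    out = [0] * SK_TILE
    for base in range(0, SK_TILE, CHUNK):
        for c in range(base, base + CHUNK):
            # Lane i adds head i (row-group 0) and head 32+i (row-group 1).
            partial = [tmem[lane][c] + tmem[lane][SK_TILE + c]
                       for lane in range(LANES)]
            out[c] = warp_sum(partial)
    return out
```

With TMEM_COLS left at 128, the index SK_TILE + c in this model would fall outside the array, which is the same shortfall root cause 2 describes.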

2026-06-03 00:39:49 +00:00

_archive

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

cache

Cleanup Step 2: Archive Lineage P code, fix broken imports

2026-06-02 19:27:07 +00:00

kernels

Fix B2 indexer: replace broken 16x256b TMEM read with proven 32x32b.x8

2026-06-03 00:39:49 +00:00

layers

P5 integration + B3 q_a_norm fused + gsa scalar fix