biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:24:41 +00:00
bf36979a8d Use CUTLASS FMHA reference pattern for sC->GMEM TMA store (flat_divide + tma_partition)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:15:49 +00:00
97bc6d8d2f Add c_direct GMEM tensor for direct writes in SMEM accumulator path
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:14:35 +00:00
3d349b497b SMEM accumulator: direct GMEM write from sO_acc (bypass TMA for multi-kt)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:13:39 +00:00
7d1e0a605d Different coordinate dims for bSG_sC (2D) and bSG_gC (3D)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:13:00 +00:00
75b272c5f2 2D coordinate for bSG_sC TMA copy
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:12:15 +00:00
72dff90165 3D coordinate for bSG_sC/gC TMA copy
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:11:27 +00:00
b8b6e8cc0b Slice bSG_gC MMA tile coords for TMA copy
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:10:42 +00:00
754740d5e5 Try bSG_sC[(None, 0)] for TMA copy coordinate
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:09:57 +00:00
23a2b49daf Add SMEM accumulator for n_kv_tiles>1: O load from TMEM, accumulate in sO_acc, TMA store from sC
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:06:54 +00:00
a858ed1c14 Fix test: normalize=False for un-normalized O comparison
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:05:43 +00:00
2e262d2b99 Reset fmha_smem_acc.py to working fmha.py base
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:05:03 +00:00
b43ffe9dac Guard sO_acc allocation/zero-init with n_kv_tiles>1
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:02:52 +00:00
101840c78c Guard SMEM accumulation with n_kv_tiles>1 to avoid TMEM destructive read
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:01:41 +00:00
02a34512cb Use epilogue_tma_store for n_kv_tiles=1; TODO for multi-tile
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:00:40 +00:00
4652cab8b4 Fix: 3D coords for TMA copy (bSG_sC has 3 modes)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:00:05 +00:00
b0ebf41ee3 Slice bSG_gC with mma_tile_coord (like epilogue_tma_store)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 04:59:31 +00:00
eb0bf0cce0 Fix TMA store: use bSG_sC[(None,0)] indexing pattern from epilogue_tma_store
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 04:58:48 +00:00
7ea77a121f Use cpasync.tma_partition for SMEM->GMEM TMA store (like epilogue_tma_store)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 04:57:41 +00:00
e614d0894c Clean up SMEM acc epilogue: flat indexing sO_acc->sC, TMA store from sC_s0
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 04:55:20 +00:00
1724eeb8ec Fix TMA store: use epi_s view of sC for proper layout compatibility