biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:44:40 +00:00
e45b94c01b Test: compare both normalized and un-normalized reference
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:43:01 +00:00
b70ab2a6ee Return o_accum directly (un-normalized merge result)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:41:45 +00:00
6111db571c Match working test: don't pass row_sums to kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:39:38 +00:00
312ac52d15 Normalize O_accum by exp(lse) before returning
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:38:05 +00:00
ddc701af9b Use exact merge formula from working test_d1_kv_merge.py
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:36:26 +00:00
8321ccf9c1 Fix production KV merge: use normalized O for log-sum-exp merge
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:34:13 +00:00
98c93c1cd8 Stage E: production attention wrapper + Python KV merge, clean fmha_smem_acc
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:39:44 +00:00
51e456df44 Slice MMA tile coords from tOgO for TMA copy
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:38:41 +00:00
1caa737b09 Move sC_flat_staged creation before const_expr guard
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:37:08 +00:00
3c9dbc0c5d Staged sC_flat with (128, pv_n_tile//2, 2) to match TMA atom
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:35:56 +00:00
de2028b106 Split sC_flat into staged layout to match TMA atom decomposition
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:34:41 +00:00
a0e9f7534b Use tCgC_epi (transformed) for GMEM side of TMA partition
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:33:32 +00:00
b02e103ac0 Add c_simple GMEM tensor (non-dynamic) for SMEM accumulator TMA store
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:31:49 +00:00
2438826eee Use tma_partition with group_modes on both sC_flat and gO
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:30:52 +00:00
603f52de78 Fix gO creation: use slice_(pv_mma_tiler) like fmha.py
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:29:52 +00:00
b39d7f1a14 Try cute.copy(tma_c, sC_flat, gO) directly
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:28:45 +00:00
2af767a90c Try full tensor TMA copy without slicing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:27:52 +00:00
7d14a2f764 sC_flat with simple (128, pv_n_tile) layout for full epi_tile coverage
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:26:51 +00:00
6fb0e6a417 Use sC_flat (non-swizzled epi_s layout) for TMA store from SMEM accumulator
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 05:25:36 +00:00
4a2a06f9e1 Fix gO slice: use separate Int32(0) instead of tuple
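
Note on the KV merge commits above (8321ccf9c1, ddc701af9b, 312ac52d15, b70ab2a6ee): these go back and forth on whether the per-split output O is normalized before the log-sum-exp merge. The repository's merge code is not shown in this feed, so what follows is only a generic sketch of the standard log-sum-exp merge for attention computed over separate KV chunks. It assumes each chunk returns a normalized output together with its log-sum-exp. The names attention_chunk and merge_chunks are hypothetical and do not come from the repository. Un-normalized accumulators would need a different scaling, which this sketch does not cover.

```python
import math

def attention_chunk(q, keys, values):
    """Attention of one query over one KV chunk (hypothetical helper).

    Returns (o, lse): o is the normalized output vector for this chunk,
    lse is the log-sum-exp of the chunk's attention scores.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                       # subtract the max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    lse = m + math.log(total)
    dim = len(values[0])
    o = [sum(w * v[d] for w, v in zip(weights, values)) / total
         for d in range(dim)]
    return o, lse

def merge_chunks(partials):
    """Merge normalized per-chunk outputs, weighting each by exp(lse)."""
    lse_max = max(lse for _, lse in partials)
    scales = [math.exp(lse - lse_max) for _, lse in partials]
    denom = sum(scales)
    dim = len(partials[0][0])
    merged = [sum(s * o[d] for s, (o, _) in zip(scales, partials)) / denom
              for d in range(dim)]
    merged_lse = lse_max + math.log(denom)
    return merged, merged_lse

# Usage: attention over two KV chunks, merged, versus one pass over all KV.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-2.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
full_o, full_lse = attention_chunk(q, keys, values)
split_o, split_lse = merge_chunks([
    attention_chunk(q, keys[:2], values[:2]),
    attention_chunk(q, keys[2:], values[2:]),
])
```

Under these assumptions the merged output and merged log-sum-exp match the single-pass result up to floating-point error, which is the kind of normalized-reference check that commit e45b94c01b describes.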