biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 06:44:40 +00:00

e45b94c01b Test: compare both normalized and un-normalized reference

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 06:43:01 +00:00

b70ab2a6ee Return o_accum directly (un-normalized merge result)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 06:41:45 +00:00

6111db571c Match working test: don't pass row_sums to kernel

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 06:39:38 +00:00

312ac52d15 Normalize O_accum by exp(lse) before returning

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 06:38:05 +00:00

ddc701af9b Use exact merge formula from working test_d1_kv_merge.py

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 06:36:26 +00:00

8321ccf9c1 Fix production KV merge: use normalized O for log-sum-exp merge

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 06:34:13 +00:00

98c93c1cd8 Stage E: production attention wrapper + Python KV merge, clean fmha_smem_acc

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:39:44 +00:00

51e456df44 Slice MMA tile coords from tOgO for TMA copy

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:38:41 +00:00

1caa737b09 Move sC_flat_staged creation before const_expr guard

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:37:08 +00:00

3c9dbc0c5d Staged sC_flat with (128, pv_n_tile//2, 2) to match TMA atom

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:35:56 +00:00

de2028b106 Split sC_flat into staged layout to match TMA atom decomposition

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:34:41 +00:00

a0e9f7534b Use tCgC_epi (transformed) for GMEM side of TMA partition

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:33:32 +00:00

b02e103ac0 Add c_simple GMEM tensor (non-dynamic) for SMEM accumulator TMA store

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:31:49 +00:00

2438826eee Use tma_partition with group_modes on both sC_flat and gO

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:30:52 +00:00

603f52de78 Fix gO creation: use slice_(pv_mma_tiler) like fmha.py

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:29:52 +00:00

b39d7f1a14 Try cute.copy(tma_c, sC_flat, gO) directly

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:28:45 +00:00

2af767a90c Try full tensor TMA copy without slicing

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:27:52 +00:00

7d14a2f764 sC_flat with simple (128, pv_n_tile) layout for full epi_tile coverage

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:26:51 +00:00

6fb0e6a417 Use sC_flat (non-swizzled epi_s layout) for TMA store from SMEM accumulator

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-27 05:25:36 +00:00

4a2a06f9e1 Fix gO slice: use separate Int32(0) instead of tuple