biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 20:10:38 +00:00
a54a241052 Revert TMA to kt pattern (n=128 works), multi-tile TMA is separate bug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 20:08:09 +00:00
956ca1ecfd TMA: use self.n_kv_tiles + kv_coord pattern from working diag test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 20:07:24 +00:00
242ffebcd9 REVERT to 0bdcdc0 — the version that passed n=128 cos 0.999998
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:57:54 +00:00
3814838107 DEBUG: disable O rescale + normalize, test if corr setup alone causes regression
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:55:07 +00:00
2c51f82382 Shared corr tensors for O rescale + final normalize, fix softmax loop
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:51:56 +00:00
f165257c50 Add O rescale with correction_rescale pattern + fix TMA to working diag pattern
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:50:38 +00:00
0bdcdc0efd O normalize: exact CUTLASS correction_rescale pattern with 2D reg tensor
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:49:53 +00:00
009cf9f80d O normalize: TMEM round-trip with paired Ld/St atoms + standard epilogue_tma_store
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:49:16 +00:00
f821dd00fe Fix: use NamedBarrier instead of mbarrier_arrive/wait
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:48:33 +00:00
3bd406e925 Fix: barrier_wait → mbarrier_wait, barrier_arrive → mbarrier_arrive
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:47:58 +00:00
544f0ca52b Fix epilogue: corr_tile_size=16, proper epi_subtile tuple, match CUTLASS reference
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:46:01 +00:00
d526aa04fb Fix example7: K slice (None,None,0,0) and softmax scale_log2 double-bug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:25:07 +00:00
558aac0581 Fix: fence_async_shared -> fence_view_async_shared
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:24:26 +00:00
d62e6fc9ca Clean v2: real softmax P, no O TMEM modify, standard epilogue. Baseline for custom epilogue work.
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:18:18 +00:00
183292d919 O normalize using tmem_ptr base (same as epilogue) + CUTLASS sub-tile pattern
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:17:05 +00:00
365e8f53af O normalize with full layout (no sub-tiling), Repetition(64)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:14:34 +00:00
7bc94e610a Disable ALL O copies to verify baseline
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:13:45 +00:00
590d6e9fba Disable O rescale too for NO-OP test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:12:52 +00:00
51cec1405d DEBUG: O load+store NO-OP to verify TMEM copy correctness
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:11:18 +00:00
e3e2668192 Re-enable O rescale + normalize with corr_tile_size=32