biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:49:47 +00:00
45cf89a556 fix: use TMEM round-trip normalize + epilogue_tma_store (known ~3% error)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:45:32 +00:00
350c7c36ac fix: correct bSG_gC indexing (6 modes)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:44:49 +00:00
6318b4da29 diag: print bSG shapes for TMA store indexing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:44:02 +00:00
28060dd944 fix: typo from_dlcap -> from_dlpack
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:41:09 +00:00
048a546e76 fix: correction_epilog with paired atoms + pre-partitioned TMA store outside if block
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:37:51 +00:00
0700745852 test: NO-OP round-trip + normalize at n=128 and n=256
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:34:34 +00:00
2ebfcb2278 fix: correction_epilog with paired atoms + pre-partitioned TMA store
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:32:41 +00:00
49bf6e8294 diag: NO-OP round-trip before normalize on 2D pattern
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:31:30 +00:00
6cf1f17904 fix: O rescale uses 2D register tensor pattern, remove fence_view_async_tmem_load
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:26:59 +00:00
7842d86294 fix: use paired atoms for correction_epilog + cute.copy TMA store
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:25:47 +00:00
1f4e40decc diag: add CUDA_LAUNCH_BLOCKING for crash debug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:24:38 +00:00
728a24db6a fix: inline epilogue_tma_store with inv_row_sum multiply using paired atoms
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:23:17 +00:00
0ecde542f1 fix: use cute.copy instead of cpasync.copy for TMA store
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:19:42 +00:00
702bf8aa29 fix: correction_epilog with get_tmem_load_op paired atoms + direct TMA store
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:15:31 +00:00
ea66b6ee8d diag: NO-OP TMEM round-trip test — load+store back unchanged
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 02:13:53 +00:00
6ee28d8423 fix: inline epilogue with paired atoms + inv_row_sum normalize, no TMEM round-trip
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:41:37 +00:00
043b66406a fix: all epilogue warps do TMA store, no dynamic if inside method
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:40:14 +00:00
db3572bafb fix: correction_epilog with get_tmem_load_op paired atoms, no TMEM round-trip
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:36:29 +00:00
d99a90ade5 fix: use attn_raw (not softmax'd) for unnorm computation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:35:20 +00:00
7becdaf739 diag: skip kernel normalize, do Python-side normalize to isolate TMEM round-trip issue