biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:28:33 +00:00
039c8b90ce diag: print expected unnorm P@V for comparison with raw kernel output
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:27:05 +00:00
ec5b892e32 diag: skip final normalize, test raw PV output via epilogue_tma_store
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:25:54 +00:00
e50644afde fix: O rescale uses 2D register tensor pattern (matching CUTLASS correction_rescale)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:24:41 +00:00
b77c8d83f5 fix: pre-compute tmem_load_epi_atom in __call__, pass to kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:23:06 +00:00
c9271ffbf4 fix: index into TMA partitioned tensors for copy
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:22:25 +00:00
e01ff282b7 fix: use flat_divide+group_modes(0,2) for TMA store, matching CUTLASS
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:20:53 +00:00
5efa9c9297 fix: use gC not tCgC for TMA partition, group modes 0-3
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:20:10 +00:00
7a894c4bf6 fix: use tma_partition for TMA store in correction_epilog
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:19:01 +00:00
3c134f7e90 fix: replace TMEM round-trip normalize with CUTLASS correction_epilog pattern
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:17:16 +00:00
690fd77e6c diag: inv_row_sum=1.0 to test raw PV, n=128 only
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:13:31 +00:00
2b93b10199 diag: test original code n=128+256 to confirm baseline
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:12:51 +00:00
9bcddb68e1 diag: disable O rescale properly, test n=128+256 baseline
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:12:09 +00:00
0ef41266de diag: test n=128 and n=256 both with rescale disabled
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:11:27 +00:00
dc44fa187a fix: indentation error in diag disable
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:11:02 +00:00
a9cace316d diag: disable O rescale to isolate the issue (n=256 only)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 01:02:35 +00:00
08d4af90ca debug: add wide-search diagnostics for n=256 O rescale
biondizzle pushed tag v-multitile-softmax-wip to biondizzle/nvfp4-megamoe-kernel 2026-05-23 00:35:49 +00:00
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 00:35:44 +00:00
f026c1824c 🚀 MULTI-TILE SOFTMAX + O RESCALE WORKING: n=128 cos 0.999998, n=256 cos 0.80
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 00:34:38 +00:00
d511ebe387 Debug: add row_sum/inv_row_sum printf at final normalize
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-23 00:33:39 +00:00
c2ff8e072e Fix ALL loops: use self.n_kv_tiles everywhere
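
Note on the commits above: several of them refer to a multi-tile softmax with an "O rescale" step, a final normalize by inv_row_sum, and a cosine-similarity check against a reference (n=128 vs n=256). The following is a minimal pure-Python sketch of that general technique (online softmax over KV tiles), not the repository's kernel. All function names, the tile size of 128, and the head dimension are assumptions made for illustration; only the terms row_sum, inv_row_sum, and "unnormalized P@V" come from the commit messages.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reference_attention(q, K, V):
    """Single-pass softmax(q . K^T) @ V for one query row (hypothetical helper)."""
    scores = [dot(q, k) for k in K]
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    row_sum = sum(p)
    d = len(V[0])
    return [sum(p[j] * V[j][c] for j in range(len(V))) / row_sum for c in range(d)]


def tiled_attention(q, K, V, tile=128):
    """Online softmax over KV tiles, rescaling the running output O per tile."""
    d = len(V[0])
    row_max = -math.inf   # running max of the scores seen so far
    row_sum = 0.0         # running sum of exp(score - row_max)
    acc = [0.0] * d       # unnormalized P@V accumulator (the "O" being rescaled)
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [dot(q, k) for k in k_tile]
        new_max = max(row_max, max(scores))
        # Rescale factor for everything accumulated under the old max.
        # On the first tile this is exp(-inf) == 0.0, which is harmless
        # because acc and row_sum are still zero.
        scale = math.exp(row_max - new_max)
        p = [math.exp(s - new_max) for s in scores]
        row_sum = row_sum * scale + sum(p)
        acc = [
            acc[c] * scale + sum(p[j] * v_tile[j][c] for j in range(len(v_tile)))
            for c in range(d)
        ]
        row_max = new_max
    inv_row_sum = 1.0 / row_sum   # final normalize happens once, at the end
    return [a * inv_row_sum for a in acc]


def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def check(n, d=16, tile=128, seed=0):
    """Cosine similarity between the tiled and reference outputs for n keys."""
    rng = random.Random(seed)
    q = [rng.gauss(0.0, 1.0) for _ in range(d)]
    K = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    V = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return cosine(tiled_attention(q, K, V, tile), reference_attention(q, K, V))
```

In this sketch, check(128) exercises a single tile (no rescale ever matters) and check(256) exercises two tiles, so a result that is correct at n=128 but wrong at n=256 would point at the rescale of the accumulator or of row_sum between tiles. Setting inv_row_sum to 1.0 in the sketch returns the unnormalized P@V, which mirrors the kind of isolation the diag commits describe.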