biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:09:35 +00:00
d77c965646 Disable O rescale + normalize, verify softmax P only
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:06:54 +00:00
dcc64dd14d FIX: O sub-tile count should be HEAD_DIM/corr_tile_size, not 128/corr_tile_size
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:06:05 +00:00
48b24ba005 Full pipeline: O rescale + final normalize with CUTLASS sub-tile approach
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:05:28 +00:00
a85894df89 Test softmax P vs unnormalized reference (no O normalize)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 19:04:39 +00:00
c0b39fc2bf O normalize using CUTLASS reference sub-tile approach
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:59:17 +00:00
3dbda0eebb Fix O normalize: use 2D register tensor indexing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:58:36 +00:00
6b61d5274c Add O normalization with sub-tile TMEM read-modify-write
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:57:01 +00:00
b936c6220d Simplify: softmax P only, no O rescale/normalize yet
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:55:32 +00:00
e2fad84205 Real softmax test built on working identity diag
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:53:01 +00:00
d3b662d3a8 CRITICAL FIX: remove extra scale_log2 in softmax (minus_row_max and acc_scale)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:51:23 +00:00
32869c7378 FIX: K slice (None,None,0,0) like working diag
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:49:34 +00:00
4b1fc7ee1f Diag: identity softmax on example6 pipeline to isolate softmax bug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:47:55 +00:00
912f92c6b5 Quick test: working v3 with n=256 multi-tile
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:47:08 +00:00
5b6392beaa DEBUG: add version marker to confirm code changes are running
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:45:57 +00:00
c7d55a5f49 CRITICAL FIX: TMA pre-slice (None,0,None,0) → (None,None,0,0) to keep GMEM tile dim free
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:43:57 +00:00
f734610268 Diag: TMA shapes with hardcoded major modes
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:43:32 +00:00
18a589347c Diag: simplified TMA shape analysis
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:43:07 +00:00
7ad4ddb6ba Diag: print TMA partition shapes for multi-tile debugging
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:26:33 +00:00
67c5a0928d FIX: Use Python range() in TMA warp for concrete per-iteration GMEM coords
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-22 18:25:16 +00:00
54de81985f FIX: Force SSA GMEM coord via n_kv_tiles - n_kv_tiles instead of cutlass.range kt