biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:45:03 +00:00
d64b62bc80 test: simple (128,16) TMA desc for K sub-tile only
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:44:20 +00:00
eaf8a878cf fix: only warp 0 lane 0 issues TMA (not all lane 0 threads)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:43:45 +00:00
69bf20b09d fix: SMEM alignment in TMA K-only test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:43:08 +00:00
2c0ee69aea test: TMA K-only — proven gen pattern + TMA for K loads only
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:42:05 +00:00
9fc2d549e4 fix: warp-collective TMEM read/dealloc in minimal QK test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:40:25 +00:00
c755e6fdde fix: TMEM read/dealloc for 128-thread kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:40:14 +00:00
bd1309ba88 test: minimal QK — 128 threads, tid==0 MMA, match working gen kernel pattern
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:37:39 +00:00
39aef1284f fix: smem size in minimal QK test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:37:22 +00:00
ce89fe9170 test: minimal QK — separate sQ0/sK0, clean SMEM layout
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:35:01 +00:00
71b353577d fix: QK direct test — per-K-sub-tile Q load (same as working kernel)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:33:40 +00:00
35d0596893 fix: T=1 for QK direct test (write_q_to_smem only handles row 0)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:32:28 +00:00
bee7cc5f8f fix: lane vs threadIdx.x in direct QK test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:31:59 +00:00
670599b754 test: direct QK GEMM — baseline for TMA comparison
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:30:54 +00:00
9a185f0222 test: debug Q SMEM canonical after TMA load
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:29:51 +00:00
1500020593 test: QK-only TMA test — isolate TMA load + canonical + MMA
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:28:48 +00:00
204cc90808 fix: load full Q (128,HD) once before QK loop — not per K-sub-tile
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:27:08 +00:00
bf7cf54a51 fix: align TMA SMEM to 128 bytes in verification test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:26:39 +00:00
befc2c647b test: TMA load verification — compare against direct GMEM read
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:25:48 +00:00
8e09fae3a1 fix: warp-stride for TMA canonical writes — only load warp calls them
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:24:49 +00:00
3e14a25bb0 fix: don't re-init mbarrier in loop — use phase parity tracking