biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:53:46 +00:00
2f2259395e fix: always normalize in kernel, correct KV merge with normalized O + LSE
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:51:25 +00:00
914f76d30c multirow: add normalize flag, un-norm + LSE output, multi-tile KV merge test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:48:56 +00:00
ca5cf0e517 test: add multi-head and batched prefill tests for multirow kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:46:48 +00:00
ac8fa779e2 fix: move epilogue TMEM loads outside my_row_active guard (warp-collective hang)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:22:03 +00:00
55c0604a71 add fence.sc.gpu between PV and epilogue for TMEM visibility
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:19:05 +00:00
52809b0ec6 fix: tcgen05.wait::ld.sync.aligned (was missing 'sync')
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:18:22 +00:00
0220e51d18 fix: typo cudaErrorCudaSuccess -> cudaSuccess
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:17:45 +00:00
468614a4e2 fmha_multirow: non-interleaved design — softmax first, then PV
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:15:38 +00:00
c768abed95 test: softmax-only kernel (QK + row_max, no PV)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:13:33 +00:00
43ba672e15 fmha_multirow: add fence.sc.gpu after QK GEMM for TMEM visibility
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:10:52 +00:00
d840fbbf85 test: clean multirow test with proper SMEM calc
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:09:03 +00:00
f2124b9378 fix: SMEM calc in decode test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:08:35 +00:00
58ff781388 test: simplified decode kernel for debugging multirow
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:07:32 +00:00
be2685e9e3 fmha_multirow: use natural 4-warp TMEM partitioning after UMMA
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:06:09 +00:00
ff8c677486 fix: SMEM size for MMA test — account for both sQ0 and sK0
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:05:32 +00:00
fee022a485 test: MMA→4-warp read using proven fmha_common+umma_desc infra
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:03:52 +00:00
e1a708a187 test: try 16x256b.x1 with column step=4 (4 cols per read)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:03:13 +00:00
95003eced2 test: 16x256b.x1 loads with uint32_t regs, matching working pattern
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:02:26 +00:00
fffb493b0e fix: 16x256b.x1 load syntax — single address operand
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:02:05 +00:00
44dcd6e8d0 test: 16x256b.x1 multiple LOADS — do they crash like stores?
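
Note on commits 914f76d30c and 2f2259395e: both refer to merging KV tiles from a normalized O plus an LSE. The following is a minimal reference sketch of that kind of merge for a single query row, assuming each tile returns a softmax-normalized output and the log-sum-exp of its scores. The function names `attend` and `merge_partials` are hypothetical and do not come from the repository; this is plain Python for checking the arithmetic, not the kernel's implementation.

```python
import math

def attend(q, keys, values):
    """Reference attention for one query row: returns (normalized O, LSE)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)                              # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)              # LSE = log(sum(exp(scores)))

def merge_partials(partials):
    """Merge per-tile (normalized O, LSE) pairs into the full-row result."""
    m = max(lse for _, lse in partials)
    # Each tile is weighted by its share of the total softmax mass.
    scales = [math.exp(lse - m) for _, lse in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    merged = [sum(s * o[d] for s, (o, _) in zip(scales, partials)) / total
              for d in range(dim)]
    return merged, m + math.log(total)
```

Because each tile's O is already normalized, the merge only needs the LSE values to recover the relative weights, so splitting the keys into tiles and merging should match attention over the full key set up to floating-point error.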