biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:53:46 +00:00

2f2259395e fix: always normalize in kernel, correct KV merge with normalized O + LSE
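
Note on 2f2259395e: the commit subject describes merging per-KV-tile results from a normalized output O and a log-sum-exp (LSE) value. The sketch below is not taken from the repository. It is a minimal pure-Python reference for the standard merge identity, assuming each tile returns its softmax-normalized output and the LSE of its raw scores. The names `attention_tile` and `merge_tiles` are hypothetical, and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math

def attention_tile(q, keys, values):
    """Reference attention for one query over one KV tile.

    Returns (normalized output vector, log-sum-exp of the tile's scores).
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                              # subtract the max for stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)

def merge_tiles(partials):
    """Merge per-tile (normalized O, LSE) pairs into one (O, LSE) pair.

    Each tile's output is weighted by exp(LSE_i - LSE_total), which is the
    share of the global softmax mass that falls inside that tile.
    """
    lse_max = max(lse for _, lse in partials)
    scales = [math.exp(lse - lse_max) for _, lse in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    merged = [sum(s * o[d] for s, (o, _) in zip(scales, partials)) / total
              for d in range(dim)]
    return merged, lse_max + math.log(total)
```

Under these assumptions, splitting the keys and values into tiles, calling `attention_tile` on each, and passing the results to `merge_tiles` should reproduce the single-pass result up to floating-point rounding.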

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:51:25 +00:00

914f76d30c multirow: add normalize flag, un-norm + LSE output, multi-tile KV merge test

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:48:56 +00:00

ca5cf0e517 test: add multi-head and batched prefill tests for multirow kernel

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:46:48 +00:00

ac8fa779e2 fix: move epilogue TMEM loads outside my_row_active guard (warp-collective hang)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:22:03 +00:00

55c0604a71 add fence.sc.gpu between PV and epilogue for TMEM visibility

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:19:05 +00:00

52809b0ec6 fix: tcgen05.wait::ld.sync.aligned (was missing 'sync')

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:18:22 +00:00

0220e51d18 fix: typo cudaErrorCudaSuccess -> cudaSuccess

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:17:45 +00:00

468614a4e2 fmha_multirow: non-interleaved design — softmax first, then PV

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:15:38 +00:00

c768abed95 test: softmax-only kernel (QK + row_max, no PV)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:13:33 +00:00

43ba672e15 fmha_multirow: add fence.sc.gpu after QK GEMM for TMEM visibility

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:10:52 +00:00

d840fbbf85 test: clean multirow test with proper SMEM calc

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:09:03 +00:00

f2124b9378 fix: SMEM calc in decode test

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:08:35 +00:00

58ff781388 test: simplified decode kernel for debugging multirow

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:07:32 +00:00

be2685e9e3 fmha_multirow: use natural 4-warp TMEM partitioning after UMMA

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:06:09 +00:00

ff8c677486 fix: SMEM size for MMA test — account for both sQ0 and sK0

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:05:32 +00:00

fee022a485 test: MMA→4-warp read using proven fmha_common+umma_desc infra

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:03:52 +00:00

e1a708a187 test: try 16x256b.x1 with column step=4 (4 cols per read)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:03:13 +00:00

95003eced2 test: 16x256b.x1 loads with uint32_t regs, matching working pattern

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:02:26 +00:00

fffb493b0e fix: 16x256b.x1 load syntax — single address operand

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-28 23:02:05 +00:00

44dcd6e8d0 test: 16x256b.x1 multiple LOADS — do they crash like stores?