biondizzle
0 Followers · 0 Following
Joined on 2025-12-10
Public Activity
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:53:46 +00:00
2f2259395e
fix: always normalize in kernel, correct KV merge with normalized O + LSE
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:51:25 +00:00
914f76d30c
multirow: add normalize flag, un-norm + LSE output, multi-tile KV merge test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:48:56 +00:00
ca5cf0e517
test: add multi-head and batched prefill tests for multirow kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:46:48 +00:00
ac8fa779e2
fix: move epilogue TMEM loads outside my_row_active guard (warp-collective hang)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:22:03 +00:00
55c0604a71
add fence.sc.gpu between PV and epilogue for TMEM visibility
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:19:05 +00:00
52809b0ec6
fix: tcgen05.wait::ld.sync.aligned (was missing 'sync')
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:18:22 +00:00
0220e51d18
fix: typo cudaErrorCudaSuccess -> cudaSuccess
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:17:45 +00:00
468614a4e2
fmha_multirow: non-interleaved design — softmax first, then PV
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:15:38 +00:00
c768abed95
test: softmax-only kernel (QK + row_max, no PV)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:13:33 +00:00
43ba672e15
fmha_multirow: add fence.sc.gpu after QK GEMM for TMEM visibility
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:10:52 +00:00
d840fbbf85
test: clean multirow test with proper SMEM calc
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:09:03 +00:00
f2124b9378
fix: SMEM calc in decode test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:08:35 +00:00
58ff781388
test: simplified decode kernel for debugging multirow
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:07:32 +00:00
be2685e9e3
fmha_multirow: use natural 4-warp TMEM partitioning after UMMA
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:06:09 +00:00
ff8c677486
fix: SMEM size for MMA test — account for both sQ0 and sK0
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:05:32 +00:00
fee022a485
test: MMA→4-warp read using proven fmha_common+umma_desc infra
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:03:52 +00:00
e1a708a187
test: try 16x256b.x1 with column step=4 (4 cols per read)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:03:13 +00:00
95003eced2
test: 16x256b.x1 loads with uint32_t regs, matching working pattern
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:02:26 +00:00
fffb493b0e
fix: 16x256b.x1 load syntax — single address operand
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-05-28 23:02:05 +00:00
44dcd6e8d0
test: 16x256b.x1 multiple LOADS — do they crash like stores?