biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 08:01:21 +00:00
3549a2388b fix: constexpr HD for template param
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 08:00:44 +00:00
7436315309 feat: add tcgen05.mma QK GEMM verification kernel + test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:54:03 +00:00
6fb3d54c02 docs: update here-docs with CuTeDSL rationale for NVIDIA
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:49:50 +00:00
9524b674ab test: enable both reference + TMEM epilogue tests at hd=64/128
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:49:04 +00:00
446a0ca9fd refactor(tmem): clean rewrite of TMEM epilogue kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:47:39 +00:00
c989dc78d9 debug: print sPvBuf[32] value
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:46:58 +00:00
146e4f0282 debug: print NaN positions in test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:46:22 +00:00
b50f6a8512 debug: add TMEM read diagnostic
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:45:33 +00:00
a12607b0bd test: add NaN counter to FMHA test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:44:46 +00:00
53c676c8a6 test: add max_abs_diff to FMHA test output
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:44:04 +00:00
579dd061cd fix: remove duplicate TMEM_COLS_NEEDED declarations
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:43:42 +00:00
278f1b34af fix(tmem): correct lane-to-position mapping for tcgen05.ld/st
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:42:19 +00:00
593bc25afa test: add TMEM lane mapping diagnostics
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:41:18 +00:00
33cedbee0a fix(tmem): TMEM ld/st are warp-collective — ALL 32 lanes must call them
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:40:06 +00:00
cea02fe407 fix: add cstdio for printf in TMEM debug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:39:55 +00:00
0ddcc6bafd debug: add printf to TMEM kernel to find hang point
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:38:50 +00:00
44fb04fa1f test: disable tmem epilogue test (debugging reference hang)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:36:13 +00:00
224d7e24c6 harness: add fire_b200_cuda_test + check_b200_cuda, update README
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:31:44 +00:00
cec505ce14 add CUDA test runner script (screen-based, follows harness pattern)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 07:27:36 +00:00
2eb44a00bf fix(tmem): warp-collective TMEM ops + one-way correction epilogue