biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:22:28 +00:00
bd169ccb0f fix: smart quote in fmha_tma.cuh
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 18:22:02 +00:00
345b107f4c fix: TMA mbarrier — add arrive.expect_tx (root cause of multi-warp hang)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 07:02:09 +00:00
c69f3668e1 feat: TMA async FMHA kernel — WORKING on B200
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 06:52:42 +00:00
a40c05f3f2 archive: TMA driver-API files + CUDA 13 TMA discovery notes
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 06:42:08 +00:00
55f0c6267b auto: pre-test commit
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 06:35:06 +00:00
197cac875c fix: CUDA 13 TMA descriptor — 3D rank + byte strides + mbarrier byte count
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 04:45:54 +00:00
85cd95e609 debug: TMA context fix test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 04:45:07 +00:00
76c82ebdcd debug: detailed TMA descriptor debug test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 04:43:26 +00:00
0c9245b4d2 fix: add cuInit(0) for CUDA driver API
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 04:42:45 +00:00
6cc2f61431 debug: TMA descriptor dimension test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 04:42:04 +00:00
3412ff1a9b fix: TMA tile strides must match global strides, not tile dimensions
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 04:41:04 +00:00
409838ace2 refactor: per-sub-tile TMA loads with padded GMEM allocations
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 04:39:23 +00:00
8c17f65f5b fix: cast typo
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 04:39:14 +00:00
8908b697dd fix: bool type mismatch
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 04:38:59 +00:00
b78ebe8a9c debug: add TMA descriptor error reporting
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 04:37:50 +00:00
c7a6d7d231 fix: tma_mbar_init → tma_mbarrier_init (typo)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 04:36:55 +00:00
696462f07a feat: TMA async load infrastructure for FMHA kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:57:34 +00:00
d1c1eaeddc clean: remove debug prints, multirow kernel complete with multi-tile KV merge
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:55:59 +00:00
c65baabcc9 fix: V tile copy — V is (HD, SK_TOTAL) so tile columns are not contiguous
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 23:54:33 +00:00
869460a932 debug: add LSE verification and merge debug prints