biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 22:46:23 +00:00
74145a31cc feat: V TMA loads in multi-tile kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 22:42:48 +00:00
680d2ebf64 test: V TMA diagnostic — isolate V TMA descriptor issue
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 20:02:01 +00:00
077fbdf3c5 test: HD=128/256 multi-tile variants
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:59:24 +00:00
7df17384fd test: multi-tile s_k=128/256/384/512
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:57:57 +00:00
d47b2bfcce fix: use un-normalized P for multi-tile PV (correct online softmax merge)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:56:37 +00:00
43ae3e7f98 fix: reload Q per-K-sub-tile in multi-tile kernel (same as single-tile)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:53:04 +00:00
7598d548ee debug: test multi-tile with s_k=128 only
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:49:55 +00:00
8e99bd50e6 feat: 6-warp TMA multi-tile KV kernel with register accumulator + test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:47:50 +00:00
1814510195 wip: add n_kv_tiles param for multi-tile KV (not yet used)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:45:46 +00:00
d20792aa9d fix: TMA descriptor index for batched multi-head (batch*n_h + head)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:44:59 +00:00
754c6a692c feat: per-head TMA descriptors for multi-head FMHA
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:43:42 +00:00
9eb193458e test: refactored multi-row TMA test with multi-head and batch
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:41:43 +00:00
832a04181d test: relax relative error threshold to 5% for BF16, use cosine > 0.999 as pass criterion
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:40:34 +00:00
bfef94f5d0 test: HD=128/256 multi-row TMA FMHA
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:39:19 +00:00
a1b2ab79a1 feat: 6-warp TMA FMHA multi-row kernel + test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:36:43 +00:00
d0a50f1f2e fix: remove double normalization in TMA epilogue (P already normalized before PV)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:35:47 +00:00
fb971781aa fix: revert V to direct load (V TMA needs debugging), K TMA works
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:34:49 +00:00
cd2c028b39 feat: TMA loads for both K and V in 6-warp FMHA kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:32:52 +00:00
523d3838a2 test: HD=128/256 variants for TMA FMHA
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-29 19:32:19 +00:00
bd4f09d514 fix: ambiguous MMA_K_BF16 in test