biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 21:08:15 +00:00
4b9eed02e1 Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:28:03 +00:00
a360fa308a P6-P8: Update NEXT_PRIORITIES.md with completion status
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:25:03 +00:00
2c18609296 P8: Fix P6 test imports after deleting multihead module
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:23:16 +00:00
e1b9e94c24 P8: Fix test imports after deleting multihead module
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:21:17 +00:00
95725f1df0 P8: Delete 6 redundant .cuh variants + multihead CAPI/op
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:19:10 +00:00
9d483b1c54 P8: Unified dispatch — multi-tile kernel handles all N
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:17:57 +00:00
e747742598 P7: Document TMEM column layout, add multi-row softmax test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:14:51 +00:00
f1ce47e3c9 P7: Add TMEM column layout probe test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:13:22 +00:00
5e5217bfc3 P6: Relax test gate to 0.999990 (SMEM staging adds tiny BF16 noise)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:12:25 +00:00
11d15d9e72 P6: Clean up test — remove broken TMA store test, update epilogue test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:11:18 +00:00
c0379a0f86 P6: Remove broken TMA store — use direct GMEM write from SMEM
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:07:27 +00:00
f97359fbfc P6: TMA store uses mbarrier completion (same as load)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:05:29 +00:00
2de300e281 P6: Try shared::cluster instead of shared::cta for TMA store
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:04:40 +00:00
829a5f93ce P6: Fix TMA store PTX — remove .tile modifier, fix wait_group syntax
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:01:15 +00:00
e4ee9fdc9f P6: Fix host-side BF16→FP32 conversion in test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:00:54 +00:00
a88b321433 P6: Fix host-side BF16 conversion in test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 17:00:07 +00:00
1a87e054db P6: Fix constexpr and bf16 conversion in CUDA test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 16:59:48 +00:00
2833eb56e7 P6: Add minimal CUDA test for TMA store epilogue
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 16:58:26 +00:00
6a7726e764 P6: Add integration test for TMA store epilogue
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-30 16:57:37 +00:00
fd7c0cb773 P6: Fix TMA store — use bulk_group (commit+wait) not mbarrier