biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 03:49:05 +00:00
0ecb98daee auto: pre-test commit
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 03:48:52 +00:00
6f94925491 NVFP4-1.1: fix cute.math.fmax -> cute.arch.fmax (correct CuTeDSL API)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 03:46:47 +00:00
60790564f0 NVFP4-1.1: fix test - two-pass kernel, cute.arch.store confirmed on B200
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 03:42:39 +00:00
ca9f920414 auto: pre-test commit
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 03:42:26 +00:00
a41de129cb NVFP4-1.1: fix test kernel - use cute.copy instead of cute.arch.store
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 03:40:55 +00:00
3a78bdf570 NVFP4-1.1: add CuTeDSL kernel test for FP4 quantization
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-28 03:40:08 +00:00
80b6b79f9e NVFP4-1.1: FP4 quantization primitives for CuTeDSL kernels
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 15:15:08 +00:00
b9f15c250f Stage E: head-packed MQA/GQA, batch dim, custom_op, integration API
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 08:31:25 +00:00
2412a5431b MQA/GQA: batch Q heads into kernel batch dim, shared K/V per KV group
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 07:12:03 +00:00
06a895ff99 Clean test suite for production attention (1/2/4 segments, multi-head)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 07:10:02 +00:00
778d9d4f4f Compile with row_sums tensor so kernel writes per-row row_sums
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 07:07:53 +00:00
0736a04d9b Fix KV merge: use NORMALIZED O (O_unnorm/row_sum) with LSE
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 07:04:41 +00:00
06e7f7ab48 Debug: print LSE values for 2-segment merge
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:58:30 +00:00
8f8d14c300 Match tensor slicing exactly to test_d1_kv_merge (2D slices, 3D unsqueeze)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:56:05 +00:00
6ee61717c0 Match tensor shapes from working test_d1_kv_merge
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:54:18 +00:00
3a25c7feff Test multi-KV merge (2 segments) separately from multi-head
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:52:45 +00:00
36a6f07a7e Fix: unsqueeze k/v when dim==2
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:51:21 +00:00
fc4172937c Clean production wrapper: always normalize=False + KV merge
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:49:03 +00:00
8f87109f86 Single-segment: use normalize=False + per-row normalization from row_sums
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-27 06:46:31 +00:00
fe55bf23a0 Split single-segment (normalized) and multi-segment (KV merge) paths
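
Several of the commits above (0736a04d9b, 8f87109f86, fe55bf23a0) deal with merging attention results from multiple KV segments using a normalized O (O_unnorm/row_sum) together with the LSE. The following is a minimal pure-Python sketch of that idea, not the repository's kernel code. The function names `attend_segment` and `merge_segments` are hypothetical and chosen for illustration only, and the sketch assumes a single query row with unscaled dot-product scores.

```python
import math

def attend_segment(q, keys, values):
    """Softmax attention of one query row over one KV segment.

    Returns (normalized O, LSE), where normalized O = O_unnorm / row_sum
    and LSE is the log-sum-exp of the raw scores for this segment.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                                  # row max, for numerical stability
    weights = [math.exp(s - m) for s in scores]
    row_sum = sum(weights)
    dim = len(values[0])
    o_unnorm = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    o = [x / row_sum for x in o_unnorm]              # normalized O
    lse = m + math.log(row_sum)
    return o, lse

def merge_segments(partials):
    """Combine per-segment (normalized O, LSE) pairs into one output row.

    Each segment is weighted by exp(LSE_j - max LSE), which is proportional
    to that segment's share of the full softmax denominator.
    """
    lse_max = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - lse_max) for _, lse in partials]
    total = sum(coeffs)
    dim = len(partials[0][0])
    return [
        sum(c * o[d] for c, (o, _) in zip(coeffs, partials)) / total
        for d in range(dim)
    ]
```

Under these assumptions, merging the outputs of two half-length segments should agree with attention computed over the concatenated keys and values, up to floating-point rounding. Passing the unnormalized O into the same merge would weight each segment by its row_sum twice, which appears to be the kind of error commit 0736a04d9b addresses.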