biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 17:55:13 +00:00
755f9ad567 debug: fix per_expert_alpha ref + clean up BF16 reference scaling
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 17:02:42 +00:00
de8acc7965 debug: dump raw GEMM inputs + first 8 output values
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 15:49:01 +00:00
9159cb6bb3 docs: add debug log — current state, hypotheses, fixes
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 15:46:18 +00:00
2fd55a94c6 fix: weight reshape bug + igs double-count in BF16 reference
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 14:16:25 +00:00
c421a668f3 debug: BF16 reference GEMM + cosine comparison for L1
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 13:41:10 +00:00
995589ac8a debug: add FP4 quantization round-trip diagnostic
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 13:22:00 +00:00
d0ed3d84a8 debug: add L2, SiLU, and scatter pipeline prints
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 12:50:16 +00:00
da5572f497 clean: remove diagnostic scripts from repo
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 12:42:56 +00:00
fd59222fc0 fix: stop folding global scale into float8 block scales
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 11:44:39 +00:00
56e62e916d revert: idx2crd remap approach — source-first needs hierarchical coords
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 11:39:58 +00:00
d5949a23b4 fix: use cute::crd2idx for SF remap — layout_sf() not directly callable
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 11:38:19 +00:00
9908fd64d9 feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 11:32:16 +00:00
a37a155bae WIP: remove prepack cache, remap SFB per-call inside CUTLASS
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 11:27:16 +00:00
19924275bc WIP: remove prepack cache, remap SFB per-call inside CUTLASS
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 10:56:32 +00:00
74a4475e5b WIP: remove prepack cache, remap SFB per-call inside CUTLASS
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 10:48:36 +00:00
4fed910c9c WIP: remove prepack cache, remap SFB per-call inside CUTLASS
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 10:38:47 +00:00
7adfaef113 fix: in-place prepack to avoid 2× peak memory
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 10:34:21 +00:00
90313f3a92 fix: LRU(2) eviction for prepack cache — prevents OOM across 61 layers
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 10:34:21 +00:00
5dc18df494 feat: MEGA_MOE_PREPACK_CACHE_MAX env var (default 2) with CUDA graph warning
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-15 10:14:32 +00:00
1da6726a86 fix: assert float8_e4m3fn dtype in _prepack_weight_sf