biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 17:55:13 +00:00

755f9ad567 debug: fix per_expert_alpha ref + clean up BF16 reference scaling

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 17:02:42 +00:00

de8acc7965 debug: dump raw GEMM inputs + first 8 output values

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 15:49:01 +00:00

9159cb6bb3 docs: add debug log — current state, hypotheses, fixes

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 15:46:18 +00:00

2fd55a94c6 fix: weight reshape bug + igs double-count in BF16 reference

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 14:16:25 +00:00

c421a668f3 debug: BF16 reference GEMM + cosine comparison for L1

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 13:41:10 +00:00

995589ac8a debug: add FP4 quantization round-trip diagnostic

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 13:22:00 +00:00

d0ed3d84a8 debug: add L2, SiLU, and scatter pipeline prints

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 12:50:16 +00:00

da5572f497 clean: remove diagnostic scripts from repo

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 12:42:56 +00:00

fd59222fc0 fix: stop folding global scale into float8 block scales

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 11:44:39 +00:00

56e62e916d revert: idx2crd remap approach — source-first needs hierarchical coords

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 11:39:58 +00:00

d5949a23b4 fix: use cute::crd2idx for SF remap — layout_sf() not directly callable

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 11:38:19 +00:00

9908fd64d9 feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 11:32:16 +00:00

a37a155bae WIP: remove prepack cache, remap SFB per-call inside CUTLASS

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 11:27:16 +00:00

19924275bc WIP: remove prepack cache, remap SFB per-call inside CUTLASS

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 10:56:32 +00:00

74a4475e5b WIP: remove prepack cache, remap SFB per-call inside CUTLASS

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 10:48:36 +00:00

4fed910c9c WIP: remove prepack cache, remap SFB per-call inside CUTLASS

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 10:38:47 +00:00

7adfaef113 fix: in-place prepack to avoid 2× peak memory

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 10:34:21 +00:00

90313f3a92 fix: LRU(2) eviction for prepack cache — prevents OOM across 61 layers

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 10:34:21 +00:00

5dc18df494 feat: MEGA_MOE_PREPACK_CACHE_MAX env var (default 2) with CUDA graph warning

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-15 10:14:32 +00:00

1da6726a86 fix: assert float8_e4m3fn dtype in _prepack_weight_sf