biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:49:17 +00:00
210391e571 fix: PersistentTileSchedulerParams constructor takes (problem_shape, cluster_shape) not from_shape
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:47:35 +00:00
824d054ad7 fix: inside cute.compile args are already CuTe tensors, no conversion needed
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:46:37 +00:00
6375e54396 fix: use from_dlpack + mark_layout_dynamic instead of non-existent to_cuTe_tensor in router
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:44:54 +00:00
cb2ca8591f fix: add @cute.jit to router compiled function
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:44:03 +00:00
d5d2b7b4b8 fix: defer router MMA/TMA setup into cute.compile context (matches MoE pattern)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:39:53 +00:00
157f1c5258 fix: use OperandMajorMode from nvgpu (not deprecated tcgen05) and mma_tiler_mn in router kernel
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:36:03 +00:00
1dbc57e2cd fix: use mma_tiler_mn in _create_tiled_mma (attribute exists at init time)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:34:55 +00:00
d05dd50bf5 fix: OperandMajorMode.K not MAJOR_K (correct CuTeDSL API)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:33:34 +00:00
a6a8755439 single_shot: switch to head-packed FMHA dispatch (1 kernel launch vs 128)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:28:19 +00:00
80002f2efc single_shot: production NVFP4 GEMM for ALL attention projections
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:21:12 +00:00
32efd5139d Fix gate weight transpose: checkpoint is (E, H), Router expects (H, E)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:20:04 +00:00
e45c0ff51b single_shot: use reference dequant for attn projections, focus on MoE+FMHA
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:18:37 +00:00
dfbffa1df1 single_shot: CUDA_LAUNCH_BLOCKING for debugging
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:17:48 +00:00
a66fdf6049 single_shot: add sync to catch CUDA errors early
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:16:47 +00:00
0b35c36d23 single_shot: memory-efficient MoE loading, lazy Nvfp4Linear init
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:14:26 +00:00
050b5ee449 Fix n_h reference before assignment in single_shot
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:12:23 +00:00
c5adbbfde6 FMHA sink: don't double-scale sink bias
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:11:13 +00:00
4adee1207f FMHA: zero-init my_p_vals to fix N<128 padding NaN
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 23:10:22 +00:00
13be3ad443 FMHA sink bias in kernel + single_shot production rewrite
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-31 22:55:13 +00:00
23e88638aa single_shot: memory-efficient MoE loading (CPU stacking, one-shot GPU transfer)