biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:49:17 +00:00

210391e571 fix: PersistentTileSchedulerParams constructor takes (problem_shape, cluster_shape) not from_shape

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:47:35 +00:00

824d054ad7 fix: inside cute.compile args are already CuTe tensors, no conversion needed

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:46:37 +00:00

6375e54396 fix: use from_dlpack + mark_layout_dynamic instead of non-existent to_cuTe_tensor in router

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:44:54 +00:00

cb2ca8591f fix: add @cute.jit to router compiled function

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:44:03 +00:00

d5d2b7b4b8 fix: defer router MMA/TMA setup into cute.compile context (matches MoE pattern)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:39:53 +00:00

157f1c5258 fix: use OperandMajorMode from nvgpu (not deprecated tcgen05) and mma_tiler_mn in router kernel

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:36:03 +00:00

1dbc57e2cd fix: use mma_tiler_mn in _create_tiled_mma (attribute exists at init time)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:34:55 +00:00

d05dd50bf5 fix: OperandMajorMode.K not MAJOR_K (correct CuTeDSL API)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:33:34 +00:00

a6a8755439 single_shot: switch to head-packed FMHA dispatch (1 kernel launch vs 128)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:28:19 +00:00

80002f2efc single_shot: production NVFP4 GEMM for ALL attention projections

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:21:12 +00:00

32efd5139d Fix gate weight transpose: checkpoint is (E, H), Router expects (H, E)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:20:04 +00:00

e45c0ff51b single_shot: use reference dequant for attn projections, focus on MoE+FMHA

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:18:37 +00:00

dfbffa1df1 single_shot: CUDA_LAUNCH_BLOCKING for debugging

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:17:48 +00:00

a66fdf6049 single_shot: add sync to catch CUDA errors early

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:16:47 +00:00

0b35c36d23 single_shot: memory-efficient MoE loading, lazy Nvfp4Linear init

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:14:26 +00:00

050b5ee449 Fix n_h reference before assignment in single_shot

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:12:23 +00:00

c5adbbfde6 FMHA sink: don't double-scale sink bias

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:11:13 +00:00

4adee1207f FMHA: zero-init my_p_vals to fix N<128 padding NaN

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 23:10:22 +00:00

13be3ad443 FMHA sink bias in kernel + single_shot production rewrite

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-31 22:55:13 +00:00

23e88638aa single_shot: memory-efficient MoE loading (CPU stacking, one-shot GPU transfer)