biondizzle

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 02:45:59 +00:00

05cdde1676 Fix wo_a: scatter each group's data at correct offset in padded buffer

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 02:43:20 +00:00

8fe5546bb3 Fix debug script

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 02:43:06 +00:00

788f0aa65a Add step-by-step debug for wo_a

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 02:41:33 +00:00

5f5b997fc3 Fix wo_a: permute to groups-first layout for grouped GEMM

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 02:40:44 +00:00

77e4970d93 Add debug script for wo_a quantization

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 02:39:57 +00:00

80122b850b Add debug script for wo_a

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 02:37:52 +00:00

ae233ab648 Fix test: cos_sin_cache on CUDA device

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 02:36:35 +00:00

882d4996ff Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj)

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 02:17:55 +00:00

bab1f75f29 Fix gs None error in legacy _ensure_stacked path

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 02:16:44 +00:00

48fa64dfda Eliminate weight copies: pass stacked checkpoint tensors directly

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 02:08:22 +00:00

0612c1ab54 use proper backend

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 01:57:17 +00:00

00fe63b56f Fix compile test: add warmup for activation global scales

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 01:56:48 +00:00

bba3bca4d3 Add torch.compile + custom op integration test

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 01:54:55 +00:00

35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 01:24:43 +00:00

98153002c0 Go back to torch.library.custom_op with correct GEMM impl

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 01:20:09 +00:00

02c500bbb1 Switch to allow_in_graph for Dynamo opacity instead of custom op

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 01:18:56 +00:00

581d87f9a6 Remove warmup forward from process_weights_after_loading

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 01:15:28 +00:00

5d49849156 Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 01:11:46 +00:00

e1fcfc4f01 Add CuTeDSL warmup + CUDA sync after JIT compilation

biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel

2026-05-19 01:05:23 +00:00

1d9c0f996c Fix expert_offsets dtype: CuTeDSL expects int32 not int64