biondizzle
  • Joined on 2025-12-10
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 02:45:59 +00:00
05cdde1676 Fix wo_a: scatter each group's data at correct offset in padded buffer
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 02:43:20 +00:00
8fe5546bb3 Fix debug script
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 02:43:06 +00:00
788f0aa65a Add step-by-step debug for wo_a
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 02:41:33 +00:00
5f5b997fc3 Fix wo_a: permute to groups-first layout for grouped GEMM
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 02:40:44 +00:00
77e4970d93 Add debug script for wo_a quantization
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 02:39:57 +00:00
80122b850b Add debug script for wo_a
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 02:37:52 +00:00
ae233ab648 Fix test: cos_sin_cache on CUDA device
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 02:36:35 +00:00
882d4996ff Replace DeepGEMM fp8_einsum with CuTeDSL NVFP4 for wo_a (o_proj)
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 02:17:55 +00:00
bab1f75f29 Fix gs None error in legacy _ensure_stacked path
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 02:16:44 +00:00
48fa64dfda Eliminate weight copies: pass stacked checkpoint tensors directly
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 02:08:22 +00:00
0612c1ab54 use proper backend
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 01:57:17 +00:00
00fe63b56f Fix compile test: add warmup for activation global scales
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 01:56:48 +00:00
bba3bca4d3 Add torch.compile + custom op integration test
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 01:54:55 +00:00
35fab6cff3 Replace autograd.Function with torch.library.custom_op for Dynamo compat
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 01:24:43 +00:00
98153002c0 Go back to torch.library.custom_op with correct GEMM impl
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 01:20:09 +00:00
02c500bbb1 Switch to allow_in_graph for Dynamo opacity instead of custom op
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 01:18:56 +00:00
581d87f9a6 Remove warmup forward from process_weights_after_loading
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 01:15:28 +00:00
5d49849156 Fix: torch.zeros doesn't support Float4_e2m1fn_x2 dtype
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 01:11:46 +00:00
e1fcfc4f01 Add CuTeDSL warmup + CUDA sync after JIT compilation
biondizzle pushed to proper-nvfp4-integration at biondizzle/nvfp4-megamoe-kernel 2026-05-19 01:05:23 +00:00
1d9c0f996c Fix expert_offsets dtype: CuTeDSL expects int32 not int64