biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:50:41 +00:00
103fd451ce fix: use full padded_scales_buf (no GPU scalar slicing in cudagraph)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:48:32 +00:00
2f68c7ba77 fix: cache E2M1 step_to_idx LUT per device (no CPU->CUDA copy in forward)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:47:41 +00:00
6c298be842 fix: use new_tensor instead of torch.tensor for cudagraph (no CPU→CUDA copy)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:44:26 +00:00
53c25bee0b rewrite: cudagraph-safe runner - no dynamic slicing, no GPU scalar indices
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:29:14 +00:00
4300775bfe fix: remove .item() sync in scale reshape — use padded_scales.shape[0] instead
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:27:45 +00:00
5a79065b2b fix: GEMM output should be 2x packed N (float4_e2m1fn_x2 packs 2 per element)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:26:58 +00:00
95a1345b92 fix: return 2D scale tensor from _assemble_scales_cudagraph_safe
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:21:29 +00:00
533089c9d2 fix: token_indices slice bug + torch.zeros for float4/float8 dtypes
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:13:15 +00:00
54c470e535 fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:08:46 +00:00
f2de95c526 fix: use randint for float4 dummy weights in cudagraph test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:05:15 +00:00
f66d4b69a4 GPU-only scale assembly + cudagraph test harness
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 18:01:54 +00:00
5121074782 cudagraph-safe CuTeDSL MoE: searchsorted-based scale assembly
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 17:55:34 +00:00
ab126b0c0d fix: revert to .item() based scale assembly (fixes index OOB)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 16:36:21 +00:00
7594968482 WIP: cudagraph-compatible CuTeDSL MoE runner
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 10:11:37 +00:00
f0c1be3ced fix: remove broken hc_head warmup (wrong tensor shape)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 09:17:40 +00:00
c803180706 fix: handle freed weight lists in _check_runtime_supported and _run_mega_moe
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 08:54:54 +00:00
cdd813cf7e fix: free per-expert weight lists after stacking in CuTeDSL runner
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 08:39:17 +00:00
99c11c218d fucken a
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 08:28:41 +00:00
906ee80a42 Add tilelang kernel warmup in load_weights
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-05-16 08:23:32 +00:00
a51ef3d2cf fucken a