biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:50:41 +00:00

103fd451ce fix: use full padded_scales_buf (no GPU scalar slicing in cudagraph)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:48:32 +00:00

2f68c7ba77 fix: cache E2M1 step_to_idx LUT per device (no CPU->CUDA copy in forward)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:47:41 +00:00

6c298be842 fix: use new_tensor instead of torch.tensor for cudagraph (no CPU→CUDA copy)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:44:26 +00:00

53c25bee0b rewrite: cudagraph-safe runner - no dynamic slicing, no GPU scalar indices

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:29:14 +00:00

4300775bfe fix: remove .item() sync in scale reshape — use padded_scales.shape[0] instead

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:27:45 +00:00

5a79065b2b fix: GEMM output should be 2x packed N (float4_e2m1fn_x2 packs 2 per element)
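
The 2x factor in this commit follows from the packing: each stored element holds two 4-bit E2M1 codes, so N packed elements stand for 2N logical values. A minimal pure-Python sketch of that relationship, assuming low-nibble-first ordering (an assumption; the feed does not state the kernel's nibble order) and using hypothetical helper names that are not taken from the repository:

```python
# Magnitudes representable by E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code):
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def pack_pairs(codes):
    """Pack an even-length list of 4-bit codes, two per byte.

    Assumption: the first code of each pair goes in the low nibble.
    """
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(
        (codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
        for i in range(0, len(codes), 2)
    )


def unpack_pairs(packed):
    """Inverse of pack_pairs: each byte yields two 4-bit codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes


codes = [1, 2, 7, 9]
packed = pack_pairs(codes)
# The logical length is twice the packed length.
logical_n = 2 * len(packed)
```

Under these assumptions, sizing an output buffer from the packed dimension alone would allocate half of the logical columns, which is the kind of mismatch the commit subject describes.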

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:26:58 +00:00

95a1345b92 fix: return 2D scale tensor from _assemble_scales_cudagraph_safe

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:21:29 +00:00

533089c9d2 fix: token_indices slice bug + torch.zeros for float4/float8 dtypes

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:13:15 +00:00

54c470e535 fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:08:46 +00:00

f2de95c526 fix: use randint for float4 dummy weights in cudagraph test

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:05:15 +00:00

f66d4b69a4 GPU-only scale assembly + cudagraph test harness

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 18:01:54 +00:00

5121074782 cudagraph-safe CuTeDSL MoE: searchsorted-based scale assembly
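
One way to read "searchsorted-based" here: instead of slicing per expert with host-side integers, each row looks up its expert by binary search over cumulative row offsets. The sketch below illustrates only that lookup, in pure Python with `bisect_right` standing in for a right-sided searchsorted on the GPU. The function name and the offsets layout are hypothetical and are not taken from the repository.

```python
from bisect import bisect_right


def expert_for_rows(expert_end_offsets, num_rows):
    """Map each row index to the expert that owns it.

    expert_end_offsets[e] is the exclusive end row of expert e in the
    expert-sorted token layout (a cumulative sum of per-expert counts).
    Experts with zero rows repeat the previous offset and are skipped
    naturally by the right-sided search.
    """
    return [bisect_right(expert_end_offsets, row) for row in range(num_rows)]


# Expert 0 owns rows 0..2, expert 1 owns nothing, expert 2 owns rows 3..6.
ends = [3, 3, 7]
owners = expert_for_rows(ends, 7)
```

Because the lookup has the same shape for any distribution of tokens across experts, it avoids the data-dependent slice lengths that the later cudagraph-related commits in this feed also remove.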

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 17:55:34 +00:00

ab126b0c0d fix: revert to .item() based scale assembly (fixes index OOB)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 16:36:21 +00:00

7594968482 WIP: cudagraph-compatible CuTeDSL MoE runner

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 10:11:37 +00:00

f0c1be3ced fix: remove broken hc_head warmup (wrong tensor shape)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 09:17:40 +00:00

c803180706 fix: handle freed weight lists in _check_runtime_supported and _run_mega_moe

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 08:54:54 +00:00

cdd813cf7e fix: free per-expert weight lists after stacking in CuTeDSL runner

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 08:39:17 +00:00

99c11c218d fucken a

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 08:28:41 +00:00

906ee80a42 Add tilelang kernel warmup in load_weights

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-05-16 08:23:32 +00:00

a51ef3d2cf fucken a