biondizzle
0 Followers · 0 Following · Joined on 2025-12-10
Repositories: 25
Public Activity
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:50:41 +00:00)
103fd451ce fix: use full padded_scales_buf (no GPU scalar slicing in cudagraph)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:48:32 +00:00)
2f68c7ba77 fix: cache E2M1 step_to_idx LUT per device (no CPU->CUDA copy in forward)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:47:41 +00:00)
6c298be842 fix: use new_tensor instead of torch.tensor for cudagraph (no CPU→CUDA copy)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:44:26 +00:00)
53c25bee0b rewrite: cudagraph-safe runner - no dynamic slicing, no GPU scalar indices

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:29:14 +00:00)
4300775bfe fix: remove .item() sync in scale reshape — use padded_scales.shape[0] instead

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:27:45 +00:00)
5a79065b2b fix: GEMM output should be 2x packed N (float4_e2m1fn_x2 packs 2 per element)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:26:58 +00:00)
95a1345b92 fix: return 2D scale tensor from _assemble_scales_cudagraph_safe

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:21:29 +00:00)
533089c9d2 fix: token_indices slice bug + torch.zeros for float4/float8 dtypes

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:13:15 +00:00)
54c470e535 fix: use float16->float8 cast for rand_sf (torch.rand doesn't support float8)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:08:46 +00:00)
f2de95c526 fix: use randint for float4 dummy weights in cudagraph test

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:05:15 +00:00)
f66d4b69a4 GPU-only scale assembly + cudagraph test harness

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 18:01:54 +00:00)
5121074782 cudagraph-safe CuTeDSL MoE: searchsorted-based scale assembly

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 17:55:34 +00:00)
ab126b0c0d fix: revert to .item() based scale assembly (fixes index OOB)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 16:36:21 +00:00)
7594968482 WIP: cudagraph-compatible CuTeDSL MoE runner

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 10:11:37 +00:00)
f0c1be3ced fix: remove broken hc_head warmup (wrong tensor shape)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 09:17:40 +00:00)
c803180706 fix: handle freed weight lists in _check_runtime_supported and _run_mega_moe

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 08:54:54 +00:00)
cdd813cf7e fix: free per-expert weight lists after stacking in CuTeDSL runner

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 08:39:17 +00:00)
99c11c218d fucken a

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 08:28:41 +00:00)
906ee80a42 Add tilelang kernel warmup in load_weights

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-05-16 08:23:32 +00:00)
a51ef3d2cf fucken a