biondizzle
0 Followers
·
0 Following
Joined on 2025-12-10
Repositories
25
Public Activity
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 23:53:17 +00:00
1d6610c46d
CUDA graph A/B split: eager-break-at-attention architecture
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 23:41:43 +00:00
800e974d20
Update CUDA_GRAPH_SYNC_INVENTORY.md with session 2 progress
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 23:17:54 +00:00
a468f72a0e
CUDA graph: Pre-allocate L1 GEMM output buffers in MoE and SharedExpert
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 22:56:22 +00:00
56b816a54f
CUDA graph: Use per-GPU position/token buffers for graph capture
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 22:26:51 +00:00
f57de06eb5
Fix grouped_linear GEMM output buffer shape and extraction
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 22:04:21 +00:00
92225b07e7
CUDA graph: Simplify to single-graph-per-layer capture (revert A/B split)
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 22:02:03 +00:00
b32713c302
grouped_linear: Pre-allocate output buffer for grouped GEMM (CUDA graph capture)
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 21:45:18 +00:00
676fad064f
Fix: Add out= parameter to run_fused_swiglu_grouped_gemm signature
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 21:30:26 +00:00
188ecae47f
CUDA graph: Eliminate per-step allocations in graph-captured code paths
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 21:09:15 +00:00
91c370360a
Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v3)
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 20:54:21 +00:00
5c94dbbc37
Fix CuTeDSL from_dlpack device mismatch in CUDA graph capture (v2)
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 20:34:27 +00:00
87b6c9932b
Fix CuTeDSL from_dlpack device mismatch inside CUDA graph capture
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 19:49:56 +00:00
2661cebe9a
Fix warmup_gsa: handle multi-element _gsa_buf (Nvfp4GroupedLinear per-group gsa)
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 19:24:30 +00:00
486f74d900
CUDA graph: Implement eager-break-at-attention decoder with sub-graph A/B split
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 19:15:29 +00:00
5ea3aa3406
Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 18:08:28 +00:00
80bb27f5bf
CUDA graph: Fix gsa broadcast — contiguous for prefill, reshape for decode
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 17:53:49 +00:00
518a1d3f95
CUDA graph: Fix MoE scatter_add_ index dtype + fix second bincount
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 17:39:21 +00:00
f13a81d48b
CUDA graph: Fix per-call allocations in grouped_linear and quantize
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 17:37:06 +00:00
84655d066a
CUDA graph: Fix MoE bincount and per-call allocations (Hazard #4)
biondizzle
pushed to
master
at
biondizzle/nvfp4-megamoe-kernel
2026-06-03 17:20:45 +00:00
df05289d6f
CUDA graph: Fix remaining sync violations from B200 detector run 2