biondizzle
0 Followers · 0 Following
Joined on 2025-12-10
Repositories: 25
Public Activity
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 09:18:38 +00:00
55f1ddd502
Update GETTING_CUDAGRAPH_READY.md and CUDA_GRAPH_SYNC_INVENTORY.md with full current status, multi-GPU stream fix, and next steps
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 08:29:42 +00:00
ac213bdee8
Update docs: CUDA graph capture WORKING on all 8 GPUs, 0.28s/token (2x eager)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 08:18:21 +00:00
6650f06121
CRITICAL FIX: Use explicit per-device streams for CUDA graph capture/replay on multi-GPU — fixes zero-output bug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 08:14:55 +00:00
90ac38cde0
Add CUDA graph stream management test
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 08:13:20 +00:00
26042e3f01
Add minimal CUDA graph multi-GPU test to isolate zero-output bug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 08:02:39 +00:00
86275851d4
Add minimal CUDA graph test per GPU during capture to isolate multi-GPU graph issue
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 07:51:25 +00:00
2cbf7a43e9
Add sync after cross-GPU copy before graph replay; remove misleading zero-input verification
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 07:40:21 +00:00
2bb52c7cae
Add per-layer graph capture verification — replay immediately and check for zeros
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 07:29:36 +00:00
5a98cc6d90
Store pre-cached norm weights on self to prevent GC during graph replay — root cause of all-zeros replay bug
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 07:19:10 +00:00
dcb2495a5b
Add graph replay debug prints for first 3 steps/layers
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 07:18:51 +00:00
16b9a4def2
Fix CUDA graph replay: set device to cuda:0 before lm_head graph replay
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 07:01:56 +00:00
f259d63930
CRITICAL FIX: SE swizzled buffers were allocated then overwritten with None — graph capture would fall through to broken Python path
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-06 07:01:18 +00:00
32902d1036
CUDA graph capture: derive q_a_dim from config, pre-cache norm weights, add buffer verification, use direct dict access for routers/moe/se
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-04 08:09:46 +00:00
64f547058e
Fix graph replay: pass q_a from Graph A output to forward_attention
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-04 06:10:04 +00:00
26da6d33af
Fix graph replay: remove extra token_id arg from forward_attention call
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-04 05:58:27 +00:00
ae26f6b83c
Fix dense router BF16 dispatch: use torch.matmul instead of F.linear
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-04 05:50:21 +00:00
e46b615873
Fix dense router BF16 dispatch for CUDA graph capture
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-04 05:13:54 +00:00
b4a59d0940
Update CUDA graph docs with current status, A/B split, buffer fixes, remaining blockers
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-04 04:49:10 +00:00
ffa7842b58
Fix dense router: run GEMM in BF16, convert to FP32 only for activation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-04 04:32:02 +00:00
119e6d471e
Add safety check for swizzled buffers: fall through to Python path if None