biondizzle
0 Followers · 0 Following
Joined on 2025-12-10
Repositories: 25
Public Activity
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-02 04:28:32 +00:00
7e3fb5f4d0 fix: add missing import for quantize_nvfp4_gpu in linear.py fixed-gsa path
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-02 04:10:42 +00:00
f52eedbdce Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-02 04:02:36 +00:00
668a42e71a debug: print mhc_sinkhorn CUDA kernel compile errors
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-02 03:54:05 +00:00
ca53bdb8e1 perf: skip MQA GQA expansion in FMHA (stride=0, no 128x K/V copy)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-02 03:51:13 +00:00
7b82d31330 perf: fused mHC Sinkhorn CUDA kernel (1 launch vs 38)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-02 03:08:37 +00:00
f0dec9f6bd profile: fine-grained attention component timing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-02 02:56:24 +00:00
7114c48575 fix: parenthesize profile_detail condition
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-02 02:46:41 +00:00
4734e894c7 profile: add per-layer attn vs ffn timing with CUDA sync
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 23:55:28 +00:00
4017ef2f16 fix: accurate profile sync + remove paris_tids 129K iteration
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 23:18:44 +00:00
73ae9393da FIX: RoPE cache 8192→65536 (original_max_position_embeddings), KVCache max_comp 32768→65536
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 23:14:26 +00:00
36f9782bad Add thinking/Paris token logit check on step 0 for quality debugging
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 23:04:47 +00:00
ef7e0d63bb Add --warmup-gsa flag: fix attention/router gsa after first decode step to eliminate amax kernel launches
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 23:03:48 +00:00
008e59eb90 Add --profile flag: per-component GPU timing with CUDA sync (embed+layers, lm_head, sampling)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 23:01:37 +00:00
106f42c93c auto: pre-test commit
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 22:33:05 +00:00
e53645654d Reduce hot-path .item() syncs: gate li>=58 diagnostics behind VERBOSE>=2, topk on float
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 22:32:42 +00:00
6f4bbc997a Add sync after sampler for step<3 to catch async CUDA errors early
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 22:29:58 +00:00
5493a8727e P7: compressor early return + decode buffering (skip GEMMs when n_complete=0); sampler SMEM fix (LK=24 fits 48KB default); topk on float not bf16
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 22:21:33 +00:00
828ba73dff Update PERFORMANCE_AUDIT.md: P0 complete, P2/P3/P5 done
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 22:21:16 +00:00
583ad6cfe6 P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel, 2026-06-01 22:06:43 +00:00
8767c263ab Add cuda.synchronize + better logits validation after lm_head