biondizzle
0 Followers · 0 Following
Joined on 2025-12-10
Repositories: 25
Public Activity
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 22:05:23 +00:00)
2a6f9a10b1 lm_head: fall back to BF16 F.linear for stability

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 21:59:25 +00:00)
9bad30c777 Add logits validation debug before topk sampling

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 21:34:04 +00:00)
9fec7d609e Fix gsa_buffer shape mismatch for MoE (M>1 rows)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 21:26:55 +00:00)
cacf64232e CRITICAL FIX: fused_amax_quantize cross-CTA race condition

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 21:18:44 +00:00)
e3412cf913 P5: In-place RoPE — no x.clone(), no empty_like allocation

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 21:18:31 +00:00)
00746c2d2b Fix module path: move loader code from __init__.py to loader.py

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 21:11:03 +00:00)
230d28e562 Fix KVCache constructor call — device as keyword arg, not positional

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 21:08:02 +00:00)
c9b92cd840 Remove P1 from audit — multi-GPU layout is correct for the reference script

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 21:05:07 +00:00)
c8faf20a99 P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 21:02:08 +00:00)
e0607c9e2f P0: Add fused_amax_quantize.cu kernel + CUDA module loader with compile-once caching

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 20:55:46 +00:00)
d279965db4 Update PERFORMANCE_AUDIT.md: remove invalidated items, add WIP status

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 20:49:57 +00:00)
60715f89bc Fix CUDA kernel compilation: use c10::cuda::getCurrentCUDAStream

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 20:42:54 +00:00)
2dc5b4ec19 Fix sampler kernel stack overflow: reduce MAX_K from 256 to 128

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 20:40:31 +00:00)
360f76b970 Performance audit fixes: eliminate CPU-GPU syncs

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 20:30:03 +00:00)
4f698baa5d Production fused CUDA sampler + decode loop optimizations

biondizzle pushed tag v-e2e-nvfp4-all-projections to biondizzle/nvfp4-megamoe-kernel (2026-06-01 20:21:14 +00:00)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 19:51:24 +00:00)
2830a3ee7c Fix lm_head NVFP4: transpose weight and scales to match Nvfp4Linear checkpoint layout

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 19:41:23 +00:00)
16b72b9581 PERF: Eliminate double quantization for o_a_proj + NVFP4 lm_head

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 17:27:02 +00:00)
9a3bb43f20 Set default max-tokens=512 for reasoning model

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel (2026-06-01 17:25:05 +00:00)
db6e3545da Fix: add _use_runtime_gsa=True to router gate GEMM in single_shot