biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 22:05:23 +00:00
2a6f9a10b1 lm_head: fall back to BF16 F.linear for stability
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 21:59:25 +00:00
9bad30c777 Add logits validation debug before topk sampling
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 21:34:04 +00:00
9fec7d609e Fix gsa_buffer shape mismatch for MoE (M>1 rows)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 21:26:55 +00:00
cacf64232e CRITICAL FIX: fused_amax_quantize cross-CTA race condition
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 21:18:44 +00:00
e3412cf913 P5: In-place RoPE — no x.clone(), no empty_like allocation
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 21:18:31 +00:00
00746c2d2b Fix module path: move loader code from __init__.py to loader.py
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 21:11:03 +00:00
230d28e562 Fix KVCache constructor call — device as keyword arg, not positional
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 21:08:02 +00:00
c9b92cd840 Remove P1 from audit — multi-GPU layout is correct for the reference script
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 21:05:07 +00:00
c8faf20a99 P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 21:02:08 +00:00
e0607c9e2f P0: Add fused_amax_quantize.cu kernel + CUDA module loader with compile-once caching
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 20:55:46 +00:00
d279965db4 Update PERFORMANCE_AUDIT.md: remove invalidated items, add WIP status
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 20:49:57 +00:00
60715f89bc Fix CUDA kernel compilation: use c10::cuda::getCurrentCUDAStream
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 20:42:54 +00:00
2dc5b4ec19 Fix sampler kernel stack overflow: reduce MAX_K from 256 to 128
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 20:40:31 +00:00
360f76b970 Performance audit fixes: eliminate CPU-GPU syncs
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 20:30:03 +00:00
4f698baa5d Production fused CUDA sampler + decode loop optimizations
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 19:51:24 +00:00
2830a3ee7c Fix lm_head NVFP4: transpose weight and scales to match Nvfp4Linear checkpoint layout
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 19:41:23 +00:00
16b72b9581 PERF: Eliminate double quantization for o_a_proj + NVFP4 lm_head
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 17:27:02 +00:00
9a3bb43f20 Set default max-tokens=512 for reasoning model
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 17:25:05 +00:00
db6e3545da Fix: add _use_runtime_gsa=True to router gate GEMM in single_shot