biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 22:05:23 +00:00

2a6f9a10b1 lm_head: fall back to BF16 F.linear for stability

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 21:59:25 +00:00

9bad30c777 Add logits validation debug before topk sampling

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 21:34:04 +00:00

9fec7d609e Fix gsa_buffer shape mismatch for MoE (M>1 rows)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 21:26:55 +00:00

cacf64232e CRITICAL FIX: fused_amax_quantize cross-CTA race condition

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 21:18:44 +00:00

e3412cf913 P5: In-place RoPE — no x.clone(), no empty_like allocation

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 21:18:31 +00:00

00746c2d2b Fix module path: move loader code from __init__.py to loader.py

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 21:11:03 +00:00

230d28e562 Fix KVCache constructor call — device as keyword arg, not positional

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 21:08:02 +00:00

c9b92cd840 Remove P1 from audit — multi-GPU layout is correct for the reference script

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 21:05:07 +00:00

c8faf20a99 P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 21:02:08 +00:00

e0607c9e2f P0: Add fused_amax_quantize.cu kernel + CUDA module loader with compile-once caching

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 20:55:46 +00:00

d279965db4 Update PERFORMANCE_AUDIT.md: remove invalidated items, add WIP status

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 20:49:57 +00:00

60715f89bc Fix CUDA kernel compilation: use c10::cuda::getCurrentCUDAStream

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 20:42:54 +00:00

2dc5b4ec19 Fix sampler kernel stack overflow: reduce MAX_K from 256 to 128

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 20:40:31 +00:00

360f76b970 Performance audit fixes: eliminate CPU-GPU syncs

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 20:30:03 +00:00

4f698baa5d Production fused CUDA sampler + decode loop optimizations

biondizzle pushed tag v-e2e-nvfp4-all-projections to biondizzle/nvfp4-megamoe-kernel

2026-06-01 20:21:14 +00:00

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 19:51:24 +00:00

2830a3ee7c Fix lm_head NVFP4: transpose weight and scales to match Nvfp4Linear checkpoint layout

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 19:41:23 +00:00

16b72b9581 PERF: Eliminate double quantization for o_a_proj + NVFP4 lm_head

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 17:27:02 +00:00

9a3bb43f20 Set default max-tokens=512 for reasoning model

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 17:25:05 +00:00

db6e3545da Fix: add _use_runtime_gsa=True to router gate GEMM in single_shot