biondizzle
  • Joined on 2025-12-10
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 04:28:32 +00:00
7e3fb5f4d0 fix: add missing import for quantize_nvfp4_gpu in linear.py fixed-gsa path
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 04:10:42 +00:00
f52eedbdce Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 04:02:36 +00:00
668a42e71a debug: print mhc_sinkhorn CUDA kernel compile errors
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 03:54:05 +00:00
ca53bdb8e1 perf: skip MQA GQA expansion in FMHA (stride=0, no 128x K/V copy)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 03:51:13 +00:00
7b82d31330 perf: fused mHC Sinkhorn CUDA kernel (1 launch vs 38)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 03:08:37 +00:00
f0dec9f6bd profile: fine-grained attention component timing
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 02:56:24 +00:00
7114c48575 fix: parenthesize profile_detail condition
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-02 02:46:41 +00:00
4734e894c7 profile: add per-layer attn vs ffn timing with CUDA sync
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 23:55:28 +00:00
4017ef2f16 fix: accurate profile sync + remove paris_tids 129K iteration
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 23:18:44 +00:00
73ae9393da FIX: RoPE cache 8192→65536 (original_max_position_embeddings), KVCache max_comp 32768→65536
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 23:14:26 +00:00
36f9782bad Add thinking/Paris token logit check on step 0 for quality debugging
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 23:04:47 +00:00
ef7e0d63bb Add --warmup-gsa flag: fix attention/router gsa after first decode step to eliminate amax kernel launches
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 23:03:48 +00:00
008e59eb90 Add --profile flag: per-component GPU timing with CUDA sync (embed+layers, lm_head, sampling)
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 23:01:37 +00:00
106f42c93c auto: pre-test commit
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 22:33:05 +00:00
e53645654d Reduce hot-path .item() syncs: gate li>=58 diagnostics behind VERBOSE>=2, topk on float
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 22:32:42 +00:00
6f4bbc997a Add sync after sampler for step<3 to catch async CUDA errors early
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 22:29:58 +00:00
5493a8727e P7: compressor early return + decode buffering (skip GEMMs when n_complete=0); sampler SMEM fix (LK=24 fits 48KB default); topk on float not bf16
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 22:21:33 +00:00
828ba73dff Update PERFORMANCE_AUDIT.md: P0 complete, P2/P3/P5 done
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 22:21:16 +00:00
583ad6cfe6 P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs
biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel 2026-06-01 22:06:43 +00:00
8767c263ab Add cuda.synchronize + better logits validation after lm_head
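
Note on 7b82d31330 (fused mHC Sinkhorn kernel): the feed does not show the kernel itself, so the following is only a rough sketch of what a Sinkhorn normalization computes, assuming the standard Sinkhorn-Knopp scheme of alternating row and column normalization. The function name, iteration count, and epsilon are illustrative assumptions, not taken from the repository. Run naively on a GPU, each of these small passes would be its own kernel launch, which is plausibly the overhead the "1 launch vs 38" fusion removes.

```python
def sinkhorn(matrix, iterations=20, eps=1e-12):
    """Alternately normalize rows and columns of a positive matrix.

    Hypothetical reference sketch in pure Python; the real kernel in the
    repository is CUDA and may differ in every detail.
    """
    m = [row[:] for row in matrix]  # work on a copy, leave the input intact
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(iterations):
        # Row pass: scale each row so it sums to 1.
        for i in range(n_rows):
            s = sum(m[i]) + eps
            m[i] = [v / s for v in m[i]]
        # Column pass: scale each column so it sums to 1.
        for j in range(n_cols):
            s = sum(m[i][j] for i in range(n_rows)) + eps
            for i in range(n_rows):
                m[i][j] /= s
    return m


if __name__ == "__main__":
    start = [
        [1.0, 2.0, 0.5, 3.0],
        [0.2, 1.5, 2.5, 1.0],
        [4.0, 0.3, 1.0, 0.7],
        [1.2, 2.2, 0.9, 0.4],
    ]
    result = sinkhorn(start, iterations=100)
    for row in result:
        print(["%.4f" % v for v in row], "row sum = %.6f" % sum(row))
```

For a square matrix with strictly positive entries, the iteration converges toward a doubly stochastic matrix (all row and column sums equal to 1).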