biondizzle

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 04:28:32 +00:00

7e3fb5f4d0 fix: add missing import for quantize_nvfp4_gpu in linear.py fixed-gsa path

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 04:10:42 +00:00

f52eedbdce Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 04:02:36 +00:00

668a42e71a debug: print mhc_sinkhorn CUDA kernel compile errors

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 03:54:05 +00:00

ca53bdb8e1 perf: skip MQA GQA expansion in FMHA (stride=0, no 128x K/V copy)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 03:51:13 +00:00

7b82d31330 perf: fused mHC Sinkhorn CUDA kernel (1 launch vs 38)
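
For readers unfamiliar with the operation named in this commit, here is a minimal pure-Python sketch of Sinkhorn normalization: alternating row and column rescaling of a non-negative matrix until it is approximately doubly stochastic. The function name `sinkhorn`, the iteration count, and the `eps` guard are illustrative assumptions, not taken from the repository. The commit does not say how the 38 launches were distributed, so this shows only the math that a fused kernel would presumably keep inside a single launch.

```python
def sinkhorn(matrix, iters=20, eps=1e-9):
    """Alternately normalize rows and columns of a non-negative square matrix.

    Hypothetical reference version; the repository's kernel may differ in
    iteration count, numerical guards, and data layout.
    """
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    for _ in range(iters):
        # Row step: scale each row so it sums to (approximately) 1.
        for i in range(n):
            s = sum(m[i]) + eps
            m[i] = [v / s for v in m[i]]
        # Column step: scale each column so it sums to (approximately) 1.
        for j in range(n):
            s = sum(m[i][j] for i in range(n)) + eps
            for i in range(n):
                m[i][j] /= s
    return m


balanced = sinkhorn([[1.0, 2.0], [3.0, 4.0]])
row_sums = [sum(row) for row in balanced]
col_sums = [sum(balanced[i][j] for i in range(2)) for j in range(2)]
```

In a naive GPU implementation each row or column step is typically its own reduction plus elementwise kernel, which is one plausible source of a high launch count; keeping the small matrix in registers or shared memory across iterations avoids that.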

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 03:08:37 +00:00

f0dec9f6bd profile: fine-grained attention component timing

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 02:56:24 +00:00

7114c48575 fix: parenthesize profile_detail condition

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-02 02:46:41 +00:00

4734e894c7 profile: add per-layer attn vs ffn timing with CUDA sync

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 23:55:28 +00:00

4017ef2f16 fix: accurate profile sync + remove paris_tids 129K iteration

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 23:18:44 +00:00

73ae9393da FIX: RoPE cache 8192→65536 (original_max_position_embeddings), KVCache max_comp 32768→65536
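
As a rough illustration of why the cache length in this commit matters, the sketch below precomputes a rotary position embedding (RoPE) cos/sin table for a fixed number of positions and shows that any position past the table's end cannot be served. The names `build_rope_cache` and `rope_lookup`, the base of 10000, and the tiny sizes are assumptions made for the example and do not reflect the repository's code.

```python
import math


def build_rope_cache(max_positions, head_dim, base=10000.0):
    """Precompute cos/sin tables of shape [max_positions][head_dim // 2].

    Hypothetical helper: standard RoPE inverse frequencies, no scaling.
    """
    half = head_dim // 2
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in range(max_positions)]
    sin = [[math.sin(p * f) for f in inv_freq] for p in range(max_positions)]
    return cos, sin


def rope_lookup(cache, position):
    """Return the (cos, sin) rows for one position, failing loudly if uncached."""
    cos, sin = cache
    if position < 0 or position >= len(cos):
        raise IndexError(
            f"position {position} outside RoPE cache of length {len(cos)}"
        )
    return cos[position], sin[position]


# Tiny sizes keep the example fast; the commit raises the real length to 65536.
cache = build_rope_cache(max_positions=8, head_dim=4)
cos0, sin0 = rope_lookup(cache, 0)  # position 0 rotates by zero
```

With a table of length 8, `rope_lookup(cache, 8)` raises `IndexError`. On a GPU the equivalent out-of-range read may not raise at all, so sizing the cache to the longest position the model will see is the safer choice.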

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 23:14:26 +00:00

36f9782bad Add thinking/Paris token logit check on step 0 for quality debugging

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 23:04:47 +00:00

ef7e0d63bb Add --warmup-gsa flag: fix attention/router gsa after first decode step to eliminate amax kernel launches

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 23:03:48 +00:00

008e59eb90 Add --profile flag: per-component GPU timing with CUDA sync (embed+layers, lm_head, sampling)

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 23:01:37 +00:00

106f42c93c auto: pre-test commit

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 22:33:05 +00:00

e53645654d Reduce hot-path .item() syncs: gate li>=58 diagnostics behind VERBOSE>=2, topk on float

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 22:32:42 +00:00

6f4bbc997a Add sync after sampler for step<3 to catch async CUDA errors early

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 22:29:58 +00:00

5493a8727e P7: compressor early return + decode buffering (skip GEMMs when n_complete=0); sampler SMEM fix (LK=24 fits 48KB default); topk on float not bf16

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 22:21:33 +00:00

828ba73dff Update PERFORMANCE_AUDIT.md: P0 complete, P2/P3/P5 done

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 22:21:16 +00:00

583ad6cfe6 P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs

biondizzle pushed to master at biondizzle/nvfp4-megamoe-kernel

2026-06-01 22:06:43 +00:00

8767c263ab Add cuda.synchronize + better logits validation after lm_head