nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Fork 0

Commit Graph

Select branches

Hide Pull Requests

master

pre-b1

pure-nvfp4

v-b1-b2-done-20260603

v-c1-c2-c3-20260602

v-e2e-nvfp4-all-projections

v-e2e-paris-32tok-20260601-0549

v-indexer-fix-20260602

v-nvfp4-fused-router-rewrite-20260601-0715

v-nvfp4-router-oa-20260601-0610

v-official-encoding-path

v-p0p1p2p3-fused-swiglu-cuda-rope-20260602

v-perf-part1-p2-reverted-20260602

v-post-indexer-c-fixes-20260602

v-precision-floor-fix-20260603

v-single-shot-paris-20260601-0539

v-working-e2e-20260601-0515

v0.1-e2e-working

7ef6402936 KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys biondizzle 2026-06-02 10:00:50 +00:00
40dd56eac2 KV-1: Fix shared memory corruption in block_reduce biondizzle 2026-06-02 09:49:12 +00:00
0fefadedd4 KV-1: Fix FP8 round-trip mismatch in fused quantize biondizzle 2026-06-02 09:46:32 +00:00
d74ff5768d KV diag test biondizzle 2026-06-02 09:43:45 +00:00
c2664281c3 KV-1/KV-2: Fix quantize kernel — each thread handles 16-elem blocks independently biondizzle 2026-06-02 09:41:15 +00:00
f23320b5b2 KV-1/KV-2: Fused compress+NVFP4 quantize kernels + dequant biondizzle 2026-06-02 09:37:53 +00:00
107d62dd76 docs: update PERFORMANCE_AUDIT.md — Part 1 (P0-P3) landed, Part 2 KV cache next biondizzle 2026-06-02 09:30:06 +00:00
3c295f225a P3: integrate CUDA RoPE kernel into single_shot — 732 launches/token eliminated v-p0p1p2p3-fused-swiglu-cuda-rope-20260602 biondizzle 2026-06-02 09:08:07 +00:00
54a9b6961b fix: rope_cuda path — kernels/cuda not ops/cuda biondizzle 2026-06-02 09:06:36 +00:00
2bbbead984 P3: CUDA RoPE kernel — single launch per call (vs 5-6 PyTorch ops) biondizzle 2026-06-02 09:05:22 +00:00
851ec9b4d5 P3 WIP: fused RMSNorm + quantize kernel skeleton (not yet integrated) biondizzle 2026-06-02 09:02:52 +00:00
b13c1057f5 test: verify GEMM shape with production weight format biondizzle 2026-06-02 08:43:40 +00:00
40fb49d670 test: verify GEMM output shape biondizzle 2026-06-02 08:41:22 +00:00
f01d3f3eac wip: SE fused SwiGLU deinterleave fix biondizzle 2026-06-02 08:41:00 +00:00
1726cb64a9 fix: interleave_l1_weights granularity_bf16 (not granularity) in SE biondizzle 2026-06-02 08:29:03 +00:00
553275d810 feat: P1 — add eager warmup_fused_swiglu_compilation for SharedExpert (1-group) biondizzle 2026-06-02 08:25:52 +00:00
5ed4c86137 fix: expert_offsets for 4-expert fused SwiGLU test biondizzle 2026-06-02 08:24:32 +00:00
53362d2579 test: isolate fused SwiGLU — test no-clamp first biondizzle 2026-06-02 08:23:28 +00:00
ae4506d722 fix: w_gs is scalar not iterable biondizzle 2026-06-02 08:22:29 +00:00
b0c71b947e test: fused SwiGLU — smoke test + correctness comparison with graceful degradation biondizzle 2026-06-02 08:21:33 +00:00
2cfca36095 fix: compute correct gs from data in fused SwiGLU test biondizzle 2026-06-02 08:20:27 +00:00
4a05a40cf0 fix: fused SwiGLU test — proper weight quant + 128-token alignment biondizzle 2026-06-02 08:19:31 +00:00
fa769b6214 fix: pad activation as uint8 view for float4 dtype biondizzle 2026-06-02 08:18:26 +00:00
024be1a60b fix: test weight quantization dtype for fused SwiGLU test biondizzle 2026-06-02 08:17:35 +00:00
19afa52e80 fix: use cute.where() directly for clamp in fused SwiGLU biondizzle 2026-06-02 08:16:41 +00:00
5c746bbdf2 fix: TensorSSA-compatible clamp in fused SwiGLU kernel biondizzle 2026-06-02 08:15:46 +00:00
3a30f35c68 fix: cute.math.fmin/fmax → cute.arch.fmin/fmax in fused SwiGLU kernel biondizzle 2026-06-02 08:12:55 +00:00
fca72427ea fix: add fp4_out/sf_out/l2_global_scale params to fused_swiglu kernel() signature biondizzle 2026-06-02 08:11:18 +00:00
55ea109cca test: fused SwiGLU kernel compilation + correctness (P0/P1 gate) biondizzle 2026-06-02 08:09:57 +00:00
7904cf05c4 Add set_fused_swiglu() method to Nvfp4MoE biondizzle 2026-06-02 07:59:57 +00:00
d8e17d70c1 P0+P1+P2: Enable fused SwiGLU (MoE+SE), fix SE _run_l1_fused, remove per-call gsa fill_ biondizzle 2026-06-02 07:57:39 +00:00
61d5e7ba53 revert: P2 gsa fill elimination — revert to proven path for e2e stability v-perf-part1-p2-reverted-20260602 biondizzle 2026-06-02 07:32:10 +00:00
790f8c350a perf: P2 landed (gsa fill elimination). P0/P1 fused SwiGLU disabled — CuTeDSL kernel arg-binding bug. biondizzle 2026-06-02 07:16:08 +00:00
040b2eb6e7 perf: P0/P1/P2 — fused SwiGLU for MoE+SE, eliminate per-call gsa fill biondizzle 2026-06-02 06:59:25 +00:00
e9506e0c20 perf: C1/C2/C3 — per-layer max_comp, pre-allocated gather_buf, SWA views v-post-indexer-c-fixes-20260602 v-c1-c2-c3-20260602 biondizzle 2026-06-02 06:18:06 +00:00
617da29a5b fix: assert topk_idx is not None in CSA layers — no silent fallback to SWA-only biondizzle 2026-06-02 06:14:23 +00:00
5b4c496512 fix: three indexer bugs — weight path, comp_idx_buf width, scoring einsum v-indexer-fix-20260602 biondizzle 2026-06-02 05:53:10 +00:00
0fbf28dd54 doc: INDEXER_PROBE_RESULTS_20260602 — compressed key width is ihd=128, not n_ih*ihd=8192 biondizzle 2026-06-02 05:51:24 +00:00
8162c586c3 probe: fix comp_idx_buf width to ihd=128 so indexer probe can complete biondizzle 2026-06-02 05:38:44 +00:00
5be31d8582 fix: indexer compressor weight path — weights are at *.indexer.kv_proj not *.indexer.compressor.kv_proj biondizzle 2026-06-02 05:25:44 +00:00
fdfcca918c probe: verify indexer compressor load state biondizzle 2026-06-02 05:17:00 +00:00
fb0ed87626 probe: add indexer compressor early-return and buffering diagnostics biondizzle 2026-06-02 05:06:18 +00:00
06c92f208f INDEXER PROBE: instrumentation prints for compressed key width investigation biondizzle 2026-06-02 04:44:47 +00:00
510eaf4a26 probe: HF indexer architecture from B200 biondizzle 2026-06-02 04:38:24 +00:00
938e9079ce probe: indexer and compressor weight shapes from checkpoint biondizzle 2026-06-02 04:36:35 +00:00
9254cb0b0d test: NVFP4 runtime gsa accuracy vs PyTorch reference biondizzle 2026-06-02 04:31:18 +00:00
7e3fb5f4d0 fix: add missing import for quantize_nvfp4_gpu in linear.py fixed-gsa path biondizzle 2026-06-02 04:28:29 +00:00
f52eedbdce Add production-value tests: ALL tests use Pro config (61L, HD=512, 384 experts, HCA=128, 1M context) biondizzle 2026-06-02 04:10:39 +00:00
668a42e71a debug: print mhc_sinkhorn CUDA kernel compile errors biondizzle 2026-06-02 04:02:34 +00:00
ca53bdb8e1 perf: skip MQA GQA expansion in FMHA (stride=0, no 128x K/V copy) biondizzle 2026-06-02 03:54:03 +00:00
7b82d31330 perf: fused mHC Sinkhorn CUDA kernel (1 launch vs 38) biondizzle 2026-06-02 03:50:57 +00:00
f0dec9f6bd profile: fine-grained attention component timing biondizzle 2026-06-02 03:08:34 +00:00
7114c48575 fix: parenthesize profile_detail condition biondizzle 2026-06-02 02:56:13 +00:00
4734e894c7 profile: add per-layer attn vs ffn timing with CUDA sync biondizzle 2026-06-02 02:46:35 +00:00
4017ef2f16 fix: accurate profile sync + remove paris_tids 129K iteration biondizzle 2026-06-01 23:55:26 +00:00
73ae9393da FIX: RoPE cache 8192→65536 (original_max_position_embeddings), KVCache max_comp 32768→65536 biondizzle 2026-06-01 23:18:37 +00:00
36f9782bad Add thinking/Paris token logit check on step 0 for quality debugging biondizzle 2026-06-01 23:14:24 +00:00
ef7e0d63bb Add --warmup-gsa flag: fix attention/router gsa after first decode step to eliminate amax kernel launches biondizzle 2026-06-01 23:04:44 +00:00
008e59eb90 Add --profile flag: per-component GPU timing with CUDA sync (embed+layers, lm_head, sampling) biondizzle 2026-06-01 23:03:46 +00:00
106f42c93c auto: pre-test commit biondizzle 2026-06-01 23:01:34 +00:00
e53645654d Reduce hot-path .item() syncs: gate li>=58 diagnostics behind VERBOSE>=2, topk on float biondizzle 2026-06-01 22:33:03 +00:00
6f4bbc997a Add sync after sampler for step<3 to catch async CUDA errors early biondizzle 2026-06-01 22:32:40 +00:00
5493a8727e P7: compressor early return + decode buffering (skip GEMMs when n_complete=0); sampler SMEM fix (LK=24 fits 48KB default); topk on float not bf16 biondizzle 2026-06-01 22:29:56 +00:00
828ba73dff Update PERFORMANCE_AUDIT.md: P0 complete, P2/P3/P5 done biondizzle 2026-06-01 22:21:31 +00:00
583ad6cfe6 P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs biondizzle 2026-06-01 22:21:12 +00:00
8767c263ab Add cuda.synchronize + better logits validation after lm_head biondizzle 2026-06-01 22:06:41 +00:00
2a6f9a10b1 lm_head: fall back to BF16 F.linear for stability biondizzle 2026-06-01 22:05:22 +00:00
9bad30c777 Add logits validation debug before topk sampling biondizzle 2026-06-01 21:59:23 +00:00
9fec7d609e Fix gsa_buffer shape mismatch for MoE (M>1 rows) biondizzle 2026-06-01 21:33:59 +00:00
cacf64232e CRITICAL FIX: fused_amax_quantize cross-CTA race condition biondizzle 2026-06-01 21:26:51 +00:00
e3412cf913 P5: In-place RoPE — no x.clone(), no empty_like allocation biondizzle 2026-06-01 21:18:41 +00:00
00746c2d2b Fix module path: move loader code from __init__.py to loader.py biondizzle 2026-06-01 21:18:29 +00:00
230d28e562 Fix KVCache constructor call — device as keyword arg, not positional biondizzle 2026-06-01 21:11:01 +00:00
c9b92cd840 Remove P1 from audit — multi-GPU layout is correct for the reference script biondizzle 2026-06-01 21:07:59 +00:00
c8faf20a99 P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path biondizzle 2026-06-01 21:05:03 +00:00
e0607c9e2f P0: Add fused_amax_quantize.cu kernel + CUDA module loader with compile-once caching biondizzle 2026-06-01 21:02:03 +00:00
d279965db4 Update PERFORMANCE_AUDIT.md: remove invalidated items, add WIP status biondizzle 2026-06-01 20:55:44 +00:00
60715f89bc Fix CUDA kernel compilation: use c10::cuda::getCurrentCUDAStream biondizzle 2026-06-01 20:49:55 +00:00
2dc5b4ec19 Fix sampler kernel stack overflow: reduce MAX_K from 256 to 128 biondizzle 2026-06-01 20:42:53 +00:00
360f76b970 Performance audit fixes: eliminate CPU-GPU syncs biondizzle 2026-06-01 20:40:19 +00:00
4f698baa5d Production fused CUDA sampler + decode loop optimizations biondizzle 2026-06-01 20:29:57 +00:00
2830a3ee7c Fix lm_head NVFP4: transpose weight and scales to match Nvfp4Linear checkpoint layout v-e2e-nvfp4-all-projections biondizzle 2026-06-01 19:51:21 +00:00
16b72b9581 PERF: Eliminate double quantization for o_a_proj + NVFP4 lm_head biondizzle 2026-06-01 19:41:21 +00:00
9a3bb43f20 Set default max-tokens=512 for reasoning model biondizzle 2026-06-01 17:27:01 +00:00
db6e3545da Fix: add _use_runtime_gsa=True to router gate GEMM in single_shot biondizzle 2026-06-01 17:25:04 +00:00
9d57b0453b auto: pre-test commit biondizzle 2026-06-01 15:04:46 +00:00
1a6d9ee29b Reset to greedy decoding (temperature=0) biondizzle 2026-06-01 15:04:02 +00:00
038fe81c68 Fix MoE non-fused L2 runtime gsa + update test harness for extra args biondizzle 2026-06-01 15:03:54 +00:00
a48d6e14ae Default temperature=0.7 with rep penalty biondizzle 2026-06-01 14:55:43 +00:00
1d64b863ca Add temperature sampling + repetition penalty to fix degenerate repetition biondizzle 2026-06-01 14:54:49 +00:00
6cca16f97a Set max-tokens=128 default, clean up for final verification biondizzle 2026-06-01 14:43:48 +00:00
a0e758ec3b Set default max-tokens=30 for faster iteration biondizzle 2026-06-01 14:33:55 +00:00
2b1fca6dae CRITICAL FIX: runtime activation global scale to prevent E4M3 overflow biondizzle 2026-06-01 14:21:16 +00:00
3b2714410f Add NVFP4 linear accuracy test: prod vs ref with all-ones input biondizzle 2026-06-01 14:15:27 +00:00
3e47d5f20a Add prod vs ref GEMM comparison test + gate logits diagnostic biondizzle 2026-06-01 14:11:37 +00:00
ad143afe37 Add L58-60 diagnostic: mHC A/B/C, MoE routed/shared, topk biondizzle 2026-06-01 13:55:55 +00:00
7a05d3d3af NVFP4 router gate: use Nvfp4Linear for both checkpoint and quantized paths biondizzle 2026-06-01 11:25:50 +00:00
e5dbe1ed22 Switch router to Nvfp4Linear production GEMM (custom CuTeDSL kernel crashes MLIR) biondizzle 2026-06-01 11:17:54 +00:00
a4324781c3 Fix: properly remove sqrt(softplus) from CuTeDSL kernel biondizzle 2026-06-01 11:14:04 +00:00
6efe90cd85 Move sqrt(softplus) out of CuTeDSL kernel into Python biondizzle 2026-06-01 11:12:41 +00:00

1 2 3 4 5 ...