nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel

Commit Graph

Branches

master

pre-b1

pure-nvfp4

v-b1-b2-done-20260603

v-c1-c2-c3-20260602

v-e2e-nvfp4-all-projections

v-e2e-paris-32tok-20260601-0549

v-indexer-fix-20260602

v-nvfp4-fused-router-rewrite-20260601-0715

v-nvfp4-router-oa-20260601-0610

v-official-encoding-path

v-p0p1p2p3-fused-swiglu-cuda-rope-20260602

v-perf-part1-p2-reverted-20260602

v-post-indexer-c-fixes-20260602

v-precision-floor-fix-20260603

v-single-shot-paris-20260601-0539

v-working-e2e-20260601-0515

v0.1-e2e-working

9dbfac9dfa PART A: verify kv_norm_w loaded correctly biondizzle 2026-06-03 07:03:39 +00:00
a682c6adf4 PART A: add raw compressor output diagnostic biondizzle 2026-06-03 06:56:56 +00:00
f2c1b3afd5 PART A: fix KV diagnostics — compute q_a before indexer, add Q_heads magnitude check biondizzle 2026-06-03 06:33:51 +00:00
86e59c16c5 PART A: add KV gather diagnostics at blowup layer biondizzle 2026-06-03 06:25:35 +00:00
262f844e2e PART A: add detailed blowup diagnostics — capture mHC intermediate values when |X| > 1e6 biondizzle 2026-06-03 06:10:33 +00:00
6459fbca9a fix: import forward_attention biondizzle 2026-06-03 05:41:33 +00:00
91dfac34d8 PART A: simplified to production-only diagnostics — track per-layer |X| during prefill and decode, detect blowup early biondizzle 2026-06-03 05:33:22 +00:00
d99503732d fix: add BF16 gate weight fallback for dense routers (missing from test) biondizzle 2026-06-03 05:22:47 +00:00
801bfc9a83 add router mode debug print biondizzle 2026-06-03 05:15:52 +00:00
b385ecc05e PART A: decode diagnostics test — production vs reference per-layer X comparison at decode step biondizzle 2026-06-03 05:06:40 +00:00
d518fcb82a test: correct sink bias reference — denominator-only, no V contribution biondizzle 2026-06-03 04:57:37 +00:00
9574a9dc2e test: add sink bias to reference SDPA in decode FMHA comparison biondizzle 2026-06-03 04:53:55 +00:00
9a9b347b2b test: add per-head magnitude ratio diagnostics to decode FMHA test biondizzle 2026-06-03 04:50:23 +00:00
f5fa20c581 fix: syntax error — missing closing paren in indexer.forward call biondizzle 2026-06-03 04:46:41 +00:00
693975ec92 fix: device mismatches in decode FMHA test — dec_pos must be on per-layer GPU biondizzle 2026-06-03 04:46:24 +00:00
e1d96c509d test: decode FMHA layer comparison — checks FMHA accuracy during decode step biondizzle 2026-06-03 04:39:12 +00:00
1ebe7f0dde Add PART_A_NEXT_SESSION.md: clues for decode degeneration debugging biondizzle 2026-06-03 04:34:28 +00:00
d8306be3f2 Fix PART A test: proper FP8 quantization and MQA reference biondizzle 2026-06-03 04:20:36 +00:00
4126909dfb Simplify PART A test: compressor + FMHA at production scale biondizzle 2026-06-03 04:18:13 +00:00
8c54cfa748 Fix KVCache init in PART A test biondizzle 2026-06-03 04:15:41 +00:00
04cf8ca848 Add PART A diagnostic tests: compressor + KV cache + FMHA at production scale biondizzle 2026-06-03 04:13:53 +00:00
75288bd12f Wire prefill FMHA into production.py and single_shot biondizzle 2026-06-03 03:49:57 +00:00
5417f65b08 CRITICAL FIX: Add T-dimension strides to prefill FMHA kernel biondizzle 2026-06-03 03:48:17 +00:00
dd1cbe1faa Fix smem size for prefill debug test biondizzle 2026-06-03 03:47:01 +00:00
09384a637a Fix constexpr issues in prefill debug test biondizzle 2026-06-03 03:46:29 +00:00
d3dc8cf901 Add prefill T=2 debug CUDA test with intermediate value printing biondizzle 2026-06-03 03:46:14 +00:00
223c22488f Simplify prefill PV read: use decode kernel's exact pattern biondizzle 2026-06-03 03:22:49 +00:00
2bf5e74e61 Add prefill debug test: compare T=1 decode vs prefill kernel step by step biondizzle 2026-06-03 03:05:25 +00:00
eb69c3bfb9 CRITICAL FIX: add missing tb base in QK TMEM read address biondizzle 2026-06-03 03:00:57 +00:00
99b6de316b Fix prefill kernel: add missing tb base in PV TMEM read, fix ACCUMULATE for per-row PV biondizzle 2026-06-03 02:59:19 +00:00
9034f67b0f Fix prefill kernel: read ALL n_sub PV results (was only n_sub=0) biondizzle 2026-06-03 02:54:59 +00:00
a4ef6c3454 Add B1 mixed FP8 prefill FMHA kernel (T>1 support) biondizzle 2026-06-03 02:50:27 +00:00
1f757151ef Fix router gate BF16 quantize path for production FMHA test biondizzle 2026-06-03 02:47:47 +00:00
07168357cc Fix o_a_proj weight loading: add BF16 fallback for grouped linear biondizzle 2026-06-03 02:38:00 +00:00
27d8d80a40 Fix missing DEVICE constant in production FMHA test biondizzle 2026-06-03 02:31:11 +00:00
26a817c2f2 Fix production FMHA layer test: compare raw FMHA vs SDPA on production gathered KV biondizzle 2026-06-03 02:26:37 +00:00
ba67e055f7 Add production FMHA layer comparison test biondizzle 2026-06-03 02:22:23 +00:00
af58f2c5b2 Add B1 weight/format verification at L0 in single_shot (ref: v-b1-b2-done-20260603) biondizzle 2026-06-03 01:52:55 +00:00
8df5de5477 Update B1 docs with test results and bug fix biondizzle 2026-06-03 01:50:59 +00:00
3e3b352e7e Update FINAL_STRETCH.md: B1 and B2 marked DONE with test results and bug fixes biondizzle 2026-06-03 01:50:21 +00:00
84a02f8995 Remove debug test files, keep production B1/B2 unit tests biondizzle 2026-06-03 01:49:39 +00:00
6fa9ad7852 B2 indexer: adopt TMEM warp-to-row mapping fix biondizzle 2026-06-03 01:42:38 +00:00
6c92ff91f3 B2 indexer: temporary heads 0-31 only while figuring out TMEM row 32-63 layout biondizzle 2026-06-03 01:12:10 +00:00
7732c93f62 Fix B2 indexer: use 16x256b.x1 TMEM read with TMEM_COLS=512 biondizzle 2026-06-03 01:08:48 +00:00
a75a9843af Fix B2 indexer: add sLogits scratch buffer to SMEM layout biondizzle 2026-06-03 00:59:06 +00:00
cc7b17fdaa Fix B2 indexer: use 2-warps for TMEM read (P7 row-slice model) biondizzle 2026-06-03 00:55:27 +00:00
8d0a02ca67 B2 TMEM debug: try stride=SK_TILE/8=16 for row group 32-63 biondizzle 2026-06-03 00:52:32 +00:00
fdf702470c Add B2 TMEM read debug kernel and test biondizzle 2026-06-03 00:50:52 +00:00
f1cf4c0215 Add B2 QK debug test with w_h=1 for simple comparison biondizzle 2026-06-03 00:46:48 +00:00
d36dbba01c Fix B2 indexer: increase TMEM_COLS to 512 for full 128-row MMA output biondizzle 2026-06-03 00:45:15 +00:00
797345dfe9 Add B2 score debug test biondizzle 2026-06-03 00:43:44 +00:00
afb82b9c89 Fix B2 indexer: replace broken 16x256b TMEM read with proven 32x32b.x8 biondizzle 2026-06-03 00:39:49 +00:00
99e50fcb58 Add B2 minimal debug test to find hang point biondizzle 2026-06-03 00:35:48 +00:00
e21bd14408 Fix B1 test LSE reference shape handling biondizzle 2026-06-03 00:25:53 +00:00
4fe7f9dc37 Fix B1 FMHA: swap V matrix canonical layout args (dd, kk) not (kk, dd) biondizzle 2026-06-03 00:24:20 +00:00
29a95a3db6 Add B1 QK vs PV isolation test biondizzle 2026-06-03 00:23:35 +00:00
c322e3f301 Add B1 FMHA debug test for cosine failure investigation biondizzle 2026-06-03 00:22:00 +00:00
5447d1d1dc Add comprehensive B2 FP8 indexer unit test biondizzle 2026-06-03 00:21:29 +00:00
38eecb28d8 Add comprehensive B1 mixed FP8 FMHA unit test biondizzle 2026-06-03 00:20:07 +00:00
f2063c0588 B1: minimal debug test for mixed FP8 FMHA (1 head, N=128) biondizzle 2026-06-03 00:09:36 +00:00
0cea0b33ff B1 test: fix BF16 reference to use PyTorch SDPA biondizzle 2026-06-03 00:07:38 +00:00
a51d19a7fc B1: add mixed FP8 FMHA cosine verification test (HD=512, N=128-2048) biondizzle 2026-06-03 00:06:25 +00:00
b9243fe40a B2: FP8 tensor-core indexer scoring + weighted ReLU + top-k biondizzle 2026-06-02 23:18:54 +00:00
a9d5e09f4c B1: mixed FP8/BF16 decode FMHA integration biondizzle 2026-06-02 22:53:14 +00:00
2eb4f0886e things pre-b1 biondizzle 2026-06-02 22:31:13 +00:00
9d4a014fad Fix NameError: dequantize_nvfp4 not in scope in forward_attention biondizzle 2026-06-02 21:52:29 +00:00
9ba6476d3f auto: pre-test commit biondizzle 2026-06-02 21:39:01 +00:00
845227c06c Fix stale lock file in CUDA loader — prevents infinite spin on crash recovery biondizzle 2026-06-02 21:34:58 +00:00
0b6ca0df80 P5 integration + B3 q_a_norm fused + gsa scalar fix biondizzle 2026-06-02 21:20:34 +00:00
7e42b5e090 A1: Add ◇ (think_start) priming after Assistant token biondizzle 2026-06-02 20:23:47 +00:00
ac4eedc444 auto: pre-test commit biondizzle 2026-06-02 20:16:43 +00:00
ecd48ab65e A1: Add explicit stop set for DSV4 turn-end tokens biondizzle 2026-06-02 19:59:52 +00:00
35dbb8d12b Cleanup Part 2: Fix docs, stale references, dead code biondizzle 2026-06-02 19:27:28 +00:00
f3b551956d Cleanup Step 2: Archive Lineage P code, fix broken imports biondizzle 2026-06-02 19:27:07 +00:00
8de47e26ce Cleanup Step 1: Move root-level files to proper directories biondizzle 2026-06-02 19:24:39 +00:00
b111525af4 Fix indexer documentation and safety issues biondizzle 2026-06-02 19:08:40 +00:00
d770111cb1 Remove stale duplicate .cu files from indexer/ subfolder biondizzle 2026-06-02 18:49:40 +00:00
eb5ef93bf1 Add A/B comparison mode for P4 fused vs unfused RMSNorm+quantize biondizzle 2026-06-02 18:49:30 +00:00
b8bab01a55 Update PERFORMANCE_AUDIT.md — P4 done, P5 kernel done (pending integration) biondizzle 2026-06-02 18:26:01 +00:00
8447ba7138 FIX: Deadlock in indexer_score_topk kernel — __syncthreads inside strided loop biondizzle 2026-06-02 18:11:56 +00:00
c926c4a597 P5: Fix mhc_rmsnorm_quantize_nvfp4 — add proper function definition biondizzle 2026-06-02 17:57:33 +00:00
36fdbeb56d stuff biondizzle 2026-06-02 17:51:46 +00:00
bdf0b15d45 P4: Fix rmsnorm_quantize_nvfp4 returns QuantizedActivation not tuple biondizzle 2026-06-02 17:43:21 +00:00
454dbdad52 P5: Fused mHC pre_block + RMSNorm + NVFP4 quantize kernel biondizzle 2026-06-02 16:39:42 +00:00
7bb3207347 P4: Integrate fused RMSNorm+quantize into single_shot (attention path) biondizzle 2026-06-02 16:38:44 +00:00
0d1cd1e216 P4: Add QuantizedActivation + Nvfp4Linear.run_from_quantized biondizzle 2026-06-02 16:37:38 +00:00
149ecefb56 P4: Relax test thresholds — per-row gsa vs scalar gsa difference expected biondizzle 2026-06-02 16:34:49 +00:00
57ab4b9d4c P4: Fix dequantize_nvfp4 bridge — handle float8_e4m3fn dtype biondizzle 2026-06-02 16:31:56 +00:00
29f836d711 P4: Fix fused RMSNorm kernel — match quantize_nvfp4.cu encoding biondizzle 2026-06-02 16:28:44 +00:00
794ebaf7e5 P4: Fused RMSNorm + NVFP4 quantize kernel (2 launches vs 6+) biondizzle 2026-06-02 16:26:24 +00:00
82294fc21e Fix nope_dim UnboundLocalError — hoist to function scope biondizzle 2026-06-02 11:18:58 +00:00
e231b98387 Fix mHC Sinkhorn test: row sums expected to be off (eps after softmax) biondizzle 2026-06-02 10:46:28 +00:00
b5f29be169 Add mHC Sinkhorn CUDA kernel test biondizzle 2026-06-02 10:45:02 +00:00
6cb5078821 Fix mHC Sinkhorn kernel: remove VLA, remove Python fallback biondizzle 2026-06-02 10:44:53 +00:00
c89762ecdd Fix set_indexer_keys_fp8 None guard + store comp_pos in mixed storage biondizzle 2026-06-02 10:20:26 +00:00
1f69f61363 Add detailed comment: why compressed KV uses FP8 not NVFP4 biondizzle 2026-06-02 10:19:54 +00:00
edc8e7ee8d KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format) biondizzle 2026-06-02 10:08:43 +00:00
12b6365b42 Fix RoPE test: use proper cos/sin cache biondizzle 2026-06-02 10:04:01 +00:00
f566b9b748 Fix FP8 quantize return type (2-tuple not 3) biondizzle 2026-06-02 10:02:01 +00:00
bdb25ee5cd Add production-value unit tests for kv_quantize kernels biondizzle 2026-06-02 10:01:07 +00:00
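
Commits 9574a9dc2e and d518fcb82a describe a reference SDPA in which the sink bias is "denominator-only, no V contribution". The repository's actual test code is not shown on this page, so the following is only a minimal pure-Python sketch of that idea for a single query. The function name `sdpa_with_sink`, its parameters, and the default scaling are assumptions made for illustration, not the project's API.

```python
import math

def sdpa_with_sink(q, keys, values, sink_logit, scale=None):
    """Single-query attention with a sink logit (hypothetical helper).

    The sink logit is added to the softmax denominator only; it has no
    value vector, so it contributes nothing to the numerator.
    """
    d = len(q)
    # Assumed default: the usual 1/sqrt(head_dim) scaling.
    scale = 1.0 / math.sqrt(d) if scale is None else scale

    # Scaled dot-product scores against every key.
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

    # Subtract the running max (including the sink) for numerical stability.
    m = max(max(scores), sink_logit)
    weights = [math.exp(s - m) for s in scores]

    # Denominator includes the sink term; numerator does not.
    denom = sum(weights) + math.exp(sink_logit - m)

    dv = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / denom
            for j in range(dv)]
```

With two identical zero scores, values 2.0 and 4.0, and a sink logit of 0, the denominator is 3 rather than 2, so the output shrinks from 3.0 to 2.0. Passing `float("-inf")` as the sink logit recovers ordinary softmax attention, which is a convenient sanity check when comparing against a kernel.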