nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel


Commit Graph

Branches:

master

pre-b1

pure-nvfp4

v-b1-b2-done-20260603

v-c1-c2-c3-20260602

v-e2e-nvfp4-all-projections

v-e2e-paris-32tok-20260601-0549

v-indexer-fix-20260602

v-nvfp4-fused-router-rewrite-20260601-0715

v-nvfp4-router-oa-20260601-0610

v-official-encoding-path

v-p0p1p2p3-fused-swiglu-cuda-rope-20260602

v-perf-part1-p2-reverted-20260602

v-post-indexer-c-fixes-20260602

v-precision-floor-fix-20260603

v-single-shot-paris-20260601-0539

v-working-e2e-20260601-0515

v0.1-e2e-working

9d96c2fbbf CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip biondizzle 2026-05-31 09:14:59 +00:00
db74a887ab Add minimal e2e test + fix MoE expert loop bug (indentation) biondizzle 2026-05-31 09:14:03 +00:00
e195d9d3a7 add SKIP_ROUTED_MOE debug flag, re-enable sinks biondizzle 2026-05-31 07:02:38 +00:00
4f28673bec debug: disable sinks in SDPA to check |X| impact biondizzle 2026-05-31 06:51:58 +00:00
e3db90b56c switch back to original prompt biondizzle 2026-05-31 06:40:01 +00:00
d2cf5ccc32 CRITICAL FIX: use SDPA for short sequences (FMHA padding bug) biondizzle 2026-05-31 06:39:23 +00:00
5f98855141 test with simpler prompt biondizzle 2026-05-31 06:28:45 +00:00
152af7295a debug: compare FMHA vs SDPA output at layer 0 biondizzle 2026-05-31 06:16:58 +00:00
59c75ca4e9 fix: cast attn_out back to BF16 after sink correction biondizzle 2026-05-31 06:07:06 +00:00
e5245ea34e fix: V tensor must be (B, n_h, hd, N) for FMHA — was transposed wrong biondizzle 2026-05-31 06:03:13 +00:00
91abf0f921 FMHA + analytic sink bias correction using LSE biondizzle 2026-05-31 05:58:01 +00:00
fac269c938 fix verify_attention: proper multi-head SDPA + GQA biondizzle 2026-05-31 05:55:10 +00:00
2333fc8b4b fix verify_attention.py: proper nvfp4_linear calls biondizzle 2026-05-31 05:53:49 +00:00
c09f68c867 add verify_attention.py: single-layer attention component test biondizzle 2026-05-31 05:51:36 +00:00
04dd7545b3 switch to production FMHA for full run biondizzle 2026-05-31 04:51:16 +00:00
738088cf49 revert: K=V with RoPE + inverse RoPE is the correct DSV4 approach biondizzle 2026-05-31 04:51:10 +00:00
781ee43521 try separate K (RoPE'd) and V (raw) — no inverse RoPE needed biondizzle 2026-05-31 04:46:14 +00:00
889521009b re-enable inverse RoPE (confirmed necessary — without it output is garbage) biondizzle 2026-05-31 04:45:58 +00:00
92e465ca04 debug: disable inverse RoPE to check impact on output biondizzle 2026-05-31 04:40:34 +00:00
c69dc51b3b switch to SDPA with sinks (better residual control) biondizzle 2026-05-31 04:38:41 +00:00
3ed8f3cc44 switch back to production FMHA kernel (with FP4 LUT fix) biondizzle 2026-05-31 04:32:01 +00:00
ae79bd8fce debug: add top-5 logit predictions biondizzle 2026-05-31 04:25:01 +00:00
aafe2eee12 CRITICAL FIX: FP4 LUT was 4x too large! biondizzle 2026-05-31 04:16:13 +00:00
b8c8da91fe fix: restore RoPE functions that were lost during mHC refactor biondizzle 2026-05-31 04:10:51 +00:00
3f04a72af4 refactor: use production mHCLayer from dsv4.layers.mhc biondizzle 2026-05-31 04:06:58 +00:00
b519108cab fix: restore kv_cache.append that was accidentally removed biondizzle 2026-05-31 03:56:58 +00:00
22a89b5a45 add attention sinks to SDPA path (paper D5c) biondizzle 2026-05-31 03:52:59 +00:00
1905f19b8d fix: define q_input before USE_SDPA branch biondizzle 2026-05-31 03:45:09 +00:00
cd073ad867 use PyTorch SDPA for correctness (no sink bias in FMHA kernel yet) biondizzle 2026-05-31 03:42:03 +00:00
171a9e0d10 disable diagnostics for clean production run biondizzle 2026-05-31 03:32:17 +00:00
3f9b441428 diag: fix n_layers reference in forward_layer, add late-layer diags biondizzle 2026-05-31 03:28:53 +00:00
5b834a0599 diag: add late-layer diagnostics, fix ffn ctx variable biondizzle 2026-05-31 03:25:55 +00:00
690c0a1121 CRITICAL FIX: mHC base/scale ordering was wrong biondizzle 2026-05-31 03:16:07 +00:00
c3a2656c48 diag: add FFN and pre_block diagnostics biondizzle 2026-05-31 03:12:52 +00:00
79ba7e6636 diag: add mHC diagnostics for first 3 layers biondizzle 2026-05-31 03:10:05 +00:00
a262492e51 fix: FMHA K/V tensor shape (was permuting cache), add q_a_norm and kv_norm biondizzle 2026-05-31 03:04:53 +00:00
3f12bbc374 fix: move positions tensor to correct GPU for RoPE biondizzle 2026-05-31 02:54:47 +00:00
0c3d168c60 single_shot: stream weights per-layer from CPU, fix KV/RoPE logic biondizzle 2026-05-31 02:53:40 +00:00
61160ace13 fix: expert_weights/ids scoping in hash routing path biondizzle 2026-05-31 02:50:32 +00:00
d772885d7e single_shot_inference: proper mHC+RMSNorm+inverse RoPE pipeline biondizzle 2026-05-31 02:45:52 +00:00
523b0e47b1 Add gentle RMSNorm: only clamps when values exceed unit norm biondizzle 2026-05-31 00:31:34 +00:00
dcbb74841a Remove emergency RMSNorm from mHC post_block — MoE provides balance now biondizzle 2026-05-31 00:27:48 +00:00
1de241ccfe Fix: add all_tokens tracking for decode loop biondizzle 2026-05-31 00:22:08 +00:00
b1dd59293a Add prefill: process prompt tokens to fill KV cache before decoding biondizzle 2026-05-31 00:18:55 +00:00
178fb5483a Fix KV cache: use index 0 (one-layer cache per layer instance) biondizzle 2026-05-31 00:14:58 +00:00
afcc690ddc Add full MoE routing + KV cache to single_shot biondizzle 2026-05-31 00:11:15 +00:00
3ecfbcba57 Fix T scope in post_block biondizzle 2026-05-31 00:02:29 +00:00
a493f72681 Add per-residual RMSNorm in mHC post_block (routed MoE missing) biondizzle 2026-05-30 23:59:19 +00:00
49282fe206 Fix mHC: match vLLM torch reference exactly biondizzle 2026-05-30 23:55:27 +00:00
66a66f8244 Add per-layer NaN tracking for mHC debug biondizzle 2026-05-30 23:48:32 +00:00
d003c4b7cc Add mHC (Manifold-Constrained Hyper-Connections) to single_shot biondizzle 2026-05-30 23:45:18 +00:00
f567c20539 Fix: set active CUDA device per layer for BMM/FMHA biondizzle 2026-05-30 23:39:45 +00:00
7a95983e0f Rewrite single_shot: 8-GPU pipeline parallel biondizzle 2026-05-30 23:36:14 +00:00
aac0fa1f08 Update STATUS.md + MEMORY.md: single-shot inference verified biondizzle 2026-05-30 22:59:27 +00:00
11c010e567 Update output section: kernel verified, architecture gaps noted biondizzle 2026-05-30 22:58:49 +00:00
53178d2536 Add emergency RMSNorm after residuals (missing mHC fallback) biondizzle 2026-05-30 22:56:16 +00:00
172ba75e0c Add per-layer NaN check to track where values diverge biondizzle 2026-05-30 22:54:57 +00:00
ec7846e28c Add NaN tracking to single_shot_inference biondizzle 2026-05-30 22:53:09 +00:00
5fa6c88b17 Fix: replace FP4 Inf with 24 (avoid NaN in dequant) biondizzle 2026-05-30 22:51:10 +00:00
904753f62a Fix: BMM batch dim alignment for wo_a biondizzle 2026-05-30 22:49:21 +00:00
52df3bc26c Fix: wo_a as batched matmul (grouped linear for output projection) biondizzle 2026-05-30 22:48:31 +00:00
19240608d7 Fix: handle o_a_proj grouped linear shape mismatch biondizzle 2026-05-30 22:46:12 +00:00
1d02758416 Fix: kv_proj outputs hd=512 (1 KV head MQA), Z from compressor.gate_proj biondizzle 2026-05-30 22:45:14 +00:00
5dcfb333ea Fix: move weight tensors to CUDA before dequant biondizzle 2026-05-30 22:43:47 +00:00
47c7b3c50b Fix: ensure FP4 LUT on CUDA before index op biondizzle 2026-05-30 22:43:01 +00:00
13bae9dd55 Fix single_shot: mHC replaces layernorm, no hidden-level norm in DSV4 biondizzle 2026-05-30 22:42:17 +00:00
e8334fc4af Rewrite single_shot_inference.py — complete forward pass biondizzle 2026-05-30 22:40:56 +00:00
9b0858aa35 Add single_shot_inference.py — baseline kernel verification biondizzle 2026-05-30 22:39:01 +00:00
4472928506 E3: model construction test biondizzle 2026-05-30 21:22:34 +00:00
afc07a5d1a Update STATUS.md: E5 done biondizzle 2026-05-30 21:21:47 +00:00
df6220abaf E5: Fold batch loop into native kernel grid (blockIdx.z) biondizzle 2026-05-30 21:21:02 +00:00
e162a2d112 Update STATUS.md: E1-E4 done biondizzle 2026-05-30 21:20:10 +00:00
c4b40dd06c E2: CSA/HCA integration test — gather + FMHA end-to-end biondizzle 2026-05-30 21:19:28 +00:00
9d88769f5f Wire indexer compute_index_scores_topk + fix compressor imports biondizzle 2026-05-30 21:19:06 +00:00
daf84524ac E2/E3: compressor bridge, indexer bridge, flush pipeline wiring biondizzle 2026-05-30 21:16:54 +00:00
d3b772196d E3: Implement DSV4Model — full model class biondizzle 2026-05-30 21:15:57 +00:00
b0cdd5af74 fix: extern declarations for gather_swa functions in gather_kv.cu biondizzle 2026-05-30 21:14:15 +00:00
016d722abc fix: single PYBIND11_MODULE for combined gather .so biondizzle 2026-05-30 21:13:24 +00:00
8fb9d89658 fix: correct gather.py kernel_dir path biondizzle 2026-05-30 21:12:09 +00:00
924707a673 fix: add FFNType/RouterMode to LayerSpec in e2e test biondizzle 2026-05-30 21:11:04 +00:00
e2e21c6350 fix: remove unused pytest import from e2e test biondizzle 2026-05-30 21:10:43 +00:00
300dddedc0 E1-E4: gather kernels, handle wiring, rope, sync removal, e2e test biondizzle 2026-05-30 21:10:26 +00:00
faf92b30ad E1: Wire LayerCacheHandle gather methods + CUDA gather kernels biondizzle 2026-05-30 21:09:21 +00:00
4b9eed02e1 Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files biondizzle 2026-05-30 21:08:12 +00:00
a360fa308a P6-P8: Update NEXT_PRIORITIES.md with completion status biondizzle 2026-05-30 17:28:02 +00:00
2c18609296 P8: Fix P6 test imports after deleting multihead module biondizzle 2026-05-30 17:25:01 +00:00
e1b9e94c24 P8: Fix test imports after deleting multihead module biondizzle 2026-05-30 17:23:13 +00:00
95725f1df0 P8: Delete 6 redundant .cuh variants + multihead CAPI/op biondizzle 2026-05-30 17:21:15 +00:00
9d483b1c54 P8: Unified dispatch — multi-tile kernel handles all N biondizzle 2026-05-30 17:19:09 +00:00
e747742598 P7: Document TMEM column layout, add multi-row softmax test biondizzle 2026-05-30 17:17:54 +00:00
f1ce47e3c9 P7: Add TMEM column layout probe test biondizzle 2026-05-30 17:14:50 +00:00
5e5217bfc3 P6: Relax test gate to 0.999990 (SMEM staging adds tiny BF16 noise) biondizzle 2026-05-30 17:13:20 +00:00
11d15d9e72 P6: Clean up test — remove broken TMA store test, update epilogue test biondizzle 2026-05-30 17:12:23 +00:00
c0379a0f86 P6: Remove broken TMA store — use direct GMEM write from SMEM biondizzle 2026-05-30 17:11:17 +00:00
f97359fbfc P6: TMA store uses mbarrier completion (same as load) biondizzle 2026-05-30 17:07:24 +00:00
2de300e281 P6: Try shared::cluster instead of shared::cta for TMA store biondizzle 2026-05-30 17:05:27 +00:00
829a5f93ce P6: Fix TMA store PTX — remove .tile modifier, fix wait_group syntax biondizzle 2026-05-30 17:04:38 +00:00
e4ee9fdc9f P6: Fix host-side BF16→FP32 conversion in test biondizzle 2026-05-30 17:01:13 +00:00
a88b321433 P6: Fix host-side BF16 conversion in test biondizzle 2026-05-30 17:00:51 +00:00
1a87e054db P6: Fix constexpr and bf16 conversion in CUDA test biondizzle 2026-05-30 17:00:05 +00:00
