nvfp4-megamoe-kernel

biondizzle/nvfp4-megamoe-kernel


Commit Graph

Branches:

master
pre-b1
pure-nvfp4
v-b1-b2-done-20260603
v-c1-c2-c3-20260602
v-e2e-nvfp4-all-projections
v-e2e-paris-32tok-20260601-0549
v-indexer-fix-20260602
v-nvfp4-fused-router-rewrite-20260601-0715
v-nvfp4-router-oa-20260601-0610
v-official-encoding-path
v-p0p1p2p3-fused-swiglu-cuda-rope-20260602
v-perf-part1-p2-reverted-20260602
v-post-indexer-c-fixes-20260602
v-precision-floor-fix-20260603
v-single-shot-paris-20260601-0539
v-working-e2e-20260601-0515
v0.1-e2e-working

7804b779ce diag: print wo_a g_flat magnitude to find where zeros come from biondizzle 2026-06-01 01:40:53 +00:00
efe63caea9 diag: print FMHA output magnitude for first 3 layers biondizzle 2026-06-01 01:34:02 +00:00
7fbbdc5204 diag: validate router output before MoE biondizzle 2026-06-01 01:27:16 +00:00
f5fa84016e diag: sync+error check after each layer on first token biondizzle 2026-06-01 01:26:50 +00:00
91b3929605 fix: call moe_runner.run() and se_runner.run() (not __call__) biondizzle 2026-06-01 01:14:38 +00:00
03c45d4bfb fix: pass int32 token_ids to hash router (was int64) biondizzle 2026-06-01 01:08:03 +00:00
62efde5c9f fix: router — use cuBLAS BF16 GEMM + activation_topk CUDA kernel (production path, not CuTeDSL fused) biondizzle 2026-06-01 01:01:15 +00:00
5591a725e1 fix: router kernel — infer OperandMajorMode from tensor layout (same pattern as MoE GEMM) biondizzle 2026-06-01 00:59:18 +00:00
0ab5d8c317 fix: disable broken CuTeDSL fused router — use BF16 linear + activation_topk (both are production paths) biondizzle 2026-06-01 00:56:00 +00:00
c339fe7ad9 fix: router A operand major mode MN (not K) — fixes CuTeDSL local_tile coord error biondizzle 2026-06-01 00:54:19 +00:00
b7a8c44d26 single_shot: eager MoE/SE weight processing, stale GPU cleanup, --prefill-tokens flag biondizzle 2026-06-01 00:42:08 +00:00
15f45b57c3 fix: correct Nvfp4Linear dimension inference from checkpoint weights biondizzle 2026-06-01 00:32:36 +00:00
e671780008 fix: transpose checkpoint weights before make_b_k_major in Nvfp4Linear/SharedExpert biondizzle 2026-06-01 00:30:37 +00:00
e8a7a9256f fix: convert uint8 checkpoint weights to float4_e2m1fn_x2 for CuTeDSL GEMM biondizzle 2026-06-01 00:18:34 +00:00
172448514c fix: fold weight_scale_2 into global_scale_b for NVFP4 GEMM biondizzle 2026-06-01 00:10:50 +00:00
563df02aef fix: import SF_VEC_SIZE from quantize in gemm_runner (was NameError) biondizzle 2026-06-01 00:04:48 +00:00
be476b2ce2 router: catch CuTeDSL warmup failures fast, don't let MLIR errors slow down init biondizzle 2026-06-01 00:00:07 +00:00
56dff8d185 fix: W_gate is (H, E) but F.linear expects (E, H), transpose before linear biondizzle 2026-05-31 23:55:16 +00:00
5396a04c28 router: broaden except to catch all CuTeDSL errors, fall through to cuBLAS+activation_topk path biondizzle 2026-05-31 23:54:16 +00:00
3b5b9f487c fix: compute num_tma_load_bytes inside cute.compile context biondizzle 2026-05-31 23:53:13 +00:00
1bc0da0f35 fix: properly scope swap code inside else/guard blocks, replace continue with if guard biondizzle 2026-05-31 23:51:43 +00:00
d0d765e1f2 fix: replace break statements with flag-based loops in router kernel (CuTeDSL restriction) biondizzle 2026-05-31 23:50:39 +00:00
210391e571 fix: PersistentTileSchedulerParams constructor takes (problem_shape, cluster_shape) not from_shape biondizzle 2026-05-31 23:49:12 +00:00
824d054ad7 fix: inside cute.compile args are already CuTe tensors, no conversion needed biondizzle 2026-05-31 23:47:33 +00:00
6375e54396 fix: use from_dlpack + mark_layout_dynamic instead of non-existent to_cuTe_tensor in router biondizzle 2026-05-31 23:46:35 +00:00
cb2ca8591f fix: add @cute.jit to router compiled function biondizzle 2026-05-31 23:44:53 +00:00
d5d2b7b4b8 fix: defer router MMA/TMA setup into cute.compile context (matches MoE pattern) biondizzle 2026-05-31 23:44:00 +00:00
157f1c5258 fix: use OperandMajorMode from nvgpu (not deprecated tcgen05) and mma_tiler_mn in router kernel biondizzle 2026-05-31 23:39:50 +00:00
1dbc57e2cd fix: use mma_tiler_mn in _create_tiled_mma (attribute exists at init time) biondizzle 2026-05-31 23:36:01 +00:00
d05dd50bf5 fix: OperandMajorMode.K not MAJOR_K (correct CuTeDSL API) biondizzle 2026-05-31 23:34:54 +00:00
a6a8755439 single_shot: switch to head-packed FMHA dispatch (1 kernel launch vs 128) biondizzle 2026-05-31 23:33:32 +00:00
80002f2efc single_shot: production NVFP4 GEMM for ALL attention projections biondizzle 2026-05-31 23:28:16 +00:00
32efd5139d Fix gate weight transpose: checkpoint is (E, H), Router expects (H, E) biondizzle 2026-05-31 23:21:09 +00:00
e45c0ff51b single_shot: use reference dequant for attn projections, focus on MoE+FMHA biondizzle 2026-05-31 23:20:04 +00:00
dfbffa1df1 single_shot: CUDA_LAUNCH_BLOCKING for debugging biondizzle 2026-05-31 23:18:35 +00:00
a66fdf6049 single_shot: add sync to catch CUDA errors early biondizzle 2026-05-31 23:17:46 +00:00
0b35c36d23 single_shot: memory-efficient MoE loading, lazy Nvfp4Linear init biondizzle 2026-05-31 23:16:45 +00:00
050b5ee449 Fix n_h reference before assignment in single_shot biondizzle 2026-05-31 23:14:24 +00:00
c5adbbfde6 FMHA sink: don't double-scale sink bias biondizzle 2026-05-31 23:12:20 +00:00
4adee1207f FMHA: zero-init my_p_vals to fix N<128 padding NaN biondizzle 2026-05-31 23:11:12 +00:00
13be3ad443 FMHA sink bias in kernel + single_shot production rewrite biondizzle 2026-05-31 23:10:13 +00:00
23e88638aa single_shot: memory-efficient MoE loading (CPU stacking, one-shot GPU transfer) biondizzle 2026-05-31 22:55:11 +00:00
92200367f3 FMHA kernel fix: N_orig vs N_padded — correct softmax masking for seq_len < 128 biondizzle 2026-05-31 22:52:39 +00:00
d40821c843 single_shot: fix memory (no double-loading MoE weights), FMHA short-seq fallback biondizzle 2026-05-31 22:49:15 +00:00
91568e12d4 single_shot_inference.py: production kernel stack version biondizzle 2026-05-31 22:45:44 +00:00
fb96c34b89 rename: single_shot_inference.py → single_shot_PYTORCH_REFERENCE.py biondizzle 2026-05-31 22:42:06 +00:00
79d1a83348 Add NEXT_STEPS.md: post v0.1 issues, kernel migration plan, lessons learned biondizzle 2026-05-31 22:30:34 +00:00
acc20dffd7 CRITICAL FIX: don't fold input_scale into NVFP4 weight dequant (v0.1-e2e-working) biondizzle 2026-05-31 22:03:55 +00:00
4e64acbb64 fix MoE gate BF16/NVFP4 handling, add attention diagnostics biondizzle 2026-05-31 21:57:47 +00:00
0d2b5ceb93 fix positions device mismatch: move to rope cache device in forward_attention biondizzle 2026-05-31 21:54:56 +00:00
2676476013 fix mHC pre_block bmm dtype mismatch: A is FP32, X is BF16 biondizzle 2026-05-31 21:51:59 +00:00
eb08cd06d1 Rewrite single_shot_inference.py: correct weight keys, NVFP4 two-level scale, compressor+indexer connected biondizzle 2026-05-31 21:48:59 +00:00
4988e77179 probe key format biondizzle 2026-05-31 21:42:52 +00:00
ba915dbd53 add probe_shapes script biondizzle 2026-05-31 21:41:31 +00:00
c54dd15550 find hc keys biondizzle 2026-05-31 21:38:43 +00:00
52b4971711 Full E2E single-shot: compressor, indexer, correct checkpoint keys (layers.{li}.attn/ffn) biondizzle 2026-05-31 21:36:17 +00:00
cec17fee7d fixed prefix biondizzle 2026-05-31 21:26:04 +00:00
696f3261ab focused key dump biondizzle 2026-05-31 21:25:31 +00:00
b7c9bb1262 dump all keys biondizzle 2026-05-31 21:24:58 +00:00
54e2a3684a filter expert keys biondizzle 2026-05-31 21:24:35 +00:00
bafabda01f add checkpoint key dump script biondizzle 2026-05-31 21:24:14 +00:00
23f1cf4065 Fix HcHead: use FP32 for RMSNorm + linear (matches HF reference) biondizzle 2026-05-31 21:13:21 +00:00
274ea13251 Fix critical bug: add hc_head for final mHC readout (was using stream 0) biondizzle 2026-05-31 21:13:02 +00:00
baee36e728 Fix dtype mismatch in validate_layer: cast flat to float before F.linear biondizzle 2026-05-31 20:23:18 +00:00
46c4ef2cf5 Add per-layer validation test (tests/validate_layer.py) biondizzle 2026-05-31 20:22:13 +00:00
abe4210367 Add compact per-layer residual trace (GROWTH_DIAG), disable verbose ATTN_DIAG biondizzle 2026-05-31 20:21:03 +00:00
98fa410167 Add HF reference test script biondizzle 2026-05-31 20:11:37 +00:00
a1b39adcaa Add attention entropy diag (ATTN_DIAG), KV cache diag, --no-thinking mode biondizzle 2026-05-31 19:29:55 +00:00
2a886fe0f2 Add --no-thinking mode to skip thinking tokens and use second-best biondizzle 2026-05-31 19:24:21 +00:00
41ef0ebd0f Add KV cache length diagnostic during decode biondizzle 2026-05-31 19:17:24 +00:00
8baebf3c2e Restore --skip-mhc arg, empty system prompt for testing biondizzle 2026-05-31 19:04:53 +00:00
ca661d32e8 Empty system prompt for testing (was causing model to regurgitate AI assistant tokens) biondizzle 2026-05-31 19:03:55 +00:00
b09b2cf511 Fix MoE routing: hash layers 0-2 (tid2eid), e_score_correction_bias for layers 3+ biondizzle 2026-05-31 18:52:38 +00:00
7d9e70c5d5 Fix remaining mHC API references: layer_compare.py, layer.py comment biondizzle 2026-05-31 18:38:34 +00:00
7b123d159f CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax biondizzle 2026-05-31 18:38:12 +00:00
f6c02f808f Add layer-by-layer comparison test for debugging biondizzle 2026-05-31 12:48:43 +00:00
6ad577bd18 Add HuggingFace reference comparison test biondizzle 2026-05-31 12:05:19 +00:00
581c4170f9 Fix sink logits shape: (n_h, T, 1) for concatenation with (n_h, T, seq_len) biondizzle 2026-05-31 11:57:23 +00:00
0f951a0b1a Fix attention sinks: logit bias (HuggingFace reference), not dummy KV biondizzle 2026-05-31 11:53:43 +00:00
daed594902 CRITICAL FIX: Add missing q_b_norm (unweighted RMSNorm after q_b_proj) biondizzle 2026-05-31 11:47:16 +00:00
dd50c355a6 Fix MHC_DIAG null check when SKIP_MHC is enabled biondizzle 2026-05-31 11:37:32 +00:00
631e6ea3e4 Add --skip-mhc flag for simple residual diagnostic biondizzle 2026-05-31 11:33:41 +00:00
d201a9334e CRITICAL FIX: Add YaRN RoPE scaling (factor=16) biondizzle 2026-05-31 11:25:52 +00:00
88719f39b4 Add single-layer trace (Phase 2.6) for detailed debugging biondizzle 2026-05-31 11:20:46 +00:00
8256e23aed Fix mHCContext attribute access (not tuple unpacking) and enable attention diag biondizzle 2026-05-31 11:10:37 +00:00
72c139a59f Enable MHC_DIAG for diagnostic run biondizzle 2026-05-31 11:07:23 +00:00
cd661c2e40 Add attention and Q/KV diagnostics (MHC_DIAG flag) biondizzle 2026-05-31 11:07:17 +00:00
9584fcbc23 Fix top5_ids variable name in decode logging biondizzle 2026-05-31 10:54:40 +00:00
a6d56d10ca Add top-20 logging and thinking token detection in decode loop biondizzle 2026-05-31 10:49:28 +00:00
d891ae7e96 Fix prompt format: use DeepSeek V4 chat tokens biondizzle 2026-05-31 10:33:41 +00:00
f86742ef8e Cache layer weights on GPU — eliminates per-token CPU→GPU transfer biondizzle 2026-05-31 10:28:25 +00:00
ce3d6069cc CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post] biondizzle 2026-05-31 10:07:14 +00:00
9a43e9aa77 CRITICAL FIX: mHC fn weight row ordering was wrong biondizzle 2026-05-31 10:02:57 +00:00
0346e479d4 Add system prompt, CLI args, inverse RoPE flag, minimal e2e test biondizzle 2026-05-31 09:56:18 +00:00
429fc3db40 Fix expert weight indexing for 1D tensor biondizzle 2026-05-31 09:23:10 +00:00
33004dcbf4 Fix expert weight broadcasting (wt.item() for scalar multiply) biondizzle 2026-05-31 09:22:27 +00:00
1434b35971 Add residual diagnostic test — per-layer magnitude tracking biondizzle 2026-05-31 09:21:41 +00:00
1c18c16c68 Fix production rope.py: FP32 arithmetic for forward_rope_partial + inverse_rope_bf16 biondizzle 2026-05-31 09:17:36 +00:00
970869d017 Fix mHCBlock import + relax RoPE round-trip threshold (BF16 noise expected) biondizzle 2026-05-31 09:17:07 +00:00
a2ee78b564 Fix RoPE shape bug (interleave needs separate even/odd assembly) biondizzle 2026-05-31 09:15:59 +00:00
