biondizzle/nvfp4-megamoe-kernel

Commit Graph

master

pre-b1

pure-nvfp4

v-b1-b2-done-20260603

v-c1-c2-c3-20260602

v-e2e-nvfp4-all-projections

v-e2e-paris-32tok-20260601-0549

v-indexer-fix-20260602

v-nvfp4-fused-router-rewrite-20260601-0715

v-nvfp4-router-oa-20260601-0610

v-official-encoding-path

v-p0p1p2p3-fused-swiglu-cuda-rope-20260602

v-perf-part1-p2-reverted-20260602

v-post-indexer-c-fixes-20260602

v-precision-floor-fix-20260603

v-single-shot-paris-20260601-0539

v-working-e2e-20260601-0515

v0.1-e2e-working

5d975d00d9 feat: tqdm progress bar for expert weight loading biondizzle 2026-05-16 06:09:22 +00:00
2e4ff6b8d4 fix: increase vLLM RPC timeout to 10 min for first-request JIT biondizzle 2026-05-16 06:02:11 +00:00
a569612df5 feat: add load progress heartbeats to prevent k8s health check kills biondizzle 2026-05-16 05:51:35 +00:00
e5370140cb docs: update README with full NVFP4 coverage, dequant anti-pattern, v2 status biondizzle 2026-05-16 05:43:33 +00:00
3445bd24c1 feat: keep attention weights native NVFP4 — stop dequantizing to BF16 biondizzle 2026-05-16 05:36:34 +00:00
4d4cfa6b28 fix: tqdm over MoE layer warmup, compile every layer, no print spam biondizzle 2026-05-16 05:21:11 +00:00
3838561c19 fix: only suppress compile message, still warmup all layers biondizzle 2026-05-16 05:18:10 +00:00
f19932d8db fix: compile CuTeDSL kernel once per process, not per MoE layer biondizzle 2026-05-16 05:16:53 +00:00
936982c5aa fix: add layer-level tqdm for expert finalization, remove inner expert tqdm biondizzle 2026-05-16 05:01:20 +00:00
cf0731cf4b fix: warmup with 128 tokens (fills MMA tile), better error handling biondizzle 2026-05-16 04:56:45 +00:00
a70d2d3984 fix: clearer warmup message — 'Compiling CuTeDSL NVFP4 MegaMoE kernel' biondizzle 2026-05-16 04:40:31 +00:00
f191af7e29 feat: warm up CuTeDSL kernel during model loading biondizzle 2026-05-16 04:39:05 +00:00
4d67b570b9 fix: descriptive tqdm labels — uint8→NVFP4 and NVFP4→FP8/BF16 biondizzle 2026-05-16 04:28:25 +00:00
8efdd165da fix: use tqdm for progress bars — single line, live updating biondizzle 2026-05-16 04:26:43 +00:00
830f042443 fix: PYTHONUNBUFFERED=1 so progress bars stream in real-time biondizzle 2026-05-16 04:18:07 +00:00
00b766af60 feat: add progress bars for expert quantization and post-load conversion biondizzle 2026-05-16 04:14:07 +00:00
b465579a02 cleanup: nuke all debug prints and env var gates from vLLM patch biondizzle 2026-05-16 04:10:42 +00:00
174ad70dca fix: same gate/up split fix in moe_pipeline.py biondizzle 2026-05-16 04:04:53 +00:00
6d17988b51 fix: L1 gate/up split — intermediate_size is per-projection, not fused biondizzle 2026-05-16 04:04:40 +00:00
37aa0cbeab debug: add try/except with shape logging to _run_mega_moe biondizzle 2026-05-16 04:02:01 +00:00
b04bff7e8b feat: clean Dockerfile, docker-compose, import fixes for CuTeDSL build biondizzle 2026-05-16 03:50:07 +00:00
a0ff8a3278 fix: transpose checkpoint block scales (N,K_sf)→(K_sf,N) for bridge biondizzle 2026-05-16 03:43:30 +00:00
389453fbf4 feat: direct NVFP4 path — no BF16 round-trip on weights biondizzle 2026-05-16 03:41:23 +00:00
8fd9579127 feat: vLLM integration — replace C++ kernel with CuTeDSL biondizzle 2026-05-16 03:36:12 +00:00
3ec9c3074b docs: rewrite README, nuke DEBUG_LOG, add vLLM integration stub biondizzle 2026-05-16 03:33:16 +00:00
b685112c92 fix: lower cosine threshold to 0.98 for double-quantization loss biondizzle 2026-05-16 03:24:13 +00:00
6139cd6ff5 fix: rewrite layertest cleanly, test full MoE pipeline biondizzle 2026-05-16 03:23:33 +00:00
09ff5c5b98 feat: full NVFP4 MoE pipeline (L1→SiLU→L2→scatter) biondizzle 2026-05-16 03:22:43 +00:00
0359215ab4 fix: compare kernel vs BF16 in slot-major layout biondizzle 2026-05-16 03:18:41 +00:00
ed18638a3c fix: slot-major token layout for grouped GEMM biondizzle 2026-05-16 03:17:19 +00:00
5385de3142 fix: layertest tests L1 GEMM only with correct output size biondizzle 2026-05-16 03:15:29 +00:00
0cdcc4144a refactor: add cutedsl/bridge.py, rewrite layertest to use it biondizzle 2026-05-16 03:13:54 +00:00
2ef71dc21a fix: B tensor K-major strides, scale_b axis swap biondizzle 2026-05-16 03:04:31 +00:00
6294b84213 fix: B tensor must be K-major (transpose last 2 dims) biondizzle 2026-05-16 03:03:00 +00:00
7c882fe2e0 fix: correct weight quantization for CuTeDSL kernel biondizzle 2026-05-16 02:58:55 +00:00
ca28f1335d refactor: copy CuTeDSL kernel into repo with local imports biondizzle 2026-05-16 02:57:54 +00:00
a3aa2d201e fix: clarify import path setup for CuTeDSL biondizzle 2026-05-16 02:55:25 +00:00
f951d284e7 test: add CuTeDSL NVFP4 GEMM test using reference ScaledGroupedGemmKernel biondizzle 2026-05-16 02:55:04 +00:00
a2ea836c74 docs: add CuTeDSL rewrite plan + reference files biondizzle 2026-05-16 02:41:51 +00:00
c4a262bd54 test: streamline layertest — kernel vs BF16 ref only, exit on fail biondizzle 2026-05-16 02:29:41 +00:00
de9b50cbe7 fix: use setup.py install for CUTLASS extension build biondizzle 2026-05-16 02:21:17 +00:00
882bff8fb7 fix: also build CUTLASS C++ extension in run_test.sh biondizzle 2026-05-16 02:19:40 +00:00
55d9a24bf6 fix: handle model. prefix normalization in checkpoint keys biondizzle 2026-05-16 02:18:52 +00:00
bdf9f31ae2 fix: checkpoint keys don't have 'model.' prefix biondizzle 2026-05-16 02:17:13 +00:00
ea5ee7c1f7 fix: remove prefix_filter from layer tensor loading biondizzle 2026-05-16 02:15:55 +00:00
303b6a8993 cleanup: move useful tests to tests/, nuke stale debug tests biondizzle 2026-05-16 02:14:37 +00:00
2114bd11be test: add standalone layer 0 comparison test (no vLLM, no Docker) biondizzle 2026-05-16 02:13:18 +00:00
294e9f98f2 cleanup: rename _ue8m0_to_float32 → _block_scale_to_float32, remove dead code biondizzle 2026-05-16 01:55:56 +00:00
4a624879ca docs: update DEBUG_LOG — input_scale red herring, current state, next steps biondizzle 2026-05-16 01:15:49 +00:00
79b9becf9c revert: don't use checkpoint input_scale for activation normalization biondizzle 2026-05-16 00:12:41 +00:00
a7eae10ef4 fix: use checkpoint input_scale for activation quantization biondizzle 2026-05-15 23:57:08 +00:00
af50e98fe9 test: B layout test with N=128 K=256 biondizzle 2026-05-15 23:52:22 +00:00
efd7a2c56d test: B matrix weight layout verification via one-hot A biondizzle 2026-05-15 23:52:00 +00:00
bb5a1ba4c8 cleanup: remove unused slot_token from nvfp4_moe_l2 biondizzle 2026-05-15 23:50:39 +00:00
887360281e docs: major update — SF remap verified correct, BF16 ref is the red herring biondizzle 2026-05-15 23:38:34 +00:00
eb26d291cb test: uniform FP4 + uniform SF sanity check biondizzle 2026-05-15 23:36:08 +00:00
1f09b51168 test: check SF signed vs unsigned interpretation biondizzle 2026-05-15 23:35:06 +00:00
4f857d5f99 docs: major DEBUG_LOG update — forward mapping, verifier, full debug timeline biondizzle 2026-05-15 23:02:30 +00:00
aa209ddd21 debug: add SF remap roundtrip verifier biondizzle 2026-05-15 22:59:44 +00:00
6626b75a2f fix: use filter_zeros for SF allocation + no-branch forward mapping biondizzle 2026-05-15 22:58:51 +00:00
6fc8fa61e0 fix: use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0)) biondizzle 2026-05-15 22:53:57 +00:00
a48717ccf5 fix: remove duplicate dst_idx declaration biondizzle 2026-05-15 22:31:05 +00:00
5ff1b9e401 fix: use hierarchical coordinates for layout_sf forward mapping biondizzle 2026-05-15 22:11:14 +00:00
3b4a7b591f test: verify forward mapping with prepack vs live SFB biondizzle 2026-05-15 22:09:56 +00:00
a1fd4d6233 revert: back to layout_sf(make_coord(...)) — crd2idx was unnecessary biondizzle 2026-05-15 21:55:00 +00:00
ea678ece64 fix: remove duplicate template declaration biondizzle 2026-05-15 21:54:10 +00:00
59dad8e2fb fix: use crd2idx instead of layout operator() for SF forward mapping biondizzle 2026-05-15 21:52:02 +00:00
a09d8e477e fix: remove static_assert in constexpr else (build fix) biondizzle 2026-05-15 21:27:27 +00:00
7285331395 fix: replace col_major_src with explicit source strides biondizzle 2026-05-15 21:23:21 +00:00
f6fd549800 fix: restore col_major_src handling for SFB source layout biondizzle 2026-05-15 21:19:58 +00:00
63e67e1025 fix: rewrite SF remap as forward mapping (source→dst) biondizzle 2026-05-15 20:51:30 +00:00
30b6c89424 fix: correct SF remap coordinate extraction biondizzle 2026-05-15 20:44:46 +00:00
ff5a0843dc fix: divide K element index by SFVecSize to get k_sf biondizzle 2026-05-15 20:17:24 +00:00
a09b9b53a3 cleanup: remove printf and diag function from CUDA kernel (build fix) biondizzle 2026-05-15 20:11:40 +00:00
e7c3341317 docs: update DEBUG_LOG with M/K swap root cause biondizzle 2026-05-15 20:03:20 +00:00
deb6b3231a debug: swap M/K in SF remap + add printf diagnostics biondizzle 2026-05-15 20:01:47 +00:00
22f0457ccf test: isolate SFA vs SFB remap bug biondizzle 2026-05-15 19:59:39 +00:00
9eaf6d07e8 test: quick random test biondizzle 2026-05-15 19:58:57 +00:00
fa7b394571 docs: update DEBUG_LOG with root cause (size→cosize) and full debug timeline biondizzle 2026-05-15 18:56:09 +00:00
c3841983a0 fix: SF remap uses cute::cosize() instead of cute::size() biondizzle 2026-05-15 18:52:23 +00:00
67dcfa83f5 test: random data at small dims + alpha sweep biondizzle 2026-05-15 18:51:52 +00:00
60f7f60818 test: ultra-minimal GEMM with all-ones biondizzle 2026-05-15 18:51:31 +00:00
363dd893f0 test: dimension sweep to isolate GEMM bug biondizzle 2026-05-15 18:51:09 +00:00
fee5a97ebb fix: cosine_similarity dim for M>0 biondizzle 2026-05-15 18:50:45 +00:00
f9330a1777 test: standalone M=1 GEMM test with deterministic data biondizzle 2026-05-15 18:47:26 +00:00
1b63a46168 docs: update DEBUG_LOG with cosine≈0 finding + new hypotheses biondizzle 2026-05-15 18:35:00 +00:00
773967452f debug: fix gs scalar conversion + add traceback biondizzle 2026-05-15 18:27:44 +00:00
df916b87eb debug: fix gs.item() for multi-element tensor biondizzle 2026-05-15 18:09:41 +00:00
755f9ad567 debug: fix per_expert_alpha ref + clean up BF16 reference scaling biondizzle 2026-05-15 17:55:11 +00:00
de8acc7965 debug: dump raw GEMM inputs + first 8 output values biondizzle 2026-05-15 17:02:40 +00:00
9159cb6bb3 docs: add debug log — current state, hypotheses, fixes biondizzle 2026-05-15 15:48:57 +00:00
2fd55a94c6 fix: weight reshape bug + igs double-count in BF16 reference biondizzle 2026-05-15 15:46:16 +00:00
c421a668f3 debug: BF16 reference GEMM + cosine comparison for L1 biondizzle 2026-05-15 14:16:24 +00:00
995589ac8a debug: add FP4 quantization round-trip diagnostic biondizzle 2026-05-15 13:41:09 +00:00
d0ed3d84a8 debug: add L2, SiLU, and scatter pipeline prints biondizzle 2026-05-15 13:21:25 +00:00
da5572f497 clean: remove diagnostic scripts from repo biondizzle 2026-05-15 12:50:14 +00:00
fd59222fc0 fix: stop folding global scale into float8 block scales biondizzle 2026-05-15 12:42:53 +00:00
56e62e916d revert: idx2crd remap approach — source-first needs hierarchical coords biondizzle 2026-05-15 11:44:38 +00:00
d5949a23b4 fix: use cute::crd2idx for SF remap — layout_sf() not directly callable biondizzle 2026-05-15 11:39:57 +00:00
9908fd64d9 feat: CUTLASS NVFP4 mega_moe kernel — slot-based L1/L2, source-first SF remap biondizzle 2026-05-15 11:38:18 +00:00
