biondizzle/nvfp4-megamoe-kernel

Commit Graph

master

pre-b1

pure-nvfp4

v-b1-b2-done-20260603

v-c1-c2-c3-20260602

v-e2e-nvfp4-all-projections

v-e2e-paris-32tok-20260601-0549

v-indexer-fix-20260602

v-nvfp4-fused-router-rewrite-20260601-0715

v-nvfp4-router-oa-20260601-0610

v-official-encoding-path

v-p0p1p2p3-fused-swiglu-cuda-rope-20260602

v-perf-part1-p2-reverted-20260602

v-post-indexer-c-fixes-20260602

v-precision-floor-fix-20260603

v-single-shot-paris-20260601-0539

v-working-e2e-20260601-0515

v0.1-e2e-working

fbc1e883f2 Add try/except around fused NVFP4 gate loading with error reporting biondizzle 2026-06-01 11:08:06 +00:00
5f38430423 Fix: use 1-dim tensors for gate_ws2 and gate_input_scale biondizzle 2026-06-01 11:05:09 +00:00
ec8f292112 Fix: use self.mma_tiler_mnk (full K=64) for SMEM layout computation biondizzle 2026-06-01 11:03:08 +00:00
44fb9b6c00 Fix: pass self.mma_tiler_mnk (full K) to _compute_stages, not self.mma_tiler (K=1 placeholder) biondizzle 2026-06-01 10:55:43 +00:00
be2bb2fe84 Fix: self.mma_tiler_mnk not mma_tiler_mnk biondizzle 2026-06-01 10:49:05 +00:00
c082843ecc Fix: mma_tiler K=1 placeholder in __init__, refined in _setup_attributes biondizzle 2026-06-01 10:42:21 +00:00
e0f60b9f05 Fix fused router: plain ints for mma_tiler + @cute.jit pattern biondizzle 2026-06-01 10:37:15 +00:00
057ae2101e CRITICAL FIX: Move tiled_mma creation and _setup_attributes OUTSIDE @cute.jit biondizzle 2026-06-01 10:28:01 +00:00
71deeb91a9 Quantize BF16 gate weight to NVFP4 for fused router + add global scales to GEMM biondizzle 2026-06-01 10:14:29 +00:00
24fed15ed6 Fix: convert PyTorch tensors to CuTe tensors for fused router kernel biondizzle 2026-06-01 10:02:40 +00:00
bab748763e Rewrite NVFP4 fused router kernel: MoE-style epilogue replaces broken SMEM merge biondizzle 2026-06-01 09:59:34 +00:00
31ebe4f2db Wire NVFP4 fused router kernel into e2e single-shot pipeline biondizzle 2026-06-01 09:47:48 +00:00
d9d3ca42b0 Fix: mma_tiler and cluster_layout must use MLIR values for cute.slice_ biondizzle 2026-06-01 09:42:17 +00:00
ec79f30709 Fix: PersistentTileSchedulerParams cluster_shape must be Python ints not MLIR values biondizzle 2026-06-01 09:38:08 +00:00
28d0cb4f41 Revert cutlass.Int32 wrapping — now inside @cute.jit, cute.round_up works biondizzle 2026-06-01 09:35:03 +00:00
b536f99192 CRITICAL FIX: move ALL CuTe DSL setup inside @cute.jit context biondizzle 2026-06-01 09:32:05 +00:00
65669596d4 Fix: all CuTe shape values must be cutlass.Int32 for MLIR compatibility biondizzle 2026-06-01 09:30:15 +00:00
df48dacc2b Fix: set mma_inst_shape_mn in __init__ before _create_tiled_mma call biondizzle 2026-06-01 09:22:24 +00:00
28f78420c2 Fix: quantize_activation_nvfp4 API - correct signature and return values biondizzle 2026-06-01 09:21:04 +00:00
7b3f6cb13c Fix fused router: use run_nvfp4_fused_router wrapper, correct CuTe tensor API biondizzle 2026-06-01 09:19:48 +00:00
483e759d53 Fix: use tensor.mark_layout_dynamic() method (not cute.mark_layout_dynamic) biondizzle 2026-06-01 09:16:33 +00:00
2412745b21 Test fix: slice NVFP4 logits to actual expert count (GEMM padding) biondizzle 2026-06-01 09:15:06 +00:00
f33ca41c2a Fused router: replace nested if/else top-k with flat find-min-replace approach biondizzle 2026-06-01 09:13:53 +00:00
4f4ae8febd Test: enumerate CuTeDSL math API to check available operations biondizzle 2026-06-01 09:11:29 +00:00
9b86b2b414 Test: fix fused router test - proper NVFP4 quantization and CuTe tensor setup biondizzle 2026-06-01 08:56:20 +00:00
b94f8d4ed8 Test: fused router kernel vs BF16 reference path biondizzle 2026-06-01 08:54:24 +00:00
2433700a69 Fused router kernel: rewrite epilogue with proper CuTeDSL constructs biondizzle 2026-06-01 08:49:39 +00:00
d01b4b02de Complete NVFP4 fused router kernel: full MMA + router epilogue biondizzle 2026-06-01 08:37:10 +00:00
25b9a5f32d Fix test: use from_dlpack for c_tensor biondizzle 2026-06-01 07:55:29 +00:00
d2819fc39c Fix test: use as_tensor instead of make_tensor biondizzle 2026-06-01 07:54:36 +00:00
5ea71ebd78 Add NVFP4 CuTeDSL compilation test (verify MmaMXF4NVF4Op compiles) biondizzle 2026-06-01 07:53:43 +00:00
fa6dbd4aa2 WIP: Rewrite NVFP4 fused router in CuTeDSL with MmaMXF4NVF4Op (sf_vec_size=16) biondizzle 2026-06-01 07:53:21 +00:00
4f706b55d7 Remove raw CUDA C++ fused router and DeepGEMM (MXFP4, wrong instruction) biondizzle 2026-06-01 07:51:31 +00:00
424fe6bf2c Fix: use SM100_MMA_MXF8F6F4_SS (not MXF4) to match Nvfp4Linear path biondizzle 2026-06-01 07:44:53 +00:00
2e2caadf7d WIP: NVFP4 fused router kernel in raw CUDA C++ using DeepGEMM primitives biondizzle 2026-06-01 07:41:42 +00:00
e3ea609ddd Embed DeepGEMM source (not submodule) for SM100 raw CUDA GEMM primitives biondizzle 2026-06-01 07:39:40 +00:00
dae83723a3 Add DeepGEMM as third-party dependency for SM100 raw CUDA GEMM primitives biondizzle 2026-06-01 07:39:38 +00:00
ef4c0ad489 Fix BF16 router mma_tiler: use cutlass.Int32 for CuTe DSL compatibility biondizzle 2026-06-01 07:29:30 +00:00
79be9cb8da Fix: hardcode mma_inst_shape_k=32 for NVFP4 (avoids MLIR unpack error in JIT) biondizzle 2026-06-01 07:20:23 +00:00
c3a64ceed7 Fix: mma_tiler must use CuTe Ints for static layout construction biondizzle 2026-06-01 07:19:15 +00:00
39b481e52b Ensure mma_tiler contains CuTe Ints for cute.slice_ compatibility biondizzle 2026-06-01 07:16:47 +00:00
57cc20d5ad Fix SFA/SFB SMEM: blockscaled layouts are plain Layout (no .outer/.inner swizzle) biondizzle 2026-06-01 07:14:45 +00:00
fcd7680583 Fix CuTe tensor creation: use from_dlpack + mark_layout_dynamic biondizzle 2026-06-01 07:12:52 +00:00
3a8c6daeb3 Fix: cutlass_torch.make_tensor -> as_tensor biondizzle 2026-06-01 07:11:43 +00:00
0553117af6 Simplify fused router test: compare fused vs 2-kernel NVFP4 path biondizzle 2026-06-01 07:10:55 +00:00
44a0e59808 Fix fused router test: use quantize_weight_to_nvfp4 (correct function name) biondizzle 2026-06-01 07:08:56 +00:00
940f37fb6c NVFP4 fused router kernel: full rewrite with proper block-scaled GEMM setup v-nvfp4-fused-router-rewrite-20260601-0715 biondizzle 2026-06-01 07:08:12 +00:00
8658c8eca5 fix: add sf_vec_size parameter back to Nvfp4FusedRouterKernel __init__ biondizzle 2026-06-01 07:01:02 +00:00
b97f30e289 fix: store sf_vec_size as instance variable biondizzle 2026-06-01 06:56:33 +00:00
c225d195ea fix: remove tcgen05.mma.Kind (doesn't exist), use make_blockscaled_trivial_tiled_mma biondizzle 2026-06-01 06:54:49 +00:00
e6803b450d rewrite: simplified fused router test (reference + import check) biondizzle 2026-06-01 06:53:17 +00:00
262cec262d fix: add shape assertions to fused router test biondizzle 2026-06-01 06:51:47 +00:00
db07d17a62 fix: set activation global scale in fused router test biondizzle 2026-06-01 06:50:41 +00:00
2abb4a19d9 fix: set gs and ws2 fields for Nvfp4Linear in fused router test biondizzle 2026-06-01 06:49:43 +00:00
61c04f7152 fix: Nvfp4Linear field is sf not scale_b biondizzle 2026-06-01 06:48:39 +00:00
982f245c67 fix: use correct Nvfp4Linear field names (fp4, scale_b, gsb) biondizzle 2026-06-01 06:47:15 +00:00
16af96380f fix: use internal fields for Nvfp4Linear weight setup in test biondizzle 2026-06-01 06:46:05 +00:00
7f1f224c78 fix: quantize_weight_to_nvfp4 returns 3 values, not 4 biondizzle 2026-06-01 06:43:53 +00:00
27fd847dd0 fix: correct quantize function name in fused router test biondizzle 2026-06-01 06:41:54 +00:00
0873d65253 test: add fused router kernel test biondizzle 2026-06-01 06:40:46 +00:00
90b2581dfe feat: NVFP4 fused router CuTeDSL kernel (WIP) biondizzle 2026-06-01 06:40:21 +00:00
6c28c57b6a feat: Nvfp4GroupedLinear for o_a_proj (replaces BF16 grouped BMM) v-nvfp4-router-oa-20260601-0610 biondizzle 2026-06-01 06:00:36 +00:00
cf2b7ab7ec feat: NVFP4 gate projection for router (replaces BF16 cuBLAS) biondizzle 2026-06-01 05:58:56 +00:00
9f14cb17d1 test: add compressor position_bias unit test biondizzle 2026-06-01 05:55:05 +00:00
84ca520bfb fix: move compressor position_bias into CUDA kernel (was Python loop) biondizzle 2026-06-01 05:54:44 +00:00
311fae490f tune: reduce verbose diagnostics, print every decode step v-e2e-paris-32tok-20260601-0549 biondizzle 2026-06-01 05:40:48 +00:00
df8acae66b fix: rewrite compressor_reduce.cu — no extern shared mem, proper bounds checks v-single-shot-paris-20260601-0539 biondizzle 2026-06-01 05:24:18 +00:00
62041b78bf fix: import torch.utils.cpp_extension explicitly in production_compress biondizzle 2026-06-01 05:20:44 +00:00
2155fd6c90 test: production compressor kernel unit test biondizzle 2026-06-01 05:19:13 +00:00
b380028c49 feat: production compressor/indexer — NVFP4 GEMM + CUDA softmax/reduce kernel biondizzle 2026-06-01 05:18:59 +00:00
6e53e3007c fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow v-working-e2e-20260601-0515 biondizzle 2026-06-01 04:59:06 +00:00
eb9c46f8cb test: quantize on different GPUs biondizzle 2026-06-01 04:48:30 +00:00
9ce7304783 test: direct SE L1 test on different GPUs biondizzle 2026-06-01 04:43:48 +00:00
ce608d0e50 test: fix gemm 1-group test params biondizzle 2026-06-01 04:40:07 +00:00
c652177970 test: fix gemm 1-group test biondizzle 2026-06-01 04:35:55 +00:00
793f062bbc auto: pre-test push for test_gemm_1group.py biondizzle 2026-06-01 04:32:29 +00:00
86cb0e64a6 auto: pre-test push for test_se_dequant.py biondizzle 2026-06-01 04:30:37 +00:00
9ba051cf49 test: fix gsa in SE multi-GPU test biondizzle 2026-06-01 04:26:03 +00:00
419112dd3e auto: pre-test push for test_se_multi_gpu.py biondizzle 2026-06-01 04:22:38 +00:00
2cbc7459b0 diag: fix SE scale print (cast to float first) biondizzle 2026-06-01 04:14:47 +00:00
bcd7a0cf0d diag: check SE weight and scale integrity for first 3 layers biondizzle 2026-06-01 04:08:21 +00:00
8ad617e2ff diag: NaN detection in shared expert gate/up split biondizzle 2026-06-01 04:01:46 +00:00
a53936a17c diag: print l1_out shape warning in shared expert biondizzle 2026-06-01 03:54:29 +00:00
db30c4acd6 auto: pre-test push for test_se_gpu.py biondizzle 2026-06-01 03:50:53 +00:00
3dd95ce77b fix: set activation global scales AFTER _ensure_stacked/_ensure_initialized (which override them) biondizzle 2026-06-01 03:43:09 +00:00
27c63b01d6 diag: remove broken SE reference comparison, add gsa/gsb print biondizzle 2026-06-01 03:31:36 +00:00
9a27ed21fd diag: compare shared expert output with PyTorch reference biondizzle 2026-06-01 03:25:21 +00:00
ee8318ad58 diag: handle NaN in shared expert output print biondizzle 2026-06-01 03:16:25 +00:00
7000762309 diag: fix SE weight attribute name biondizzle 2026-06-01 03:09:11 +00:00
fba1c06cad diag: check SE weight integrity biondizzle 2026-06-01 03:02:44 +00:00
22d7cc9b7a diag: cuda sync check after shared expert for first 3 layers biondizzle 2026-06-01 02:56:28 +00:00
b85fcf4d6f diag: print SE global scales for first 3 layers biondizzle 2026-06-01 02:49:55 +00:00
48d93a6d2e diag: MoE input/output diagnostics for first 3 layers biondizzle 2026-06-01 02:41:12 +00:00
856a459a98 fix: init l1_gsa_list and l2_gsa_list biondizzle 2026-06-01 02:34:21 +00:00
66b98e5794 fix: MoE and shared expert global scale — gsb=ws2, gsa=input_scale (same bug as Nvfp4Linear) biondizzle 2026-06-01 02:31:12 +00:00
f4b444b456 fix: NVFP4 global scale bug — gsb=weight_scale_2 (not input_scale*ws2), gsa=input_scale biondizzle 2026-06-01 02:19:35 +00:00
1eed28dd09 diag: compare production FMHA and NVFP4 linear output with PyTorch reference biondizzle 2026-06-01 02:12:39 +00:00
df394f8b40 fix: missing closing quote on string literal biondizzle 2026-06-01 02:02:14 +00:00
cfd2468c61 fix: decode loop also needs int32 token_ids for hash router biondizzle 2026-06-01 01:58:45 +00:00
905623793b fix: move token_ids to same GPU as router (was cuda:0 but router on cuda:N) biondizzle 2026-06-01 01:49:40 +00:00
