nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	9b86b2b414	Test: fix fused router test - proper NVFP4 quantization and CuTe tensor setup - Use quantize_to_nvfp4 for weight quantization - Use quantize_activation_nvfp4 with computed global_scale - Get mat_b and scale_b from Nvfp4Linear after finalize_weights - Compare against both BF16 reference and NVFP4 GEMM reference	2026-06-01 08:56:20 +00:00
biondizzle	b94f8d4ed8	Test: fused router kernel vs BF16 reference path - BF16 GEMM + activation_topk as reference - NVFP4 GEMM + fused router epilogue as test target - Proper NVFP4 quantization and CuTe tensor creation - Cosine similarity and topk_ids matching validation	2026-06-01 08:54:24 +00:00
biondizzle	2433700a69	Fused router kernel: rewrite epilogue with proper CuTeDSL constructs - Replace Python lists with individual scalar variables (s0..s5, i0..i5, a0..a5) - Replace min-heap sift-down with fully unrolled sorted insertion (descending order, no dynamic indexing, no while loops) - Replace raw SMEM pointer arithmetic with CuTeDSL SMEM tensors (s_merge_s, s_merge_i, s_merge_a) - Replace cute.where with cute.math.fmax - Fix expert index calculation: col + tile_n_offset + subtile_idx * epi_n - Top-6 accumulates across all N-tiles (for E=384 with 3 tiles of 128) - Add iter_acc_early_release for overlapping accumulator - Rewrite test to compare fused kernel vs 2-kernel reference path - Remove stale memory doc	2026-06-01 08:49:39 +00:00
biondizzle	d01b4b02de	Complete NVFP4 fused router kernel: full MMA + router epilogue - TMA warp: persistent tile scheduling + TMA loads for A/B/SFA/SFB - MMA warp: blockscaled GEMM (tcgen05.mma.block_scale) with S2T copy for SFA/SFB, proper pipeline synchronization (AB + Acc pipelines) - Epilogue warps: TMEM->register via epilogue_tmem_copy_and_partition, sqrt(softplus) + e_bias + min-heap top-k + renormalization - Python wrapper: run_nvfp4_fused_router() with proper CuTe tensor creation via from_dlpack + mark_layout_dynamic - Single-kernel path, no BF16 fallback, no intermediate GMEM buffer - Following exact patterns from MoE fused_swiglu.py kernel	2026-06-01 08:37:10 +00:00
biondizzle	25b9a5f32d	Fix test: use from_dlpack for c_tensor	2026-06-01 07:55:29 +00:00
biondizzle	d2819fc39c	Fix test: use as_tensor instead of make_tensor	2026-06-01 07:54:36 +00:00
biondizzle	5ea71ebd78	Add NVFP4 CuTeDSL compilation test (verify MmaMXF4NVF4Op compiles)	2026-06-01 07:53:43 +00:00
biondizzle	fa6dbd4aa2	WIP: Rewrite NVFP4 fused router in CuTeDSL with MmaMXF4NVF4Op (sf_vec_size=16) Uses kind::mxf4nvf4 — native NVF4 with E2M1 microscales, 16-elem blocks. NO MXFP4, NO CONVERSIONS. Kernel incomplete — GEMM mainloop mirrors dense.py but epilogue is TODO. Need to verify CuTeDSL compilation works with proper PipelineTmaUmma/ PipelineUmmaAsync abstractions before adding top-k epilogue.	2026-06-01 07:53:21 +00:00
biondizzle	4f706b55d7	Remove raw CUDA C++ fused router and DeepGEMM (MXFP4, wrong instruction) DeepGEMM uses kind::mxf4.block_scale.block32 (MXFP4, UE8M0 scales, 32-elem blocks). DSV4 uses NVF4: kind::mxf4nvf4 (E2M1 microscales, 16-elem blocks). Using MXFP4 would require E2M1->UE8M0 conversion. NO CONVERSIONS. Rewriting fused router in CuTeDSL with MmaMXF4NVF4Op (sf_vec_size=16).	2026-06-01 07:51:31 +00:00
biondizzle	424fe6bf2c	Fix: use SM100_MMA_MXF8F6F4_SS (not MXF4) to match Nvfp4Linear path MXF4 has .block32 hardcoded. MXF8F6F4 matches what CuTeDSL generates via make_instr_desc_block_scaled. Both use E2M1 data + UE8M0 scales at hardware level. NVFP4 E2M1 microscales are combined into UE8M0 during quantization — no MXFP4 conversion.	2026-06-01 07:44:53 +00:00
biondizzle	2e2caadf7d	WIP: NVFP4 fused router kernel in raw CUDA C++ using DeepGEMM primitives - nvfp4_fused_router_kernel.cuh: 1-CTA NVFP4 GEMM + sqrt(softplus) + top-k epilogue - Uses DeepGEMM SM100 primitives: SM100_MMA_MXF4_SS, UTCCP, UMMA descriptors - 4 warp roles: TMA load, UTCCP transpose, MMA issue, epilogue - nvfp4_fused_router_cuda.py: Python wrapper (TMA descriptor setup TBD) NOT YET COMPILING - needs: 1. SMEM layout fix (single extern __shared__) 2. TMA descriptor creation (cuTensorMapEncodeTiled) 3. Top-k cross-warp merge completion 4. FP4 tensor format alignment with DeepGEMM	2026-06-01 07:41:42 +00:00
biondizzle	e3ea609ddd	Embed DeepGEMM source (not submodule) for SM100 raw CUDA GEMM primitives	2026-06-01 07:39:40 +00:00
biondizzle	dae83723a3	Add DeepGEMM as third-party dependency for SM100 raw CUDA GEMM primitives	2026-06-01 07:39:38 +00:00
biondizzle	ef4c0ad489	Fix BF16 router mma_tiler: use cutlass.Int32 for CuTe DSL compatibility	2026-06-01 07:29:30 +00:00
biondizzle	79be9cb8da	Fix: hardcode mma_inst_shape_k=32 for NVFP4 (avoids MLIR unpack error in JIT)	2026-06-01 07:20:23 +00:00
biondizzle	c3a64ceed7	Fix: mma_tiler must use CuTe Ints for static layout construction	2026-06-01 07:19:15 +00:00
biondizzle	39b481e52b	Ensure mma_tiler contains CuTe Ints for cute.slice_ compatibility	2026-06-01 07:16:47 +00:00
biondizzle	57cc20d5ad	Fix SFA/SFB SMEM: blockscaled layouts are plain Layout (no .outer/.inner swizzle)	2026-06-01 07:14:45 +00:00
biondizzle	fcd7680583	Fix CuTe tensor creation: use from_dlpack + mark_layout_dynamic	2026-06-01 07:12:52 +00:00
biondizzle	3a8c6daeb3	Fix: cutlass_torch.make_tensor -> as_tensor	2026-06-01 07:11:43 +00:00
biondizzle	0553117af6	Simplify fused router test: compare fused vs 2-kernel NVFP4 path	2026-06-01 07:10:55 +00:00
biondizzle	44a0e59808	Fix fused router test: use quantize_weight_to_nvfp4 (correct function name)	2026-06-01 07:08:56 +00:00
biondizzle	940f37fb6c	NVFP4 fused router kernel: full rewrite with proper block-scaled GEMM setup Major fixes: - Added tiled_mma_sfb creation (always CtaGroup.ONE, rounded N) - Added mma_tiler_sfb, cta_tile_shape_mnk_sfb, cluster_layout_sfb_vmnk - Use blockscaled_utils.make_smem_layout_sfa/sfb (with sf_vec_size) instead of sm100_utils (which doesn't support block-scaled SF layouts) - Proper TMEM column accounting for SFA + SFB + accumulator - Fixed make_blockscaled_trivial_tiled_mma argument order (a_dtype, b_dtype, a_major, b_major, sf_dtype, sf_vec_size, cta_group, mma_inst_shape) - Fixed SFB TMA atom to use tiled_mma_sfb and cluster_layout_sfb_vmnk - Fixed SFB partition_SFB to use tiled_mma_sfb.get_slice - Fixed SFB global tile partitioning to use mma_tiler_sfb - Fixed mainloop_s2t_copy_and_partition to use TMEM fragments (make_fragment_SFA/SFB) as the tSF parameter - Updated run_nvfp4_fused_router wrapper to accept processed weight tensors from Nvfp4Linear._mat_b and _scale_b - Updated test to properly build Nvfp4Linear and use processed weights The old code was a rough sketch that never worked — it was missing the entire tiled_mma_sfb infrastructure, used wrong SMEM layout functions, and had broken TMA atom setup for scale factors. v-nvfp4-fused-router-rewrite-20260601-0715	2026-06-01 07:08:12 +00:00
biondizzle	8658c8eca5	fix: add sf_vec_size parameter back to Nvfp4FusedRouterKernel __init__	2026-06-01 07:01:02 +00:00
biondizzle	b97f30e289	fix: store sf_vec_size as instance variable	2026-06-01 06:56:33 +00:00
biondizzle	c225d195ea	fix: remove tcgen05.mma.Kind (doesn't exist), use make_blockscaled_trivial_tiled_mma	2026-06-01 06:54:49 +00:00
biondizzle	e6803b450d	rewrite: simplified fused router test (reference + import check)	2026-06-01 06:53:17 +00:00
biondizzle	262cec262d	fix: add shape assertions to fused router test	2026-06-01 06:51:47 +00:00
biondizzle	db07d17a62	fix: set activation global scale in fused router test	2026-06-01 06:50:41 +00:00
biondizzle	2abb4a19d9	fix: set gs and ws2 fields for Nvfp4Linear in fused router test	2026-06-01 06:49:43 +00:00
biondizzle	61c04f7152	fix: Nvfp4Linear field is sf not scale_b	2026-06-01 06:48:39 +00:00
biondizzle	982f245c67	fix: use correct Nvfp4Linear field names (fp4, scale_b, gsb)	2026-06-01 06:47:15 +00:00
biondizzle	16af96380f	fix: use internal fields for Nvfp4Linear weight setup in test	2026-06-01 06:46:05 +00:00
biondizzle	7f1f224c78	fix: quantize_weight_to_nvfp4 returns 3 values, not 4	2026-06-01 06:43:53 +00:00
biondizzle	27fd847dd0	fix: correct quantize function name in fused router test	2026-06-01 06:41:54 +00:00
biondizzle	0873d65253	test: add fused router kernel test Compares NVFP4 fused CuTeDSL kernel against reference (Nvfp4Linear + activation_topk) for correctness.	2026-06-01 06:40:46 +00:00
biondizzle	90b2581dfe	feat: NVFP4 fused router CuTeDSL kernel (WIP) Single-kernel NVFP4 block-scaled GEMM + fused sqrt(softplus) + top-k epilogue. Avoids materializing intermediate FP32 logits to GMEM. Architecture: 6-warp specialization - Warp 5 (TMA): Load A, B, SFA, SFB from GMEM → SMEM - Warp 4 (MMA): NVFP4 block-scaled GEMM → FP32 accumulator in TMEM - Warps 0-3 (EPI): TMEM → registers → sqrt(softplus) + bias + top-k → GMEM Epilogue maintains per-thread min-heap across N subtiles, then merges all 128 threads' heaps in SMEM for final top-k selection. Mirrors Sm100BlockScaledPersistentDenseGemmKernel structure for TMA/MMA/SFA/SFB handling, with custom top-k epilogue replacing the standard SwiGLU + TMA store path. NOTE: This is WIP — needs compilation testing on B200. Several API details (tiled_mma_sfb, cluster_layout_sfb_vmnk) need to be passed through the kernel parameters properly.	2026-06-01 06:40:21 +00:00
biondizzle	6c28c57b6a	feat: Nvfp4GroupedLinear for o_a_proj (replaces BF16 grouped BMM) The attention output projection first half (wo_a) was using BF16 grouped BMM (torch.bmm). Now uses production Nvfp4GroupedLinear which performs the same grouped GEMM with NVFP4 tensor-core acceleration on Blackwell. The weight is loaded from NVFP4 checkpoint if available, otherwise quantized from BF16 via set_bf16_weight(). Also includes: - NVFP4 gate projection for router (from previous commit) - Compressor position_bias in CUDA kernel (from earlier fix) v-nvfp4-router-oa-20260601-0610	2026-06-01 06:00:36 +00:00
biondizzle	cf2b7ab7ec	feat: NVFP4 gate projection for router (replaces BF16 cuBLAS) The dense router now uses NVFP4 GEMM via Nvfp4Linear for the gate projection when NVFP4 scales are available in the checkpoint. This replaces the BF16 cuBLAS GEMM with Blackwell SM100 tensor-core NVFP4 acceleration. Changes: - dsv4/layers/router.py: add gate_lin (Nvfp4Linear) alongside W_gate fallback. New load_nvfp4_gate() method. - dsv4/kernels/router/dense_router_decode.py: add dense_router_dispatch_nvfp4() using Nvfp4Linear + activation_topk - dsv4/kernels/router/__init__.py: export new function - single_shot_inference.py: load NVFP4 gate weights when available, fall back to BF16 when not	2026-06-01 05:58:56 +00:00
biondizzle	9f14cb17d1	test: add compressor position_bias unit test Verifies CUDA kernel matches PyTorch reference with and without position_bias for both CSA (m=4) and HCA (m=128) paths.	2026-06-01 05:55:05 +00:00
biondizzle	84ca520bfb	fix: move compressor position_bias into CUDA kernel (was Python loop) The compressor_reduce.cu kernel now adds position_bias to BOTH kv and gate values, matching the PyTorch reference. Previously the kernel only added it to gate, and a Python workaround loop was adding it to both before the kernel call (then passing None to the kernel). Changes: - compressor_reduce.cu: add position_bias to kv_val in pass 2 (CSA + HCA) - single_shot_inference.py: remove Python position_bias loop, pass self.ape directly to csa/hca_compress_production - production_compress.py: already supports position_bias passthrough	2026-06-01 05:54:44 +00:00
biondizzle	311fae490f	tune: reduce verbose diagnostics, print every decode step v-e2e-paris-32tok-20260601-0549	2026-06-01 05:40:48 +00:00
biondizzle	df8acae66b	fix: rewrite compressor_reduce.cu — no extern shared mem, proper bounds checks v-single-shot-paris-20260601-0539	2026-06-01 05:24:18 +00:00
biondizzle	62041b78bf	fix: import torch.utils.cpp_extension explicitly in production_compress	2026-06-01 05:20:44 +00:00
biondizzle	2155fd6c90	test: production compressor kernel unit test	2026-06-01 05:19:13 +00:00
biondizzle	b380028c49	feat: production compressor/indexer — NVFP4 GEMM + CUDA softmax/reduce kernel - New compressor_reduce.cu: CSA/HCA token-level softmax + weighted sum + kv_norm One block per compressed entry, 128 threads, FP32 accumulation CSA: overlapping Ca/Cb streams (2m tokens per block) HCA: single stream (m tokens per block) Includes apply_kv_norm kernel (unweighted RMSNorm + weight) - New production_compress.py: Python wrapper for CUDA kernels - single_shot_inference.py: Compressor/Indexer now use production Nvfp4Linear for kv_proj, gate_proj, q_b_proj, weights_proj projections Then CUDA reduce kernel for softmax + weighted sum No more PyTorch reference nvfp4_linear_ref in compressor/indexer path	2026-06-01 05:18:59 +00:00
biondizzle	6e53e3007c	fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow v-working-e2e-20260601-0515	2026-06-01 04:59:06 +00:00
biondizzle	eb9c46f8cb	test: quantize on different GPUs	2026-06-01 04:48:30 +00:00
biondizzle	9ce7304783	test: direct SE L1 test on different GPUs	2026-06-01 04:43:48 +00:00
biondizzle	ce608d0e50	test: fix gemm 1-group test params	2026-06-01 04:40:07 +00:00

1 2 3 4 5 ...

2077 Commits