nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	e0f60b9f05	Fix fused router: plain ints for mma_tiler + @cute.jit pattern Root cause of previous crash: cutlass.Int32(128) wrapping of mma_inst_shape_mn caused _unpack_x_tuple to fail in cute.size(tiled_mma.shape_mnk, mode=[2]). The fused_swiglu kernel uses plain Python ints for mma_tiler_mnk and mma_inst_shape_mn — NOT cutlass.Int32. Inside @cute.jit, CuTeDSL auto-converts plain ints to MLIR values. The Int32 wrapping was unnecessary and actually harmful. Pattern: same as fused_swiglu.py __call__: - @cute.jit compiled_fn takes CuTe tensors - _setup_attributes called inside JIT (needs MLIR context) - cute.compile at the end	2026-06-01 10:37:15 +00:00
biondizzle	057ae2101e	CRITICAL FIX: Move tiled_mma creation and _setup_attributes OUTSIDE @cute.jit The _setup_attributes() calls cute.size(tiled_mma.shape_mnk, mode=[2]) which requires host-side execution. Inside @cute.jit, tiled_mma.shape_mnk returns MLIR values that can't be unpacked by cute.size(). This follows the fused_swiglu.py pattern exactly: setup on host side, then pass everything to the kernel. Removed @cute.jit wrapper entirely in favor of direct kernel launch (same as fused_swiglu).	2026-06-01 10:28:01 +00:00
biondizzle	71deeb91a9	Quantize BF16 gate weight to NVFP4 for fused router + add global scales to GEMM CRITICAL: Checkpoint stores gate weights as BF16, not NVFP4. Previous code fell back to BF16 cuBLAS because weight_scale was missing. Now we quantize the BF16 gate weight to NVFP4 at load time using quantize_to_nvfp4() and pass the result to the fused router kernel. Also added global scale (gsa, gsb) parameters to the kernel: - gsa (activation global scale) applied during activation quantization - gsb (weight global scale) applied in epilogue before sqrt(softplus) - The MMA output is (A * SFA) @ (B * SFB), missing gsa * gsb - Epilogue now computes sqrt(softplus(logit * gsa * gsb)) instead of sqrt(softplus(logit))	2026-06-01 10:14:29 +00:00
biondizzle	24fed15ed6	Fix: convert PyTorch tensors to CuTe tensors for fused router kernel - Added cutlass_torch.from_dlpack() + mark_layout_dynamic() conversions - quantize_activation_nvfp4 returns (fp4_packed, fp8_scales) which are converted to CuTe tensors before passing to the kernel - Same pattern as gemm_runner.py	2026-06-01 10:02:40 +00:00
biondizzle	bab748763e	Rewrite NVFP4 fused router kernel: MoE-style epilogue replaces broken SMEM merge CRITICAL REWRITE of nvfp4_fused_router_kernel.py: - REMOVED: Raw pointer SMEM merge (storage.merge_scores.data_ptr()[idx] = val) This crashed the CuTeDSL MLIR optimizer. Never use raw pointer indexing inside CuTeDSL kernels. - REMOVED: Per-thread top-k accumulation + 128-thread SMEM merge. Too complex for MLIR, caused SIGABRT during compilation. - ADDED: MoE-style epilogue (TMEM→regs→activation→SMEM→TMA store→GMEM) using paired copy atoms from CUTLASS (epilogue_tmem_copy_and_partition + epilogue_smem_copy_and_partition). Structurally identical to the proven FusedSwiGLUScaledGroupedGemmKernel epilogue. This SHOULD compile. - Activation: sqrt(softplus(logit)) in registers (replaces SwiGLU) - Output: FP32 activated scores written to GMEM via TMA store - Top-k handled by activation_topk CUDA kernel in Python wrapper Other changes: - _activation_topk.py: Added run_fused_activation_topk_pre_activated() for top-k + renorm on pre-activated scores (PyTorch reference, not CUDA kernel) - dense_router_dispatch_nvfp4_fused: Updated to match new kernel API - Kernel now uses standard _compute_stages() for SMEM budget calculation - Kernel now uses compute_epilogue_tile_shape() for epi_tile (not hardcoded) - C pipeline (PipelineTmaStore) added for SMEM→GMEM overlap	2026-06-01 09:59:34 +00:00
biondizzle	31ebe4f2db	Wire NVFP4 fused router kernel into e2e single-shot pipeline - Add dense_router_dispatch_nvfp4_fused() in dense_router_decode.py: single-kernel NVFP4 blockscaled GEMM + fused router epilogue - Router.load_nvfp4_fused_gate(): stores raw NVFP4 tensors for fused path - Router._run_dense_impl() dispatch priority: fused > 2-kernel > BF16 - single_shot_inference.py: loads raw NVFP4 gate weights for fused kernel instead of building Nvfp4Linear (which was the 2-kernel path) - Fix selection sort bug in nvfp4_fused_router_kernel.py: pass 0 was missing t_s/t_i/t_a temp save before swap, causing undefined vars - Export dense_router_dispatch_nvfp4_fused from __init__.py	2026-06-01 09:47:48 +00:00
biondizzle	d9d3ca42b0	Fix: mma_tiler and cluster_layout must use MLIR values for cute.slice_ cute.slice_ on Python int tuples fails. All values in mma_tiler and cluster_layout need to be cutlass.Int32() since they flow into cute.slice_ and cute.local_tile inside @cute.kernel. Now consistent: mma_inst_shape_mn, mma_tiler, cluster_layout_vmnk all use MLIR-typed values created inside @cute.jit context.	2026-06-01 09:42:17 +00:00
biondizzle	ec79f30709	Fix: PersistentTileSchedulerParams cluster_shape must be Python ints not MLIR values	2026-06-01 09:38:08 +00:00
biondizzle	28d0cb4f41	Revert cutlass.Int32 wrapping — now inside @cute.jit, cute.round_up works All CuTe DSL calls now happen inside @cute.jit context, so cute.round_up and all layout operations have proper MLIR context. No need for manual Int32 wrapping or Python math workarounds.	2026-06-01 09:35:03 +00:00
biondizzle	b536f99192	CRITICAL FIX: move ALL CuTe DSL setup inside @cute.jit context The root cause of ALL the MLIR crashes: _create_tiled_mma and _setup_attributes call cute.make_tiled_mma, sm100_utils.make_smem_layout_a, etc. These are MLIR operations that REQUIRE an active MLIR context. Previously they ran in run() OUTSIDE @cute.jit, so there was no MLIR context — causing 'Expected an MLIR object (got None)' in _pack_shape. Now ALL CuTe DSL calls happen INSIDE the @cute.jit function, matching fused_swiglu's pattern where __call__ is called from JIT context. Grid computation uses plain Python math (no MLIR needed).	2026-06-01 09:32:05 +00:00
biondizzle	65669596d4	Fix: all CuTe shape values must be cutlass.Int32 for MLIR compatibility Python ints cause 'Expected an MLIR object (got None)' in _pack_shape. This is the same fix we applied to the FMHA kernel mma_tiler. All mma_inst_shape, mma_tiler, cluster_shape values now use cutlass.Int32().	2026-06-01 09:30:15 +00:00
biondizzle	df48dacc2b	Fix: set mma_inst_shape_mn in __init__ before _create_tiled_mma call	2026-06-01 09:22:24 +00:00
biondizzle	28f78420c2	Fix: quantize_activation_nvfp4 API - correct signature and return values	2026-06-01 09:21:04 +00:00
biondizzle	7b3f6cb13c	Fix fused router: use run_nvfp4_fused_router wrapper, correct CuTe tensor API - kernel wrapper converts torch tensors to CuTe tensors with mark_layout_dynamic - test uses the wrapper instead of calling kernel.run() directly - mat_b/scale_b are now torch tensors (converted inside wrapper)	2026-06-01 09:19:48 +00:00
biondizzle	483e759d53	Fix: use tensor.mark_layout_dynamic() method (not cute.mark_layout_dynamic)	2026-06-01 09:16:33 +00:00
biondizzle	f33ca41c2a	Fused router: replace nested if/else top-k with flat find-min-replace approach The 5-level nested if/else for sorted insertion created O(2^5) MLIR regions that crashed the CuTeDSL MLIR optimizer (SIGABRT). New approach: - Find-min-replace: scan 6 entries to find minimum (sequential, 1-level nesting) - Replace the minimum if new score > min (flat conditionals by index) - Selection sort the final 6 entries after SMEM merge (descending order) - All conditionals are FLAT (at most 1 level of nesting) This should avoid the MLIR optimizer explosion while producing identical results.	2026-06-01 09:13:53 +00:00
biondizzle	4f4ae8febd	Test: enumerate CuTeDSL math API to check available operations	2026-06-01 09:11:29 +00:00
biondizzle	2433700a69	Fused router kernel: rewrite epilogue with proper CuTeDSL constructs - Replace Python lists with individual scalar variables (s0..s5, i0..i5, a0..a5) - Replace min-heap sift-down with fully unrolled sorted insertion (descending order, no dynamic indexing, no while loops) - Replace raw SMEM pointer arithmetic with CuTeDSL SMEM tensors (s_merge_s, s_merge_i, s_merge_a) - Replace cute.where with cute.math.fmax - Fix expert index calculation: col + tile_n_offset + subtile_idx * epi_n - Top-6 accumulates across all N-tiles (for E=384 with 3 tiles of 128) - Add iter_acc_early_release for overlapping accumulator - Rewrite test to compare fused kernel vs 2-kernel reference path - Remove stale memory doc	2026-06-01 08:49:39 +00:00
biondizzle	d01b4b02de	Complete NVFP4 fused router kernel: full MMA + router epilogue - TMA warp: persistent tile scheduling + TMA loads for A/B/SFA/SFB - MMA warp: blockscaled GEMM (tcgen05.mma.block_scale) with S2T copy for SFA/SFB, proper pipeline synchronization (AB + Acc pipelines) - Epilogue warps: TMEM->register via epilogue_tmem_copy_and_partition, sqrt(softplus) + e_bias + min-heap top-k + renormalization - Python wrapper: run_nvfp4_fused_router() with proper CuTe tensor creation via from_dlpack + mark_layout_dynamic - Single-kernel path, no BF16 fallback, no intermediate GMEM buffer - Following exact patterns from MoE fused_swiglu.py kernel	2026-06-01 08:37:10 +00:00
biondizzle	fa6dbd4aa2	WIP: Rewrite NVFP4 fused router in CuTeDSL with MmaMXF4NVF4Op (sf_vec_size=16) Uses kind::mxf4nvf4 — native NVF4 with E2M1 microscales, 16-elem blocks. NO MXFP4, NO CONVERSIONS. Kernel incomplete — GEMM mainloop mirrors dense.py but epilogue is TODO. Need to verify CuTeDSL compilation works with proper PipelineTmaUmma/ PipelineUmmaAsync abstractions before adding top-k epilogue.	2026-06-01 07:53:21 +00:00
biondizzle	4f706b55d7	Remove raw CUDA C++ fused router and DeepGEMM (MXFP4, wrong instruction) DeepGEMM uses kind::mxf4.block_scale.block32 (MXFP4, UE8M0 scales, 32-elem blocks). DSV4 uses NVF4: kind::mxf4nvf4 (E2M1 microscales, 16-elem blocks). Using MXFP4 would require E2M1->UE8M0 conversion. NO CONVERSIONS. Rewriting fused router in CuTeDSL with MmaMXF4NVF4Op (sf_vec_size=16).	2026-06-01 07:51:31 +00:00
biondizzle	424fe6bf2c	Fix: use SM100_MMA_MXF8F6F4_SS (not MXF4) to match Nvfp4Linear path MXF4 has .block32 hardcoded. MXF8F6F4 matches what CuTeDSL generates via make_instr_desc_block_scaled. Both use E2M1 data + UE8M0 scales at hardware level. NVFP4 E2M1 microscales are combined into UE8M0 during quantization — no MXFP4 conversion.	2026-06-01 07:44:53 +00:00
biondizzle	2e2caadf7d	WIP: NVFP4 fused router kernel in raw CUDA C++ using DeepGEMM primitives - nvfp4_fused_router_kernel.cuh: 1-CTA NVFP4 GEMM + sqrt(softplus) + top-k epilogue - Uses DeepGEMM SM100 primitives: SM100_MMA_MXF4_SS, UTCCP, UMMA descriptors - 4 warp roles: TMA load, UTCCP transpose, MMA issue, epilogue - nvfp4_fused_router_cuda.py: Python wrapper (TMA descriptor setup TBD) NOT YET COMPILING - needs: 1. SMEM layout fix (single extern __shared__) 2. TMA descriptor creation (cuTensorMapEncodeTiled) 3. Top-k cross-warp merge completion 4. FP4 tensor format alignment with DeepGEMM	2026-06-01 07:41:42 +00:00
biondizzle	ef4c0ad489	Fix BF16 router mma_tiler: use cutlass.Int32 for CuTe DSL compatibility	2026-06-01 07:29:30 +00:00
biondizzle	79be9cb8da	Fix: hardcode mma_inst_shape_k=32 for NVFP4 (avoids MLIR unpack error in JIT)	2026-06-01 07:20:23 +00:00
biondizzle	c3a64ceed7	Fix: mma_tiler must use CuTe Ints for static layout construction	2026-06-01 07:19:15 +00:00
biondizzle	39b481e52b	Ensure mma_tiler contains CuTe Ints for cute.slice_ compatibility	2026-06-01 07:16:47 +00:00
biondizzle	57cc20d5ad	Fix SFA/SFB SMEM: blockscaled layouts are plain Layout (no .outer/.inner swizzle)	2026-06-01 07:14:45 +00:00
biondizzle	fcd7680583	Fix CuTe tensor creation: use from_dlpack + mark_layout_dynamic	2026-06-01 07:12:52 +00:00
biondizzle	3a8c6daeb3	Fix: cutlass_torch.make_tensor -> as_tensor	2026-06-01 07:11:43 +00:00
biondizzle	940f37fb6c	NVFP4 fused router kernel: full rewrite with proper block-scaled GEMM setup Major fixes: - Added tiled_mma_sfb creation (always CtaGroup.ONE, rounded N) - Added mma_tiler_sfb, cta_tile_shape_mnk_sfb, cluster_layout_sfb_vmnk - Use blockscaled_utils.make_smem_layout_sfa/sfb (with sf_vec_size) instead of sm100_utils (which doesn't support block-scaled SF layouts) - Proper TMEM column accounting for SFA + SFB + accumulator - Fixed make_blockscaled_trivial_tiled_mma argument order (a_dtype, b_dtype, a_major, b_major, sf_dtype, sf_vec_size, cta_group, mma_inst_shape) - Fixed SFB TMA atom to use tiled_mma_sfb and cluster_layout_sfb_vmnk - Fixed SFB partition_SFB to use tiled_mma_sfb.get_slice - Fixed SFB global tile partitioning to use mma_tiler_sfb - Fixed mainloop_s2t_copy_and_partition to use TMEM fragments (make_fragment_SFA/SFB) as the tSF parameter - Updated run_nvfp4_fused_router wrapper to accept processed weight tensors from Nvfp4Linear._mat_b and _scale_b - Updated test to properly build Nvfp4Linear and use processed weights The old code was a rough sketch that never worked — it was missing the entire tiled_mma_sfb infrastructure, used wrong SMEM layout functions, and had broken TMA atom setup for scale factors.	2026-06-01 07:08:12 +00:00
biondizzle	8658c8eca5	fix: add sf_vec_size parameter back to Nvfp4FusedRouterKernel __init__	2026-06-01 07:01:02 +00:00
biondizzle	b97f30e289	fix: store sf_vec_size as instance variable	2026-06-01 06:56:33 +00:00
biondizzle	c225d195ea	fix: remove tcgen05.mma.Kind (doesn't exist), use make_blockscaled_trivial_tiled_mma	2026-06-01 06:54:49 +00:00
biondizzle	90b2581dfe	feat: NVFP4 fused router CuTeDSL kernel (WIP) Single-kernel NVFP4 block-scaled GEMM + fused sqrt(softplus) + top-k epilogue. Avoids materializing intermediate FP32 logits to GMEM. Architecture: 6-warp specialization - Warp 5 (TMA): Load A, B, SFA, SFB from GMEM → SMEM - Warp 4 (MMA): NVFP4 block-scaled GEMM → FP32 accumulator in TMEM - Warps 0-3 (EPI): TMEM → registers → sqrt(softplus) + bias + top-k → GMEM Epilogue maintains per-thread min-heap across N subtiles, then merges all 128 threads' heaps in SMEM for final top-k selection. Mirrors Sm100BlockScaledPersistentDenseGemmKernel structure for TMA/MMA/SFA/SFB handling, with custom top-k epilogue replacing the standard SwiGLU + TMA store path. NOTE: This is WIP — needs compilation testing on B200. Several API details (tiled_mma_sfb, cluster_layout_sfb_vmnk) need to be passed through the kernel parameters properly.	2026-06-01 06:40:21 +00:00
biondizzle	cf2b7ab7ec	feat: NVFP4 gate projection for router (replaces BF16 cuBLAS) The dense router now uses NVFP4 GEMM via Nvfp4Linear for the gate projection when NVFP4 scales are available in the checkpoint. This replaces the BF16 cuBLAS GEMM with Blackwell SM100 tensor-core NVFP4 acceleration. Changes: - dsv4/layers/router.py: add gate_lin (Nvfp4Linear) alongside W_gate fallback. New load_nvfp4_gate() method. - dsv4/kernels/router/dense_router_decode.py: add dense_router_dispatch_nvfp4() using Nvfp4Linear + activation_topk - dsv4/kernels/router/__init__.py: export new function - single_shot_inference.py: load NVFP4 gate weights when available, fall back to BF16 when not	2026-06-01 05:58:56 +00:00
biondizzle	84ca520bfb	fix: move compressor position_bias into CUDA kernel (was Python loop) The compressor_reduce.cu kernel now adds position_bias to BOTH kv and gate values, matching the PyTorch reference. Previously the kernel only added it to gate, and a Python workaround loop was adding it to both before the kernel call (then passing None to the kernel). Changes: - compressor_reduce.cu: add position_bias to kv_val in pass 2 (CSA + HCA) - single_shot_inference.py: remove Python position_bias loop, pass self.ape directly to csa/hca_compress_production - production_compress.py: already supports position_bias passthrough	2026-06-01 05:54:44 +00:00
biondizzle	df8acae66b	fix: rewrite compressor_reduce.cu — no extern shared mem, proper bounds checks	2026-06-01 05:24:18 +00:00
biondizzle	62041b78bf	fix: import torch.utils.cpp_extension explicitly in production_compress	2026-06-01 05:20:44 +00:00
biondizzle	b380028c49	feat: production compressor/indexer — NVFP4 GEMM + CUDA softmax/reduce kernel - New compressor_reduce.cu: CSA/HCA token-level softmax + weighted sum + kv_norm One block per compressed entry, 128 threads, FP32 accumulation CSA: overlapping Ca/Cb streams (2m tokens per block) HCA: single stream (m tokens per block) Includes apply_kv_norm kernel (unweighted RMSNorm + weight) - New production_compress.py: Python wrapper for CUDA kernels - single_shot_inference.py: Compressor/Indexer now use production Nvfp4Linear for kv_proj, gate_proj, q_b_proj, weights_proj projections Then CUDA reduce kernel for softmax + weighted sum No more PyTorch reference nvfp4_linear_ref in compressor/indexer path	2026-06-01 05:18:59 +00:00
biondizzle	6e53e3007c	fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow	2026-06-01 04:59:06 +00:00
biondizzle	8ad617e2ff	diag: NaN detection in shared expert gate/up split	2026-06-01 04:01:46 +00:00
biondizzle	a53936a17c	diag: print l1_out shape warning in shared expert	2026-06-01 03:54:29 +00:00
biondizzle	62efde5c9f	fix: router — use cuBLAS BF16 GEMM + activation_topk CUDA kernel (production path, not CuTeDSL fused)	2026-06-01 01:01:15 +00:00
biondizzle	5591a725e1	fix: router kernel — infer OperandMajorMode from tensor layout (same pattern as MoE GEMM)	2026-06-01 00:59:18 +00:00
biondizzle	0ab5d8c317	fix: disable broken CuTeDSL fused router — use BF16 linear + activation_topk (both are production paths)	2026-06-01 00:56:00 +00:00
biondizzle	c339fe7ad9	fix: router A operand major mode MN (not K) — fixes CuTeDSL local_tile coord error	2026-06-01 00:54:19 +00:00
biondizzle	e671780008	fix: transpose checkpoint weights before make_b_k_major in Nvfp4Linear/SharedExpert Critical bug: checkpoint weights are (N_packed, K_packed) N-major format, but make_b_k_major expects (E, K_packed, N_packed) input. Without the permute, the K and N dimensions are swapped, producing garbage output with wrong dimensions (e.g., q_a output was 3584 instead of 1536). Also fix scale assembly: checkpoint scales are (N, K_sf) which should use assemble_raw_scales_2d3d_3d_side (no transpose), not assemble_scales_3d_side (which incorrectly transposes K_sf↔N).	2026-06-01 00:30:37 +00:00
biondizzle	e8a7a9256f	fix: convert uint8 checkpoint weights to float4_e2m1fn_x2 for CuTeDSL GEMM The CuTeDSL kernel expects float4_e2m1fn_x2 dtype for FP4 weight tensors, but checkpoint weights from safetensors are loaded as uint8. The uint8 and float4_e2m1fn_x2 have the same byte representation, so .view() is safe. Fixed in: - Nvfp4Linear.finalize_weights() - Nvfp4SharedExpert.finalize_weights() - Nvfp4MoE._ensure_stacked() (both stacked and legacy paths)	2026-06-01 00:18:34 +00:00
biondizzle	172448514c	fix: fold weight_scale_2 into global_scale_b for NVFP4 GEMM Critical bug fix: weight_scale_2 (the second-level NVFP4 scale) was being dropped entirely in the production pipeline. The dequant formula is lut[w] * weight_scale * weight_scale_2, so weight_scale_2 must be folded into the GEMM's global_scale_b parameter. Fixes in: - Nvfp4Linear: ws2 field, folded in finalize_weights() - Nvfp4MoE: l1_ws2/l2_ws2 lists, folded in _ensure_stacked() - Nvfp4SharedExpert: l1_ws2/l2_ws2 lists, folded in finalize_weights() - single_shot_inference.py: pass weight_scale_2 through all loading paths - Also fix missing o_a_prod key fallback in attention output	2026-06-01 00:10:50 +00:00
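
Commit 172448514c gives the dequant formula as lut[w] * weight_scale * weight_scale_2 and says the second-level scale must be folded into global_scale_b. The following is a minimal pure-Python sketch of why that folding is equivalent, not the repository's code. It assumes weight_scale_2 is a single scalar, that one weight_scale entry covers 16 elements (the sf_vec_size=16 mentioned in the log), and the standard E2M1 value table; the function names are hypothetical.

```python
import math

# E2M1 (FP4) magnitudes for the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_LUT = E2M1_MAGNITUDES + [-m for m in E2M1_MAGNITUDES]

BLOCK_SIZE = 16  # assumption: elements sharing one block scale


def dequantize(codes, weight_scale, weight_scale_2):
    """Per-element reference: lut[w] * weight_scale * weight_scale_2."""
    return [
        E2M1_LUT[c] * weight_scale[i // BLOCK_SIZE] * weight_scale_2
        for i, c in enumerate(codes)
    ]


def dot_reference(codes, weight_scale, weight_scale_2, x):
    """Dot product against fully dequantized weights."""
    w = dequantize(codes, weight_scale, weight_scale_2)
    return sum(wi * xi for wi, xi in zip(w, x))


def dot_folded(codes, weight_scale, global_scale_b, x):
    """Block-scaled dot product; the global scale is applied once at the end."""
    acc = 0.0
    for i, c in enumerate(codes):
        acc += E2M1_LUT[c] * weight_scale[i // BLOCK_SIZE] * x[i]
    return acc * global_scale_b


codes = [1, 2, 3, 4, 5, 6, 7, 9] * 4   # two blocks of 16 four-bit codes
weight_scale = [0.5, 2.0]              # one block scale per 16 elements
weight_scale_2 = 0.25                  # second-level scale (assumed scalar)
x = [1.0] * 16 + [2.0] * 16            # activations

ref = dot_reference(codes, weight_scale, weight_scale_2, x)
folded = dot_folded(codes, weight_scale, weight_scale_2, x)
dropped = dot_folded(codes, weight_scale, 1.0, x)  # second-level scale omitted

print(math.isclose(ref, folded), math.isclose(ref, dropped))
```

Because weight_scale_2 is constant across the reduction, it factors out of the sum, so applying it once to the accumulator matches the per-element reference. Leaving it out, as in the `dropped` case, scales every output by the wrong constant.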

561 Commits
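
Several commits above (71deeb91a9, bab748763e, 2433700a69) describe the router epilogue as sqrt(softplus(logit * gsa * gsb)) followed by top-k selection and renormalization. The following pure-Python reference is a rough sketch of that computation, not the kernel or the repository's wrapper. The function names are hypothetical, k=6 is taken from the "Top-6" wording in the log, and the e_bias term mentioned in d01b4b02de is left out because the log does not say how it enters the selection.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route(logits, gsa, gsb, k=6):
    """Return (expert indices, renormalized weights) for the top-k scores.

    logits: raw GEMM outputs for one token, one value per expert
    gsa, gsb: activation and weight global scales applied before activation
    """
    scores = [math.sqrt(softplus(logit * gsa * gsb)) for logit in logits]
    # Sort expert indices by descending score and keep the first k.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    weights = [scores[i] / total for i in top]
    return top, weights


logits = [0.0, 1.0, -1.0, 2.0, 3.0, 0.5, -2.0, 4.0]
indices, weights = route(logits, gsa=2.0, gsb=0.5)
print(indices, round(sum(weights), 6))
```

Since sqrt(softplus(x)) is monotonically increasing, the selected experts are the ones with the largest scaled logits. The activation only changes the weights assigned to them after renormalization.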