nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	9a3bb43f20	Set default max-tokens=512 for reasoning model	2026-06-01 17:27:01 +00:00
biondizzle	db6e3545da	Fix: add _use_runtime_gsa=True to router gate GEMM in single_shot The checkpoint-path gate was using the checkpoint's input_scale as gsa — the same E4M3 overflow bug we fixed in Nvfp4Linear/Nvfp4MoE/etc. The runtime-quantized BF16 path was using 1/(6*448) as a fixed gsa. Both now compute gsa from actual activation magnitude at runtime.	2026-06-01 17:25:04 +00:00
biondizzle	9d57b0453b	auto: pre-test commit	2026-06-01 15:04:46 +00:00
biondizzle	1a6d9ee29b	Reset to greedy decoding (temperature=0)	2026-06-01 15:04:02 +00:00
biondizzle	038fe81c68	Fix MoE non-fused L2 runtime gsa + update test harness for extra args	2026-06-01 15:03:54 +00:00
biondizzle	a48d6e14ae	Default temperature=0.7 with rep penalty	2026-06-01 14:55:43 +00:00
biondizzle	1d64b863ca	Add temperature sampling + repetition penalty to fix degenerate repetition With --temperature 0.7 --repetition-penalty 1.2, the model should generate more diverse text instead of repeating 'France' endlessly.	2026-06-01 14:54:49 +00:00
biondizzle	6cca16f97a	Set max-tokens=128 default, clean up for final verification	2026-06-01 14:43:48 +00:00
biondizzle	a0e758ec3b	Set default max-tokens=30 for faster iteration	2026-06-01 14:33:55 +00:00
biondizzle	2b1fca6dae	CRITICAL FIX: runtime activation global scale to prevent E4M3 overflow The checkpoint's input_scale was designed for training-time FP8 quantization, not NVFP4 activation quantization. Using it as gsa causes x/gsa to exceed the E4M3 block scale maximum (448), leading to systematic magnitude loss in every projection. This accumulates over 61 layers, compressing the logit range and producing garbage tokens. Fix: compute gsa at runtime from actual activation magnitude: gsa = max(|x|) / (6.0 * 448.0) This ensures x/gsa ≤ 2688 (the maximum representable in E4M3 block scales). Applied to: Nvfp4Linear, Nvfp4GroupedLinear, Nvfp4MoE, Nvfp4SharedExpert, Router gate	2026-06-01 14:21:16 +00:00
biondizzle	3b2714410f	Add NVFP4 linear accuracy test: prod vs ref with all-ones input	2026-06-01 14:15:27 +00:00
biondizzle	3e47d5f20a	Add prod vs ref GEMM comparison test + gate logits diagnostic	2026-06-01 14:11:37 +00:00
biondizzle	ad143afe37	Add L58-60 diagnostic: mHC A/B/C, MoE routed/shared, topk	2026-06-01 13:55:55 +00:00
biondizzle	7a05d3d3af	NVFP4 router gate: use Nvfp4Linear for both checkpoint and quantized paths - Checkpoint path: load NVFP4 gate weight directly into Nvfp4Linear - BF16 path: quantize and load into Nvfp4Linear - Both paths use proven production GEMM (no custom kernel) - load_nvfp4_fused_gate now creates Nvfp4Linear from BF16 weight	2026-06-01 11:25:50 +00:00
biondizzle	e5dbe1ed22	Switch router to Nvfp4Linear production GEMM (custom CuTeDSL kernel crashes MLIR) The custom fused router kernel crashes the CuTeDSL MLIR optimizer even with a simplified epilogue. Switch to the proven Nvfp4Linear path which uses the same NVFP4 Blackwell tensor-core GEMM, just with 2 kernel launches (GEMM + activation_topk) instead of 1. - Router's load_nvfp4_fused_gate now stores raw tensors for future use - single_shot_inference.py creates Nvfp4Linear from quantized gate weight - _run_dense_impl prioritizes gate_lin (NVFP4) over BF16 fallback	2026-06-01 11:17:54 +00:00
biondizzle	a4324781c3	Fix: properly remove sqrt(softplus) from CuTeDSL kernel Previous Python string replacement didn't match. Now using edit tool. Kernel writes raw FP32 logits with gsa*gsb applied. sqrt(softplus) is done in PyTorch after the kernel returns.	2026-06-01 11:14:04 +00:00
biondizzle	6efe90cd85	Move sqrt(softplus) out of CuTeDSL kernel into Python The CuTeDSL MLIR optimizer crashes (SIGABRT/core dump) on the combination of exp+log+sqrt in a for-range loop. The kernel now writes raw FP32 logits (with gsa*gsb applied) and sqrt(softplus) is done in PyTorch post-kernel. The GEMM is still pure NVFP4 Blackwell tensor cores.	2026-06-01 11:12:41 +00:00
biondizzle	fbc1e883f2	Add try/except around fused NVFP4 gate loading with error reporting If the fused kernel path fails, fall back to BF16 cuBLAS instead of crashing. This lets us see the actual error and continue testing.	2026-06-01 11:08:06 +00:00
biondizzle	5f38430423	Fix: use 1-dim tensors for gate_ws2 and gate_input_scale	2026-06-01 11:05:09 +00:00
biondizzle	ec8f292112	Fix: use self.mma_tiler_mnk (full K=64) for SMEM layout computation SFA/SFB SMEM layouts need the full K dimension to compute the correct number of K-tiles. self.mma_tiler has K=1 (placeholder for cute.slice_) which gives 0 K-tiles and zero-dimension SMEM shapes.	2026-06-01 11:03:08 +00:00
biondizzle	44fb9b6c00	Fix: pass self.mma_tiler_mnk (full K) to _compute_stages, not self.mma_tiler (K=1 placeholder)	2026-06-01 10:55:43 +00:00
biondizzle	be2bb2fe84	Fix: self.mma_tiler_mnk not mma_tiler_mnk	2026-06-01 10:49:05 +00:00
biondizzle	c082843ecc	Fix: mma_tiler K=1 placeholder in __init__, refined in _setup_attributes Same pattern as fused_swiglu.py: - __init__ sets mma_tiler = (M, N, 1) with K=1 placeholder - _setup_attributes refines K to the actual value from cute.size(tiled_mma.shape_mnk) - cute.slice_ and cute.local_tile work correctly with the K=1 initial value - mma_tiler_sfb also gets K=1 placeholder This fixes the MLIR crash on cute.slice_(self.mma_tiler, (None, 0, None)) which couldn't handle the full (128, 128, 64) tuple.	2026-06-01 10:42:21 +00:00
biondizzle	e0f60b9f05	Fix fused router: plain ints for mma_tiler + @cute.jit pattern Root cause of previous crash: cutlass.Int32(128) wrapping of mma_inst_shape_mn caused _unpack_x_tuple to fail in cute.size(tiled_mma.shape_mnk, mode=[2]). The fused_swiglu kernel uses plain Python ints for mma_tiler_mnk and mma_inst_shape_mn — NOT cutlass.Int32. Inside @cute.jit, CuTeDSL auto-converts plain ints to MLIR values. The Int32 wrapping was unnecessary and actually harmful. Pattern: same as fused_swiglu.py __call__: - @cute.jit compiled_fn takes CuTe tensors - _setup_attributes called inside JIT (needs MLIR context) - cute.compile at the end	2026-06-01 10:37:15 +00:00
biondizzle	057ae2101e	CRITICAL FIX: Move tiled_mma creation and _setup_attributes OUTSIDE @cute.jit The _setup_attributes() calls cute.size(tiled_mma.shape_mnk, mode=[2]) which requires host-side execution. Inside @cute.jit, tiled_mma.shape_mnk returns MLIR values that can't be unpacked by cute.size(). This follows the fused_swiglu.py pattern exactly: setup on host side, then pass everything to the kernel. Removed @cute.jit wrapper entirely in favor of direct kernel launch (same as fused_swiglu).	2026-06-01 10:28:01 +00:00
biondizzle	71deeb91a9	Quantize BF16 gate weight to NVFP4 for fused router + add global scales to GEMM CRITICAL: Checkpoint stores gate weights as BF16, not NVFP4. Previous code fell back to BF16 cuBLAS because weight_scale was missing. Now we quantize the BF16 gate weight to NVFP4 at load time using quantize_to_nvfp4() and pass the result to the fused router kernel. Also added global scale (gsa, gsb) parameters to the kernel: - gsa (activation global scale) applied during activation quantization - gsb (weight global scale) applied in epilogue before sqrt(softplus) - The MMA output is (A * SFA) @ (B * SFB), missing gsa*gsb - Epilogue now computes sqrt(softplus(logit * gsa * gsb)) instead of sqrt(softplus(logit))	2026-06-01 10:14:29 +00:00
biondizzle	24fed15ed6	Fix: convert PyTorch tensors to CuTe tensors for fused router kernel - Added cutlass_torch.from_dlpack() + mark_layout_dynamic() conversions - quantize_activation_nvfp4 returns (fp4_packed, fp8_scales) which are converted to CuTe tensors before passing to the kernel - Same pattern as gemm_runner.py	2026-06-01 10:02:40 +00:00
biondizzle	bab748763e	Rewrite NVFP4 fused router kernel: MoE-style epilogue replaces broken SMEM merge CRITICAL REWRITE of nvfp4_fused_router_kernel.py: - REMOVED: Raw pointer SMEM merge (storage.merge_scores.data_ptr()[idx] = val) This crashed the CuTeDSL MLIR optimizer. Never use raw pointer indexing inside CuTeDSL kernels. - REMOVED: Per-thread top-k accumulation + 128-thread SMEM merge. Too complex for MLIR, caused SIGABRT during compilation. - ADDED: MoE-style epilogue (TMEM→regs→activation→SMEM→TMA store→GMEM) using paired copy atoms from CUTLASS (epilogue_tmem_copy_and_partition + epilogue_smem_copy_and_partition). Structurally identical to the proven FusedSwiGLUScaledGroupedGemmKernel epilogue. This SHOULD compile. - Activation: sqrt(softplus(logit)) in registers (replaces SwiGLU) - Output: FP32 activated scores written to GMEM via TMA store - Top-k handled by activation_topk CUDA kernel in Python wrapper Other changes: - _activation_topk.py: Added run_fused_activation_topk_pre_activated() for top-k + renorm on pre-activated scores (PyTorch reference, not CUDA kernel) - dense_router_dispatch_nvfp4_fused: Updated to match new kernel API - Kernel now uses standard _compute_stages() for SMEM budget calculation - Kernel now uses compute_epilogue_tile_shape() for epi_tile (not hardcoded) - C pipeline (PipelineTmaStore) added for SMEM→GMEM overlap	2026-06-01 09:59:34 +00:00
biondizzle	31ebe4f2db	Wire NVFP4 fused router kernel into e2e single-shot pipeline - Add dense_router_dispatch_nvfp4_fused() in dense_router_decode.py: single-kernel NVFP4 blockscaled GEMM + fused router epilogue - Router.load_nvfp4_fused_gate(): stores raw NVFP4 tensors for fused path - Router._run_dense_impl() dispatch priority: fused > 2-kernel > BF16 - single_shot_inference.py: loads raw NVFP4 gate weights for fused kernel instead of building Nvfp4Linear (which was the 2-kernel path) - Fix selection sort bug in nvfp4_fused_router_kernel.py: pass 0 was missing t_s/t_i/t_a temp save before swap, causing undefined vars - Export dense_router_dispatch_nvfp4_fused from __init__.py	2026-06-01 09:47:48 +00:00
biondizzle	d9d3ca42b0	Fix: mma_tiler and cluster_layout must use MLIR values for cute.slice_ cute.slice_ on Python int tuples fails. All values in mma_tiler and cluster_layout need to be cutlass.Int32() since they flow into cute.slice_ and cute.local_tile inside @cute.kernel. Now consistent: mma_inst_shape_mn, mma_tiler, cluster_layout_vmnk all use MLIR-typed values created inside @cute.jit context.	2026-06-01 09:42:17 +00:00
biondizzle	ec79f30709	Fix: PersistentTileSchedulerParams cluster_shape must be Python ints not MLIR values	2026-06-01 09:38:08 +00:00
biondizzle	28d0cb4f41	Revert cutlass.Int32 wrapping — now inside @cute.jit, cute.round_up works All CuTe DSL calls now happen inside @cute.jit context, so cute.round_up and all layout operations have proper MLIR context. No need for manual Int32 wrapping or Python math workarounds.	2026-06-01 09:35:03 +00:00
biondizzle	b536f99192	CRITICAL FIX: move ALL CuTe DSL setup inside @cute.jit context The root cause of ALL the MLIR crashes: _create_tiled_mma and _setup_attributes call cute.make_tiled_mma, sm100_utils.make_smem_layout_a, etc. These are MLIR operations that REQUIRE an active MLIR context. Previously they ran in run() OUTSIDE @cute.jit, so there was no MLIR context — causing 'Expected an MLIR object (got None)' in _pack_shape. Now ALL CuTe DSL calls happen INSIDE the @cute.jit function, matching fused_swiglu's pattern where __call__ is called from JIT context. Grid computation uses plain Python math (no MLIR needed).	2026-06-01 09:32:05 +00:00
biondizzle	65669596d4	Fix: all CuTe shape values must be cutlass.Int32 for MLIR compatibility Python ints cause 'Expected an MLIR object (got None)' in _pack_shape. This is the same fix we applied to the FMHA kernel mma_tiler. All mma_inst_shape, mma_tiler, cluster_shape values now use cutlass.Int32().	2026-06-01 09:30:15 +00:00
biondizzle	df48dacc2b	Fix: set mma_inst_shape_mn in __init__ before _create_tiled_mma call	2026-06-01 09:22:24 +00:00
biondizzle	28f78420c2	Fix: quantize_activation_nvfp4 API - correct signature and return values	2026-06-01 09:21:04 +00:00
biondizzle	7b3f6cb13c	Fix fused router: use run_nvfp4_fused_router wrapper, correct CuTe tensor API - kernel wrapper converts torch tensors to CuTe tensors with mark_layout_dynamic - test uses the wrapper instead of calling kernel.run() directly - mat_b/scale_b are now torch tensors (converted inside wrapper)	2026-06-01 09:19:48 +00:00
biondizzle	483e759d53	Fix: use tensor.mark_layout_dynamic() method (not cute.mark_layout_dynamic)	2026-06-01 09:16:33 +00:00
biondizzle	2412745b21	Test fix: slice NVFP4 logits to actual expert count (GEMM padding)	2026-06-01 09:15:06 +00:00
biondizzle	f33ca41c2a	Fused router: replace nested if/else top-k with flat find-min-replace approach The 5-level nested if/else for sorted insertion created O(2^5) MLIR regions that crashed the CuTeDSL MLIR optimizer (SIGABRT). New approach: - Find-min-replace: scan 6 entries to find minimum (sequential, 1-level nesting) - Replace the minimum if new score > min (flat conditionals by index) - Selection sort the final 6 entries after SMEM merge (descending order) - All conditionals are FLAT (at most 1 level of nesting) This should avoid the MLIR optimizer explosion while producing identical results.	2026-06-01 09:13:53 +00:00
biondizzle	4f4ae8febd	Test: enumerate CuTeDSL math API to check available operations	2026-06-01 09:11:29 +00:00
biondizzle	9b86b2b414	Test: fix fused router test - proper NVFP4 quantization and CuTe tensor setup - Use quantize_to_nvfp4 for weight quantization - Use quantize_activation_nvfp4 with computed global_scale - Get mat_b and scale_b from Nvfp4Linear after finalize_weights - Compare against both BF16 reference and NVFP4 GEMM reference	2026-06-01 08:56:20 +00:00
biondizzle	b94f8d4ed8	Test: fused router kernel vs BF16 reference path - BF16 GEMM + activation_topk as reference - NVFP4 GEMM + fused router epilogue as test target - Proper NVFP4 quantization and CuTe tensor creation - Cosine similarity and topk_ids matching validation	2026-06-01 08:54:24 +00:00
biondizzle	2433700a69	Fused router kernel: rewrite epilogue with proper CuTeDSL constructs - Replace Python lists with individual scalar variables (s0..s5, i0..i5, a0..a5) - Replace min-heap sift-down with fully unrolled sorted insertion (descending order, no dynamic indexing, no while loops) - Replace raw SMEM pointer arithmetic with CuTeDSL SMEM tensors (s_merge_s, s_merge_i, s_merge_a) - Replace cute.where with cute.math.fmax - Fix expert index calculation: col + tile_n_offset + subtile_idx * epi_n - Top-6 accumulates across all N-tiles (for E=384 with 3 tiles of 128) - Add iter_acc_early_release for overlapping accumulator - Rewrite test to compare fused kernel vs 2-kernel reference path - Remove stale memory doc	2026-06-01 08:49:39 +00:00
biondizzle	d01b4b02de	Complete NVFP4 fused router kernel: full MMA + router epilogue - TMA warp: persistent tile scheduling + TMA loads for A/B/SFA/SFB - MMA warp: blockscaled GEMM (tcgen05.mma.block_scale) with S2T copy for SFA/SFB, proper pipeline synchronization (AB + Acc pipelines) - Epilogue warps: TMEM->register via epilogue_tmem_copy_and_partition, sqrt(softplus) + e_bias + min-heap top-k + renormalization - Python wrapper: run_nvfp4_fused_router() with proper CuTe tensor creation via from_dlpack + mark_layout_dynamic - Single-kernel path, no BF16 fallback, no intermediate GMEM buffer - Following exact patterns from MoE fused_swiglu.py kernel	2026-06-01 08:37:10 +00:00
biondizzle	25b9a5f32d	Fix test: use from_dlpack for c_tensor	2026-06-01 07:55:29 +00:00
biondizzle	d2819fc39c	Fix test: use as_tensor instead of make_tensor	2026-06-01 07:54:36 +00:00
biondizzle	5ea71ebd78	Add NVFP4 CuTeDSL compilation test (verify MmaMXF4NVF4Op compiles)	2026-06-01 07:53:43 +00:00
biondizzle	fa6dbd4aa2	WIP: Rewrite NVFP4 fused router in CuTeDSL with MmaMXF4NVF4Op (sf_vec_size=16) Uses kind::mxf4nvf4 — native NVF4 with E2M1 microscales, 16-elem blocks. NO MXFP4, NO CONVERSIONS. Kernel incomplete — GEMM mainloop mirrors dense.py but epilogue is TODO. Need to verify CuTeDSL compilation works with proper PipelineTmaUmma/ PipelineUmmaAsync abstractions before adding top-k epilogue.	2026-06-01 07:53:21 +00:00
biondizzle	4f706b55d7	Remove raw CUDA C++ fused router and DeepGEMM (MXFP4, wrong instruction) DeepGEMM uses kind::mxf4.block_scale.block32 (MXFP4, UE8M0 scales, 32-elem blocks). DSV4 uses NVF4: kind::mxf4nvf4 (E2M1 microscales, 16-elem blocks). Using MXFP4 would require E2M1->UE8M0 conversion. NO CONVERSIONS. Rewriting fused router in CuTeDSL with MmaMXF4NVF4Op (sf_vec_size=16).	2026-06-01 07:51:31 +00:00
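Note on commit 2b1fca6dae: the runtime global scale formula quoted in that message, gsa = max(|x|) / (6.0 * 448.0), can be illustrated with a minimal pure-Python sketch. The function name `runtime_global_scale`, the flat list input, and the zero-input fallback are assumptions made for illustration only; they are not taken from the repository, whose actual code operates on GPU tensors.

```python
# Constants taken from the commit message: 6.0 * 448.0 = 2688.0
FP4_E2M1_MAX = 6.0    # largest magnitude of an FP4 (E2M1) element
FP8_E4M3_MAX = 448.0  # largest magnitude of an E4M3 block scale


def runtime_global_scale(values):
    """Return gsa = max(|x|) / (6 * 448) for a flat list of activations.

    Hypothetical helper for illustration only.
    """
    amax = max((abs(v) for v in values), default=0.0)
    if amax == 0.0:
        # Assumption: fall back to 1.0 so that x / gsa stays defined.
        return 1.0
    return amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)


x = [0.5, -1344.0, 3.25]
gsa = runtime_global_scale(x)
# After dividing by gsa, the largest magnitude is exactly 6 * 448.
scaled_max = max(abs(v) / gsa for v in x)
print(gsa, scaled_max)
```

Because gsa is derived from the activations themselves, the scaled values cannot exceed 2688 regardless of how the checkpoint's input_scale was calibrated.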

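Note on commits 6efe90cd85 and bab748763e: the post-kernel step those messages describe (sqrt(softplus(logit)) followed by top-k and renormalization) can be sketched in plain Python. This is a reference-style illustration under assumptions: the names `sqrt_softplus` and `route` are hypothetical, the e_bias term mentioned in commit d01b4b02de is omitted, and the repository performs these steps in PyTorch and CUDA rather than on Python lists.

```python
import math


def sqrt_softplus(x):
    """sqrt(log(1 + exp(x))), using a numerically stable softplus."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return math.sqrt(softplus)


def route(logits, k=6):
    """Activate the logits, keep the k largest scores, renormalize to sum 1.

    Hypothetical helper; k=6 mirrors the top-6 mentioned in the log.
    """
    scores = [sqrt_softplus(v) for v in logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_ids = order[:k]
    total = sum(scores[i] for i in top_ids)
    weights = [scores[i] / total for i in top_ids]
    return top_ids, weights


ids, weights = route([0.0, 2.0, -1.0, 1.0], k=2)
print(ids, weights)
```

Keeping the transcendental math outside the kernel, as the commits describe, leaves the GEMM output as raw FP32 logits and makes the activation easy to check against a reference like this one.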

2118 Commits
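
Note on commit f33ca41c2a: the find-min-replace top-k described in that message (scan the kept entries for the minimum, replace it when a new score is larger, then selection-sort the survivors in descending order) can be sketched as below. This is a pure-Python illustration with a hypothetical function name; the kernel in the log used unrolled scalar variables instead of lists, which this sketch does not reproduce.

```python
def top_k_find_min_replace(scores, k=6):
    """Return the k largest (score, index) pairs in descending score order."""
    slots = [(float("-inf"), -1)] * k

    for idx, score in enumerate(scores):
        # Find the slot holding the current minimum (one level of nesting).
        min_pos = 0
        for j in range(1, k):
            if slots[j][0] < slots[min_pos][0]:
                min_pos = j
        # Replace the minimum only if the new score beats it.
        if score > slots[min_pos][0]:
            slots[min_pos] = (score, idx)

    # Selection sort of the k survivors, descending by score.
    for i in range(k - 1):
        best = i
        for j in range(i + 1, k):
            if slots[j][0] > slots[best][0]:
                best = j
        slots[i], slots[best] = slots[best], slots[i]
    return slots


demo = top_k_find_min_replace([0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4], k=6)
print([idx for _, idx in demo])  # [1, 6, 3, 4, 7, 2]
```

The slots stay unsorted during accumulation, so each insertion needs only a linear scan and a single conditional write; ordering is deferred to one final pass.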