nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	2e2caadf7d	WIP: NVFP4 fused router kernel in raw CUDA C++ using DeepGEMM primitives - nvfp4_fused_router_kernel.cuh: 1-CTA NVFP4 GEMM + sqrt(softplus) + top-k epilogue - Uses DeepGEMM SM100 primitives: SM100_MMA_MXF4_SS, UTCCP, UMMA descriptors - 4 warp roles: TMA load, UTCCP transpose, MMA issue, epilogue - nvfp4_fused_router_cuda.py: Python wrapper (TMA descriptor setup TBD) NOT YET COMPILING - needs: 1. SMEM layout fix (single extern __shared__) 2. TMA descriptor creation (cuTensorMapEncodeTiled) 3. Top-k cross-warp merge completion 4. FP4 tensor format alignment with DeepGEMM	2026-06-01 07:41:42 +00:00
biondizzle	e3ea609ddd	Embed DeepGEMM source (not submodule) for SM100 raw CUDA GEMM primitives	2026-06-01 07:39:40 +00:00
biondizzle	dae83723a3	Add DeepGEMM as third-party dependency for SM100 raw CUDA GEMM primitives	2026-06-01 07:39:38 +00:00
biondizzle	ef4c0ad489	Fix BF16 router mma_tiler: use cutlass.Int32 for CuTe DSL compatibility	2026-06-01 07:29:30 +00:00
biondizzle	79be9cb8da	Fix: hardcode mma_inst_shape_k=32 for NVFP4 (avoids MLIR unpack error in JIT)	2026-06-01 07:20:23 +00:00
biondizzle	c3a64ceed7	Fix: mma_tiler must use CuTe Ints for static layout construction	2026-06-01 07:19:15 +00:00
biondizzle	39b481e52b	Ensure mma_tiler contains CuTe Ints for cute.slice_ compatibility	2026-06-01 07:16:47 +00:00
biondizzle	57cc20d5ad	Fix SFA/SFB SMEM: blockscaled layouts are plain Layout (no .outer/.inner swizzle)	2026-06-01 07:14:45 +00:00
biondizzle	fcd7680583	Fix CuTe tensor creation: use from_dlpack + mark_layout_dynamic	2026-06-01 07:12:52 +00:00
biondizzle	3a8c6daeb3	Fix: cutlass_torch.make_tensor -> as_tensor	2026-06-01 07:11:43 +00:00
biondizzle	0553117af6	Simplify fused router test: compare fused vs 2-kernel NVFP4 path	2026-06-01 07:10:55 +00:00
biondizzle	44a0e59808	Fix fused router test: use quantize_weight_to_nvfp4 (correct function name)	2026-06-01 07:08:56 +00:00
biondizzle	940f37fb6c	NVFP4 fused router kernel: full rewrite with proper block-scaled GEMM setup Major fixes: - Added tiled_mma_sfb creation (always CtaGroup.ONE, rounded N) - Added mma_tiler_sfb, cta_tile_shape_mnk_sfb, cluster_layout_sfb_vmnk - Use blockscaled_utils.make_smem_layout_sfa/sfb (with sf_vec_size) instead of sm100_utils (which doesn't support block-scaled SF layouts) - Proper TMEM column accounting for SFA + SFB + accumulator - Fixed make_blockscaled_trivial_tiled_mma argument order (a_dtype, b_dtype, a_major, b_major, sf_dtype, sf_vec_size, cta_group, mma_inst_shape) - Fixed SFB TMA atom to use tiled_mma_sfb and cluster_layout_sfb_vmnk - Fixed SFB partition_SFB to use tiled_mma_sfb.get_slice - Fixed SFB global tile partitioning to use mma_tiler_sfb - Fixed mainloop_s2t_copy_and_partition to use TMEM fragments (make_fragment_SFA/SFB) as the tSF parameter - Updated run_nvfp4_fused_router wrapper to accept processed weight tensors from Nvfp4Linear._mat_b and _scale_b - Updated test to properly build Nvfp4Linear and use processed weights The old code was a rough sketch that never worked — it was missing the entire tiled_mma_sfb infrastructure, used wrong SMEM layout functions, and had broken TMA atom setup for scale factors. v-nvfp4-fused-router-rewrite-20260601-0715	2026-06-01 07:08:12 +00:00
biondizzle	8658c8eca5	fix: add sf_vec_size parameter back to Nvfp4FusedRouterKernel __init__	2026-06-01 07:01:02 +00:00
biondizzle	b97f30e289	fix: store sf_vec_size as instance variable	2026-06-01 06:56:33 +00:00
biondizzle	c225d195ea	fix: remove tcgen05.mma.Kind (doesn't exist), use make_blockscaled_trivial_tiled_mma	2026-06-01 06:54:49 +00:00
biondizzle	e6803b450d	rewrite: simplified fused router test (reference + import check)	2026-06-01 06:53:17 +00:00
biondizzle	262cec262d	fix: add shape assertions to fused router test	2026-06-01 06:51:47 +00:00
biondizzle	db07d17a62	fix: set activation global scale in fused router test	2026-06-01 06:50:41 +00:00
biondizzle	2abb4a19d9	fix: set gs and ws2 fields for Nvfp4Linear in fused router test	2026-06-01 06:49:43 +00:00
biondizzle	61c04f7152	fix: Nvfp4Linear field is sf not scale_b	2026-06-01 06:48:39 +00:00
biondizzle	982f245c67	fix: use correct Nvfp4Linear field names (fp4, scale_b, gsb)	2026-06-01 06:47:15 +00:00
biondizzle	16af96380f	fix: use internal fields for Nvfp4Linear weight setup in test	2026-06-01 06:46:05 +00:00
biondizzle	7f1f224c78	fix: quantize_weight_to_nvfp4 returns 3 values, not 4	2026-06-01 06:43:53 +00:00
biondizzle	27fd847dd0	fix: correct quantize function name in fused router test	2026-06-01 06:41:54 +00:00
biondizzle	0873d65253	test: add fused router kernel test Compares NVFP4 fused CuTeDSL kernel against reference (Nvfp4Linear + activation_topk) for correctness.	2026-06-01 06:40:46 +00:00
biondizzle	90b2581dfe	feat: NVFP4 fused router CuTeDSL kernel (WIP) Single-kernel NVFP4 block-scaled GEMM + fused sqrt(softplus) + top-k epilogue. Avoids materializing intermediate FP32 logits to GMEM. Architecture: 6-warp specialization - Warp 5 (TMA): Load A, B, SFA, SFB from GMEM → SMEM - Warp 4 (MMA): NVFP4 block-scaled GEMM → FP32 accumulator in TMEM - Warps 0-3 (EPI): TMEM → registers → sqrt(softplus) + bias + top-k → GMEM Epilogue maintains per-thread min-heap across N subtiles, then merges all 128 threads' heaps in SMEM for final top-k selection. Mirrors Sm100BlockScaledPersistentDenseGemmKernel structure for TMA/MMA/SFA/SFB handling, with custom top-k epilogue replacing the standard SwiGLU + TMA store path. NOTE: This is WIP — needs compilation testing on B200. Several API details (tiled_mma_sfb, cluster_layout_sfb_vmnk) need to be passed through the kernel parameters properly.	2026-06-01 06:40:21 +00:00
biondizzle	6c28c57b6a	feat: Nvfp4GroupedLinear for o_a_proj (replaces BF16 grouped BMM) The attention output projection first half (wo_a) was using BF16 grouped BMM (torch.bmm). Now uses production Nvfp4GroupedLinear which performs the same grouped GEMM with NVFP4 tensor-core acceleration on Blackwell. The weight is loaded from NVFP4 checkpoint if available, otherwise quantized from BF16 via set_bf16_weight(). Also includes: - NVFP4 gate projection for router (from previous commit) - Compressor position_bias in CUDA kernel (from earlier fix) v-nvfp4-router-oa-20260601-0610	2026-06-01 06:00:36 +00:00
biondizzle	cf2b7ab7ec	feat: NVFP4 gate projection for router (replaces BF16 cuBLAS) The dense router now uses NVFP4 GEMM via Nvfp4Linear for the gate projection when NVFP4 scales are available in the checkpoint. This replaces the BF16 cuBLAS GEMM with Blackwell SM100 tensor-core NVFP4 acceleration. Changes: - dsv4/layers/router.py: add gate_lin (Nvfp4Linear) alongside W_gate fallback. New load_nvfp4_gate() method. - dsv4/kernels/router/dense_router_decode.py: add dense_router_dispatch_nvfp4() using Nvfp4Linear + activation_topk - dsv4/kernels/router/__init__.py: export new function - single_shot_inference.py: load NVFP4 gate weights when available, fall back to BF16 when not	2026-06-01 05:58:56 +00:00
biondizzle	9f14cb17d1	test: add compressor position_bias unit test Verifies CUDA kernel matches PyTorch reference with and without position_bias for both CSA (m=4) and HCA (m=128) paths.	2026-06-01 05:55:05 +00:00
biondizzle	84ca520bfb	fix: move compressor position_bias into CUDA kernel (was Python loop) The compressor_reduce.cu kernel now adds position_bias to BOTH kv and gate values, matching the PyTorch reference. Previously the kernel only added it to gate, and a Python workaround loop was adding it to both before the kernel call (then passing None to the kernel). Changes: - compressor_reduce.cu: add position_bias to kv_val in pass 2 (CSA + HCA) - single_shot_inference.py: remove Python position_bias loop, pass self.ape directly to csa/hca_compress_production - production_compress.py: already supports position_bias passthrough	2026-06-01 05:54:44 +00:00
biondizzle	311fae490f	tune: reduce verbose diagnostics, print every decode step v-e2e-paris-32tok-20260601-0549	2026-06-01 05:40:48 +00:00
biondizzle	df8acae66b	fix: rewrite compressor_reduce.cu — no extern shared mem, proper bounds checks v-single-shot-paris-20260601-0539	2026-06-01 05:24:18 +00:00
biondizzle	62041b78bf	fix: import torch.utils.cpp_extension explicitly in production_compress	2026-06-01 05:20:44 +00:00
biondizzle	2155fd6c90	test: production compressor kernel unit test	2026-06-01 05:19:13 +00:00
biondizzle	b380028c49	feat: production compressor/indexer — NVFP4 GEMM + CUDA softmax/reduce kernel - New compressor_reduce.cu: CSA/HCA token-level softmax + weighted sum + kv_norm One block per compressed entry, 128 threads, FP32 accumulation CSA: overlapping Ca/Cb streams (2m tokens per block) HCA: single stream (m tokens per block) Includes apply_kv_norm kernel (unweighted RMSNorm + weight) - New production_compress.py: Python wrapper for CUDA kernels - single_shot_inference.py: Compressor/Indexer now use production Nvfp4Linear for kv_proj, gate_proj, q_b_proj, weights_proj projections Then CUDA reduce kernel for softmax + weighted sum No more PyTorch reference nvfp4_linear_ref in compressor/indexer path	2026-06-01 05:18:59 +00:00
biondizzle	6e53e3007c	fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow v-working-e2e-20260601-0515	2026-06-01 04:59:06 +00:00
biondizzle	eb9c46f8cb	test: quantize on different GPUs	2026-06-01 04:48:30 +00:00
biondizzle	9ce7304783	test: direct SE L1 test on different GPUs	2026-06-01 04:43:48 +00:00
biondizzle	ce608d0e50	test: fix gemm 1-group test params	2026-06-01 04:40:07 +00:00
biondizzle	c652177970	test: fix gemm 1-group test	2026-06-01 04:35:55 +00:00
biondizzle	793f062bbc	auto: pre-test push for test_gemm_1group.py	2026-06-01 04:32:29 +00:00
biondizzle	86cb0e64a6	auto: pre-test push for test_se_dequant.py	2026-06-01 04:30:37 +00:00
biondizzle	9ba051cf49	test: fix gsa in SE multi-GPU test	2026-06-01 04:26:03 +00:00
biondizzle	419112dd3e	auto: pre-test push for test_se_multi_gpu.py	2026-06-01 04:22:38 +00:00
biondizzle	2cbc7459b0	diag: fix SE scale print (cast to float first)	2026-06-01 04:14:47 +00:00
biondizzle	bcd7a0cf0d	diag: check SE weight and scale integrity for first 3 layers	2026-06-01 04:08:21 +00:00
biondizzle	8ad617e2ff	diag: NaN detection in shared expert gate/up split	2026-06-01 04:01:46 +00:00
biondizzle	a53936a17c	diag: print l1_out shape warning in shared expert	2026-06-01 03:54:29 +00:00
biondizzle	db30c4acd6	auto: pre-test push for test_se_gpu.py	2026-06-01 03:50:53 +00:00
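
Note on the fused router epilogue (commit 90b2581dfe): the message describes a sqrt(softplus) + bias + top-k epilogue in which each of 128 threads keeps a min-heap and the heaps are then merged. The following is a minimal CPU sketch of that idea in plain Python, not the kernel's code. The function names, the strided assignment of experts to threads, and the choice to add the bias after the activation are all assumptions made for illustration.

```python
import heapq
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def router_score(logit):
    # Activation named in the commit message: sqrt(softplus(logit)).
    return math.sqrt(softplus(logit))


def thread_topk(pairs, k):
    # Bounded min-heap: the root is the smallest of the k best seen so far.
    heap = []
    for score, idx in pairs:
        if len(heap) < k:
            heapq.heappush(heap, (score, idx))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, idx))
    return heap


def fused_topk(logits, bias, k, num_threads=128):
    # Assumption: bias is added to the activated score before selection.
    selection = [router_score(l) + b for l, b in zip(logits, bias)]
    heaps = []
    for t in range(num_threads):
        # Assumption: thread t sees experts t, t + num_threads, ...
        pairs = [(selection[i], i) for i in range(t, len(selection), num_threads)]
        heaps.append(thread_topk(pairs, k))
    # Merge step: every global top-k entry survives in some thread's heap.
    merged = heapq.nlargest(k, (p for h in heaps for p in h))
    return [idx for _, idx in merged]
```

The merge is correct because an element in the global top-k is also in the top-k of whichever thread saw it, so no candidate is lost before the final selection.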


2067 Commits