nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	8658c8eca5	fix: add sf_vec_size parameter back to Nvfp4FusedRouterKernel __init__	2026-06-01 07:01:02 +00:00
biondizzle	b97f30e289	fix: store sf_vec_size as instance variable	2026-06-01 06:56:33 +00:00
biondizzle	c225d195ea	fix: remove tcgen05.mma.Kind (doesn't exist), use make_blockscaled_trivial_tiled_mma	2026-06-01 06:54:49 +00:00
biondizzle	e6803b450d	rewrite: simplified fused router test (reference + import check)	2026-06-01 06:53:17 +00:00
biondizzle	262cec262d	fix: add shape assertions to fused router test	2026-06-01 06:51:47 +00:00
biondizzle	db07d17a62	fix: set activation global scale in fused router test	2026-06-01 06:50:41 +00:00
biondizzle	2abb4a19d9	fix: set gs and ws2 fields for Nvfp4Linear in fused router test	2026-06-01 06:49:43 +00:00
biondizzle	61c04f7152	fix: Nvfp4Linear field is sf not scale_b	2026-06-01 06:48:39 +00:00
biondizzle	982f245c67	fix: use correct Nvfp4Linear field names (fp4, scale_b, gsb)	2026-06-01 06:47:15 +00:00
biondizzle	16af96380f	fix: use internal fields for Nvfp4Linear weight setup in test	2026-06-01 06:46:05 +00:00
biondizzle	7f1f224c78	fix: quantize_weight_to_nvfp4 returns 3 values, not 4	2026-06-01 06:43:53 +00:00
biondizzle	27fd847dd0	fix: correct quantize function name in fused router test	2026-06-01 06:41:54 +00:00
biondizzle	0873d65253	test: add fused router kernel test. Compares NVFP4 fused CuTeDSL kernel against reference (Nvfp4Linear + activation_topk) for correctness.	2026-06-01 06:40:46 +00:00
biondizzle	90b2581dfe	feat: NVFP4 fused router CuTeDSL kernel (WIP). Single-kernel NVFP4 block-scaled GEMM + fused sqrt(softplus) + top-k epilogue. Avoids materializing intermediate FP32 logits to GMEM. Architecture: 6-warp specialization. Warp 5 (TMA): load A, B, SFA, SFB from GMEM → SMEM; Warp 4 (MMA): NVFP4 block-scaled GEMM → FP32 accumulator in TMEM; Warps 0-3 (EPI): TMEM → registers → sqrt(softplus) + bias + top-k → GMEM. Epilogue maintains per-thread min-heap across N subtiles, then merges all 128 threads' heaps in SMEM for final top-k selection. Mirrors Sm100BlockScaledPersistentDenseGemmKernel structure for TMA/MMA/SFA/SFB handling, with custom top-k epilogue replacing the standard SwiGLU + TMA store path. NOTE: This is WIP and needs compilation testing on B200. Several API details (tiled_mma_sfb, cluster_layout_sfb_vmnk) need to be passed through the kernel parameters properly.	2026-06-01 06:40:21 +00:00
biondizzle	6c28c57b6a	feat: Nvfp4GroupedLinear for o_a_proj (replaces BF16 grouped BMM). The attention output projection first half (wo_a) was using BF16 grouped BMM (torch.bmm). It now uses the production Nvfp4GroupedLinear, which performs the same grouped GEMM with NVFP4 tensor-core acceleration on Blackwell. The weight is loaded from the NVFP4 checkpoint if available, otherwise quantized from BF16 via set_bf16_weight(). Also includes: NVFP4 gate projection for router (from previous commit); compressor position_bias in CUDA kernel (from earlier fix). v-nvfp4-router-oa-20260601-0610	2026-06-01 06:00:36 +00:00
biondizzle	cf2b7ab7ec	feat: NVFP4 gate projection for router (replaces BF16 cuBLAS). The dense router now uses NVFP4 GEMM via Nvfp4Linear for the gate projection when NVFP4 scales are available in the checkpoint. This replaces the BF16 cuBLAS GEMM with Blackwell SM100 tensor-core NVFP4 acceleration. Changes: dsv4/layers/router.py: add gate_lin (Nvfp4Linear) alongside W_gate fallback, new load_nvfp4_gate() method; dsv4/kernels/router/dense_router_decode.py: add dense_router_dispatch_nvfp4() using Nvfp4Linear + activation_topk; dsv4/kernels/router/__init__.py: export new function; single_shot_inference.py: load NVFP4 gate weights when available, fall back to BF16 when not.	2026-06-01 05:58:56 +00:00
biondizzle	9f14cb17d1	test: add compressor position_bias unit test. Verifies CUDA kernel matches PyTorch reference with and without position_bias for both CSA (m=4) and HCA (m=128) paths.	2026-06-01 05:55:05 +00:00
biondizzle	84ca520bfb	fix: move compressor position_bias into CUDA kernel (was Python loop). The compressor_reduce.cu kernel now adds position_bias to BOTH kv and gate values, matching the PyTorch reference. Previously the kernel only added it to gate, and a Python workaround loop was adding it to both before the kernel call (then passing None to the kernel). Changes: compressor_reduce.cu: add position_bias to kv_val in pass 2 (CSA + HCA); single_shot_inference.py: remove Python position_bias loop, pass self.ape directly to csa/hca_compress_production; production_compress.py: already supports position_bias passthrough.	2026-06-01 05:54:44 +00:00
biondizzle	311fae490f	tune: reduce verbose diagnostics, print every decode step v-e2e-paris-32tok-20260601-0549	2026-06-01 05:40:48 +00:00
biondizzle	df8acae66b	fix: rewrite compressor_reduce.cu — no extern shared mem, proper bounds checks v-single-shot-paris-20260601-0539	2026-06-01 05:24:18 +00:00
biondizzle	62041b78bf	fix: import torch.utils.cpp_extension explicitly in production_compress	2026-06-01 05:20:44 +00:00
biondizzle	2155fd6c90	test: production compressor kernel unit test	2026-06-01 05:19:13 +00:00
biondizzle	b380028c49	feat: production compressor/indexer with NVFP4 GEMM + CUDA softmax/reduce kernel. New compressor_reduce.cu: CSA/HCA token-level softmax + weighted sum + kv_norm; one block per compressed entry, 128 threads, FP32 accumulation; CSA: overlapping Ca/Cb streams (2m tokens per block); HCA: single stream (m tokens per block); includes apply_kv_norm kernel (unweighted RMSNorm + weight). New production_compress.py: Python wrapper for CUDA kernels. single_shot_inference.py: Compressor/Indexer now use production Nvfp4Linear for kv_proj, gate_proj, q_b_proj, weights_proj projections, then the CUDA reduce kernel for softmax + weighted sum; no more PyTorch reference nvfp4_linear_ref in compressor/indexer path.	2026-06-01 05:18:59 +00:00
biondizzle	6e53e3007c	fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow v-working-e2e-20260601-0515	2026-06-01 04:59:06 +00:00
biondizzle	eb9c46f8cb	test: quantize on different GPUs	2026-06-01 04:48:30 +00:00
biondizzle	9ce7304783	test: direct SE L1 test on different GPUs	2026-06-01 04:43:48 +00:00
biondizzle	ce608d0e50	test: fix gemm 1-group test params	2026-06-01 04:40:07 +00:00
biondizzle	c652177970	test: fix gemm 1-group test	2026-06-01 04:35:55 +00:00
biondizzle	793f062bbc	auto: pre-test push for test_gemm_1group.py	2026-06-01 04:32:29 +00:00
biondizzle	86cb0e64a6	auto: pre-test push for test_se_dequant.py	2026-06-01 04:30:37 +00:00
biondizzle	9ba051cf49	test: fix gsa in SE multi-GPU test	2026-06-01 04:26:03 +00:00
biondizzle	419112dd3e	auto: pre-test push for test_se_multi_gpu.py	2026-06-01 04:22:38 +00:00
biondizzle	2cbc7459b0	diag: fix SE scale print (cast to float first)	2026-06-01 04:14:47 +00:00
biondizzle	bcd7a0cf0d	diag: check SE weight and scale integrity for first 3 layers	2026-06-01 04:08:21 +00:00
biondizzle	8ad617e2ff	diag: NaN detection in shared expert gate/up split	2026-06-01 04:01:46 +00:00
biondizzle	a53936a17c	diag: print l1_out shape warning in shared expert	2026-06-01 03:54:29 +00:00
biondizzle	db30c4acd6	auto: pre-test push for test_se_gpu.py	2026-06-01 03:50:53 +00:00
biondizzle	3dd95ce77b	fix: set activation global scales AFTER _ensure_stacked/_ensure_initialized (which override them)	2026-06-01 03:43:09 +00:00
biondizzle	27c63b01d6	diag: remove broken SE reference comparison, add gsa/gsb print	2026-06-01 03:31:36 +00:00
biondizzle	9a27ed21fd	diag: compare shared expert output with PyTorch reference	2026-06-01 03:25:21 +00:00
biondizzle	ee8318ad58	diag: handle NaN in shared expert output print	2026-06-01 03:16:25 +00:00
biondizzle	7000762309	diag: fix SE weight attribute name	2026-06-01 03:09:11 +00:00
biondizzle	fba1c06cad	diag: check SE weight integrity	2026-06-01 03:02:44 +00:00
biondizzle	22d7cc9b7a	diag: cuda sync check after shared expert for first 3 layers	2026-06-01 02:56:28 +00:00
biondizzle	b85fcf4d6f	diag: print SE global scales for first 3 layers	2026-06-01 02:49:55 +00:00
biondizzle	48d93a6d2e	diag: MoE input/output diagnostics for first 3 layers	2026-06-01 02:41:12 +00:00
biondizzle	856a459a98	fix: init l1_gsa_list and l2_gsa_list	2026-06-01 02:34:21 +00:00
biondizzle	66b98e5794	fix: MoE and shared expert global scale — gsb=ws2, gsa=input_scale (same bug as Nvfp4Linear)	2026-06-01 02:31:12 +00:00
biondizzle	f4b444b456	fix: NVFP4 global scale bug — gsb=weight_scale_2 (not input_scale*ws2), gsa=input_scale	2026-06-01 02:19:35 +00:00
biondizzle	1eed28dd09	diag: compare production FMHA and NVFP4 linear output with PyTorch reference	2026-06-01 02:12:39 +00:00

2054 Commits