nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	df8acae66b	fix: rewrite compressor_reduce.cu (no extern shared mem, proper bounds checks) (tag: v-single-shot-paris-20260601-0539)	2026-06-01 05:24:18 +00:00
biondizzle	62041b78bf	fix: import torch.utils.cpp_extension explicitly in production_compress	2026-06-01 05:20:44 +00:00
biondizzle	2155fd6c90	test: production compressor kernel unit test	2026-06-01 05:19:13 +00:00
biondizzle	b380028c49	feat: production compressor/indexer (NVFP4 GEMM + CUDA softmax/reduce kernel). New compressor_reduce.cu: CSA/HCA token-level softmax + weighted sum + kv_norm; one block per compressed entry, 128 threads, FP32 accumulation; CSA: overlapping Ca/Cb streams (2m tokens per block); HCA: single stream (m tokens per block); includes apply_kv_norm kernel (unweighted RMSNorm + weight). New production_compress.py: Python wrapper for CUDA kernels. single_shot_inference.py: Compressor/Indexer now use production Nvfp4Linear for the kv_proj, gate_proj, q_b_proj, and weights_proj projections, then the CUDA reduce kernel for softmax + weighted sum; the PyTorch reference nvfp4_linear_ref is no longer used in the compressor/indexer path.	2026-06-01 05:18:59 +00:00
biondizzle	6e53e3007c	fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4; prevents NaN from overflow (tag: v-working-e2e-20260601-0515)	2026-06-01 04:59:06 +00:00
biondizzle	eb9c46f8cb	test: quantize on different GPUs	2026-06-01 04:48:30 +00:00
biondizzle	9ce7304783	test: direct SE L1 test on different GPUs	2026-06-01 04:43:48 +00:00
biondizzle	ce608d0e50	test: fix gemm 1-group test params	2026-06-01 04:40:07 +00:00
biondizzle	c652177970	test: fix gemm 1-group test	2026-06-01 04:35:55 +00:00
biondizzle	793f062bbc	auto: pre-test push for test_gemm_1group.py	2026-06-01 04:32:29 +00:00
biondizzle	86cb0e64a6	auto: pre-test push for test_se_dequant.py	2026-06-01 04:30:37 +00:00
biondizzle	9ba051cf49	test: fix gsa in SE multi-GPU test	2026-06-01 04:26:03 +00:00
biondizzle	419112dd3e	auto: pre-test push for test_se_multi_gpu.py	2026-06-01 04:22:38 +00:00
biondizzle	2cbc7459b0	diag: fix SE scale print (cast to float first)	2026-06-01 04:14:47 +00:00
biondizzle	bcd7a0cf0d	diag: check SE weight and scale integrity for first 3 layers	2026-06-01 04:08:21 +00:00
biondizzle	8ad617e2ff	diag: NaN detection in shared expert gate/up split	2026-06-01 04:01:46 +00:00
biondizzle	a53936a17c	diag: print l1_out shape warning in shared expert	2026-06-01 03:54:29 +00:00
biondizzle	db30c4acd6	auto: pre-test push for test_se_gpu.py	2026-06-01 03:50:53 +00:00
biondizzle	3dd95ce77b	fix: set activation global scales AFTER _ensure_stacked/_ensure_initialized (which override them)	2026-06-01 03:43:09 +00:00
biondizzle	27c63b01d6	diag: remove broken SE reference comparison, add gsa/gsb print	2026-06-01 03:31:36 +00:00
biondizzle	9a27ed21fd	diag: compare shared expert output with PyTorch reference	2026-06-01 03:25:21 +00:00
biondizzle	ee8318ad58	diag: handle NaN in shared expert output print	2026-06-01 03:16:25 +00:00
biondizzle	7000762309	diag: fix SE weight attribute name	2026-06-01 03:09:11 +00:00
biondizzle	fba1c06cad	diag: check SE weight integrity	2026-06-01 03:02:44 +00:00
biondizzle	22d7cc9b7a	diag: cuda sync check after shared expert for first 3 layers	2026-06-01 02:56:28 +00:00
biondizzle	b85fcf4d6f	diag: print SE global scales for first 3 layers	2026-06-01 02:49:55 +00:00
biondizzle	48d93a6d2e	diag: MoE input/output diagnostics for first 3 layers	2026-06-01 02:41:12 +00:00
biondizzle	856a459a98	fix: init l1_gsa_list and l2_gsa_list	2026-06-01 02:34:21 +00:00
biondizzle	66b98e5794	fix: MoE and shared expert global scale — gsb=ws2, gsa=input_scale (same bug as Nvfp4Linear)	2026-06-01 02:31:12 +00:00
biondizzle	f4b444b456	fix: NVFP4 global scale bug — gsb=weight_scale_2 (not input_scale*ws2), gsa=input_scale	2026-06-01 02:19:35 +00:00
biondizzle	1eed28dd09	diag: compare production FMHA and NVFP4 linear output with PyTorch reference	2026-06-01 02:12:39 +00:00
biondizzle	df394f8b40	fix: missing closing quote on string literal	2026-06-01 02:02:14 +00:00
biondizzle	cfd2468c61	fix: decode loop also needs int32 token_ids for hash router	2026-06-01 01:58:45 +00:00
biondizzle	905623793b	fix: move token_ids to same GPU as router (was cuda:0 but router on cuda:N)	2026-06-01 01:49:40 +00:00
biondizzle	7804b779ce	diag: print wo_a g_flat magnitude to find where zeros come from	2026-06-01 01:40:53 +00:00
biondizzle	efe63caea9	diag: print FMHA output magnitude for first 3 layers	2026-06-01 01:34:02 +00:00
biondizzle	7fbbdc5204	diag: validate router output before MoE	2026-06-01 01:27:16 +00:00
biondizzle	f5fa84016e	diag: sync+error check after each layer on first token	2026-06-01 01:26:50 +00:00
biondizzle	91b3929605	fix: call moe_runner.run() and se_runner.run() (not __call__)	2026-06-01 01:14:38 +00:00
biondizzle	03c45d4bfb	fix: pass int32 token_ids to hash router (was int64)	2026-06-01 01:08:03 +00:00
biondizzle	62efde5c9f	fix: router — use cuBLAS BF16 GEMM + activation_topk CUDA kernel (production path, not CuTeDSL fused)	2026-06-01 01:01:15 +00:00
biondizzle	5591a725e1	fix: router kernel — infer OperandMajorMode from tensor layout (same pattern as MoE GEMM)	2026-06-01 00:59:18 +00:00
biondizzle	0ab5d8c317	fix: disable broken CuTeDSL fused router — use BF16 linear + activation_topk (both are production paths)	2026-06-01 00:56:00 +00:00
biondizzle	c339fe7ad9	fix: router A operand major mode MN (not K) — fixes CuTeDSL local_tile coord error	2026-06-01 00:54:19 +00:00
biondizzle	b7a8c44d26	single_shot: eager MoE/SE weight processing, stale GPU cleanup, --prefill-tokens flag	2026-06-01 00:42:08 +00:00
biondizzle	15f45b57c3	fix: correct Nvfp4Linear dimension inference from checkpoint weights. Weight shape (N_packed, K_packed) means: out_features = N_packed (GEMM output dim in BF16); in_features = K_packed * 2 (BF16 input dim, for activation buffer).	2026-06-01 00:32:36 +00:00
biondizzle	e671780008	fix: transpose checkpoint weights before make_b_k_major in Nvfp4Linear/SharedExpert. Critical bug: checkpoint weights are in (N_packed, K_packed) N-major format, but make_b_k_major expects (E, K_packed, N_packed) input. Without the permute, the K and N dimensions are swapped, producing garbage output with wrong dimensions (e.g., q_a output was 3584 instead of 1536). Also fix scale assembly: checkpoint scales are (N, K_sf), which should use assemble_raw_scales_2d3d_3d_side (no transpose), not assemble_scales_3d_side (which incorrectly transposes K_sf↔N).	2026-06-01 00:30:37 +00:00
biondizzle	e8a7a9256f	fix: convert uint8 checkpoint weights to float4_e2m1fn_x2 for CuTeDSL GEMM. The CuTeDSL kernel expects float4_e2m1fn_x2 dtype for FP4 weight tensors, but checkpoint weights from safetensors are loaded as uint8. uint8 and float4_e2m1fn_x2 have the same byte representation, so .view() is safe. Fixed in: Nvfp4Linear.finalize_weights(); Nvfp4SharedExpert.finalize_weights(); Nvfp4MoE._ensure_stacked() (both stacked and legacy paths).	2026-06-01 00:18:34 +00:00
biondizzle	172448514c	fix: fold weight_scale_2 into global_scale_b for NVFP4 GEMM. Critical bug fix: weight_scale_2 (the second-level NVFP4 scale) was being dropped entirely in the production pipeline. The dequant formula is lut[w] * weight_scale * weight_scale_2, so weight_scale_2 must be folded into the GEMM's global_scale_b parameter. Fixes in: Nvfp4Linear (ws2 field, folded in finalize_weights()); Nvfp4MoE (l1_ws2/l2_ws2 lists, folded in _ensure_stacked()); Nvfp4SharedExpert (l1_ws2/l2_ws2 lists, folded in finalize_weights()); single_shot_inference.py (pass weight_scale_2 through all loading paths). Also fix missing o_a_prod key fallback in attention output.	2026-06-01 00:10:50 +00:00
biondizzle	563df02aef	fix: import SF_VEC_SIZE from quantize in gemm_runner (was NameError)	2026-06-01 00:04:48 +00:00
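
Commit 172448514c gives the dequant formula as lut[w] * weight_scale * weight_scale_2. The following is a minimal pure-Python sketch of that formula for one packed row, not the repository's implementation. The lookup table is the standard E2M1 value set; the low-nibble-first packing order, the 16-element block size, and the names decode_e2m1 and dequant_row are assumptions made for illustration.

```python
# E2M1 magnitudes indexed by the low 3 bits of a nibble; bit 3 is the sign.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code to a Python float."""
    magnitude = E2M1_LUT[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def dequant_row(packed, block_scales, weight_scale_2, block=16):
    """Dequantize one row of packed FP4 bytes (two values per byte).

    Applies lut[w] * weight_scale * weight_scale_2, where weight_scale is
    the per-block scale and weight_scale_2 is the per-tensor second-level
    scale. Assumption: the low nibble holds the earlier element.
    """
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            block_index = len(out) // block
            out.append(decode_e2m1(nibble) * block_scales[block_index] * weight_scale_2)
    return out


# Eight bytes hold one 16-element block.
row = dequant_row([0x21] * 8, block_scales=[2.0], weight_scale_2=0.5)
```

Dropping weight_scale_2 from the product, as the commit message describes the earlier pipeline doing, would scale every output in this sketch by 1 / weight_scale_2.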

2035 Commits
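
Commit 6e53e3007c clamps block_amax to the E4M3 maximum of 448. A minimal pure-Python sketch of that clamp is shown here; it is not the repository's quantize_activation_nvfp4. The 16-element block size and the name clamped_block_amax are assumptions, and how the repository uses the clamped value afterwards is not shown in the log.

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def clamped_block_amax(values, block=16):
    """Return the per-block absolute maximum, clamped to E4M3_MAX.

    Assumption: blocks are contiguous runs of `block` elements.
    """
    amaxes = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Clamp so a later conversion to E4M3 cannot exceed the format's range.
        amaxes.append(min(amax, E4M3_MAX))
    return amaxes


amaxes = clamped_block_amax([1000.0] + [0.0] * 15 + [-2.0] * 16)
```

In this sketch the first block holds an outlier of 1000.0, which is clamped to 448.0, while the second block keeps its true maximum of 2.0.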