nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	b85fcf4d6f	diag: print SE global scales for first 3 layers	2026-06-01 02:49:55 +00:00
biondizzle	48d93a6d2e	diag: MoE input/output diagnostics for first 3 layers	2026-06-01 02:41:12 +00:00
biondizzle	856a459a98	fix: init l1_gsa_list and l2_gsa_list	2026-06-01 02:34:21 +00:00
biondizzle	66b98e5794	fix: MoE and shared expert global scale — gsb=ws2, gsa=input_scale (same bug as Nvfp4Linear)	2026-06-01 02:31:12 +00:00
biondizzle	f4b444b456	fix: NVFP4 global scale bug — gsb=weight_scale_2 (not input_scale*ws2), gsa=input_scale	2026-06-01 02:19:35 +00:00
biondizzle	1eed28dd09	diag: compare production FMHA and NVFP4 linear output with PyTorch reference	2026-06-01 02:12:39 +00:00
biondizzle	df394f8b40	fix: missing closing quote on string literal	2026-06-01 02:02:14 +00:00
biondizzle	cfd2468c61	fix: decode loop also needs int32 token_ids for hash router	2026-06-01 01:58:45 +00:00
biondizzle	905623793b	fix: move token_ids to same GPU as router (was cuda:0 but router on cuda:N)	2026-06-01 01:49:40 +00:00
biondizzle	7804b779ce	diag: print wo_a g_flat magnitude to find where zeros come from	2026-06-01 01:40:53 +00:00
biondizzle	efe63caea9	diag: print FMHA output magnitude for first 3 layers	2026-06-01 01:34:02 +00:00
biondizzle	7fbbdc5204	diag: validate router output before MoE	2026-06-01 01:27:16 +00:00
biondizzle	f5fa84016e	diag: sync+error check after each layer on first token	2026-06-01 01:26:50 +00:00
biondizzle	91b3929605	fix: call moe_runner.run() and se_runner.run() (not __call__)	2026-06-01 01:14:38 +00:00
biondizzle	03c45d4bfb	fix: pass int32 token_ids to hash router (was int64)	2026-06-01 01:08:03 +00:00
biondizzle	62efde5c9f	fix: router — use cuBLAS BF16 GEMM + activation_topk CUDA kernel (production path, not CuTeDSL fused)	2026-06-01 01:01:15 +00:00
biondizzle	5591a725e1	fix: router kernel — infer OperandMajorMode from tensor layout (same pattern as MoE GEMM)	2026-06-01 00:59:18 +00:00
biondizzle	0ab5d8c317	fix: disable broken CuTeDSL fused router — use BF16 linear + activation_topk (both are production paths)	2026-06-01 00:56:00 +00:00
biondizzle	c339fe7ad9	fix: router A operand major mode MN (not K) — fixes CuTeDSL local_tile coord error	2026-06-01 00:54:19 +00:00
biondizzle	b7a8c44d26	single_shot: eager MoE/SE weight processing, stale GPU cleanup, --prefill-tokens flag	2026-06-01 00:42:08 +00:00
biondizzle	15f45b57c3	fix: correct Nvfp4Linear dimension inference from checkpoint weights Weight shape (N_packed, K_packed) means: - out_features = N_packed (GEMM output dim in BF16) - in_features = K_packed * 2 (BF16 input dim, for activation buffer)	2026-06-01 00:32:36 +00:00
biondizzle	e671780008	fix: transpose checkpoint weights before make_b_k_major in Nvfp4Linear/SharedExpert Critical bug: checkpoint weights are (N_packed, K_packed) N-major format, but make_b_k_major expects (E, K_packed, N_packed) input. Without the permute, the K and N dimensions are swapped, producing garbage output with wrong dimensions (e.g., q_a output was 3584 instead of 1536). Also fix scale assembly: checkpoint scales are (N, K_sf) which should use assemble_raw_scales_2d3d_3d_side (no transpose), not assemble_scales_3d_side (which incorrectly transposes K_sf↔N).	2026-06-01 00:30:37 +00:00
biondizzle	e8a7a9256f	fix: convert uint8 checkpoint weights to float4_e2m1fn_x2 for CuTeDSL GEMM The CuTeDSL kernel expects float4_e2m1fn_x2 dtype for FP4 weight tensors, but checkpoint weights from safetensors are loaded as uint8. The uint8 and float4_e2m1fn_x2 have the same byte representation, so .view() is safe. Fixed in: - Nvfp4Linear.finalize_weights() - Nvfp4SharedExpert.finalize_weights() - Nvfp4MoE._ensure_stacked() (both stacked and legacy paths)	2026-06-01 00:18:34 +00:00
biondizzle	172448514c	fix: fold weight_scale_2 into global_scale_b for NVFP4 GEMM Critical bug fix: weight_scale_2 (the second-level NVFP4 scale) was being dropped entirely in the production pipeline. The dequant formula is lut[w] * weight_scale * weight_scale_2, so weight_scale_2 must be folded into the GEMM's global_scale_b parameter. Fixes in: - Nvfp4Linear: ws2 field, folded in finalize_weights() - Nvfp4MoE: l1_ws2/l2_ws2 lists, folded in _ensure_stacked() - Nvfp4SharedExpert: l1_ws2/l2_ws2 lists, folded in finalize_weights() - single_shot_inference.py: pass weight_scale_2 through all loading paths - Also fix missing o_a_prod key fallback in attention output	2026-06-01 00:10:50 +00:00
biondizzle	563df02aef	fix: import SF_VEC_SIZE from quantize in gemm_runner (was NameError)	2026-06-01 00:04:48 +00:00
biondizzle	be476b2ce2	router: catch CuTeDSL warmup failures fast, don't let MLIR errors slow down init	2026-06-01 00:00:07 +00:00
biondizzle	56dff8d185	fix: W_gate is (H, E) but F.linear expects (E, H), transpose before linear	2026-05-31 23:55:16 +00:00
biondizzle	5396a04c28	router: broaden except to catch all CuTeDSL errors, fall through to cuBLAS+activation_topk path	2026-05-31 23:54:16 +00:00
biondizzle	3b5b9f487c	fix: compute num_tma_load_bytes inside cute.compile context	2026-05-31 23:53:13 +00:00
biondizzle	1bc0da0f35	fix: properly scope swap code inside else/guard blocks, replace continue with if guard	2026-05-31 23:51:43 +00:00
biondizzle	d0d765e1f2	fix: replace break statements with flag-based loops in router kernel (CuTeDSL restriction)	2026-05-31 23:50:39 +00:00
biondizzle	210391e571	fix: PersistentTileSchedulerParams constructor takes (problem_shape, cluster_shape) not from_shape	2026-05-31 23:49:12 +00:00
biondizzle	824d054ad7	fix: inside cute.compile args are already CuTe tensors, no conversion needed	2026-05-31 23:47:33 +00:00
biondizzle	6375e54396	fix: use from_dlpack + mark_layout_dynamic instead of non-existent to_cuTe_tensor in router	2026-05-31 23:46:35 +00:00
biondizzle	cb2ca8591f	fix: add @cute.jit to router compiled function	2026-05-31 23:44:53 +00:00
biondizzle	d5d2b7b4b8	fix: defer router MMA/TMA setup into cute.compile context (matches MoE pattern)	2026-05-31 23:44:00 +00:00
biondizzle	157f1c5258	fix: use OperandMajorMode from nvgpu (not deprecated tcgen05) and mma_tiler_mn in router kernel	2026-05-31 23:39:50 +00:00
biondizzle	1dbc57e2cd	fix: use mma_tiler_mn in _create_tiled_mma (attribute exists at init time)	2026-05-31 23:36:01 +00:00
biondizzle	d05dd50bf5	fix: OperandMajorMode.K not MAJOR_K (correct CuTeDSL API)	2026-05-31 23:34:54 +00:00
biondizzle	a6a8755439	single_shot: switch to head-packed FMHA dispatch (1 kernel launch vs 128)	2026-05-31 23:33:32 +00:00
biondizzle	80002f2efc	single_shot: production NVFP4 GEMM for ALL attention projections - Nvfp4Linear (CuTeDSL) for q_a, q_b, kv, o_b — NO more dequant+matmul - Production FMHA (6-warp TMA multi-tile) with per-head sink bias - Production MoE + Router + SharedExpert + mHC (unchanged) - wo_a still uses BF16 grouped BMM (checkpoint is BF16) - Compressor/Indexer still PyTorch ref (not yet on tensor cores) - Proper weight dimensions: q_a(7168->1536), q_b(1536->65536), kv(7168->512), o_b(16384->7168)	2026-05-31 23:28:16 +00:00
biondizzle	32efd5139d	Fix gate weight transpose: checkpoint is (E, H), Router expects (H, E)	2026-05-31 23:21:09 +00:00
biondizzle	e45c0ff51b	single_shot: use reference dequant for attn projections, focus on MoE+FMHA Nvfp4Linear causing CUDA context corruption (likely CuTeDSL JIT triggered by _ensure_initialized). Disable for now to validate the critical paths first: - Production FMHA with sink bias - Production MoE (Nvfp4MoE + Nvfp4SharedExpert) - Production Router (dense/hash) - Production mHC Attention projections use reference dequant+matmul for now. Will re-enable Nvfp4Linear after validating MoE path.	2026-05-31 23:20:04 +00:00
biondizzle	dfbffa1df1	single_shot: CUDA_LAUNCH_BLOCKING for debugging	2026-05-31 23:18:35 +00:00
biondizzle	a66fdf6049	single_shot: add sync to catch CUDA errors early	2026-05-31 23:17:46 +00:00
biondizzle	0b35c36d23	single_shot: memory-efficient MoE loading, lazy Nvfp4Linear init - MoE expert weights loaded per-expert to GPU (no huge CPU tensors) - Nvfp4Linear finalize_weights deferred (lazy on first forward) - Shared expert weights loaded directly to GPU - Added GPU cache cleanup at start - Fixed shared expert finalize_weights (now lazy)	2026-05-31 23:16:45 +00:00
biondizzle	050b5ee449	Fix n_h reference before assignment in single_shot	2026-05-31 23:14:24 +00:00
biondizzle	c5adbbfde6	FMHA sink: don't double-scale sink bias The sink bias from the checkpoint is already in the scaled domain (added to QKscale in the reference softmax). The kernel's running_max is max(QKscale), so the sink should be compared directly without multiplying by scale again.	2026-05-31 23:12:20 +00:00
biondizzle	4adee1207f	FMHA: zero-init my_p_vals to fix N<128 padding NaN When N<128, padded KV positions have my_p_vals[col] uninitialized for col >= kv_len. The PV GEMM then computes garbage_P × zero_V, which can produce NaN on tensor cores (0 × NaN = NaN). Fix: zero-initialize my_p_vals so padded positions contribute 0.	2026-05-31 23:11:12 +00:00
biondizzle	13be3ad443	FMHA sink bias in kernel + single_shot production rewrite FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh): - Added sink_bias field to FmhaTmaMultiRowMultiTileParams - After KV tile loop, sink logit is included in online softmax rescale: new_max = max(running_max, sink_bias * scale) rescale existing O_unnorm and running_sum running_sum += exp(sink_bias * scale - new_max) No PV contribution from sink (D5c: single softmax) - C API: fmha_multitile_decode_launch now takes sink_bias_ptr - Python: fmha_multitile_decode_raw accepts attn_sink tensor single_shot_inference.py: - Full rewrite to use production kernel stack - mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp) - Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b - FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback) - MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback) - Router: production dense/hash dispatch - Compressor/Indexer: reference dequant (not yet on tensor cores) - NO try/except fallbacks on production paths	2026-05-31 23:10:13 +00:00

1 2 3 4 5 ...

2010 Commits