nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	62efde5c9f	fix: router — use cuBLAS BF16 GEMM + activation_topk CUDA kernel (production path, not CuTeDSL fused)	2026-06-01 01:01:15 +00:00
biondizzle	5591a725e1	fix: router kernel — infer OperandMajorMode from tensor layout (same pattern as MoE GEMM)	2026-06-01 00:59:18 +00:00
biondizzle	0ab5d8c317	fix: disable broken CuTeDSL fused router — use BF16 linear + activation_topk (both are production paths)	2026-06-01 00:56:00 +00:00
biondizzle	c339fe7ad9	fix: router A operand major mode MN (not K) — fixes CuTeDSL local_tile coord error	2026-06-01 00:54:19 +00:00
biondizzle	e671780008	fix: transpose checkpoint weights before make_b_k_major in Nvfp4Linear/SharedExpert. Critical bug: checkpoint weights are in (N_packed, K_packed) N-major format, but make_b_k_major expects (E, K_packed, N_packed) input. Without the permute, the K and N dimensions are swapped, producing garbage output with wrong dimensions (e.g., q_a output was 3584 instead of 1536). Also fix scale assembly: checkpoint scales are (N, K_sf), which should use assemble_raw_scales_2d3d_3d_side (no transpose), not assemble_scales_3d_side (which incorrectly transposes K_sf↔N).	2026-06-01 00:30:37 +00:00
biondizzle	e8a7a9256f	fix: convert uint8 checkpoint weights to float4_e2m1fn_x2 for CuTeDSL GEMM. The CuTeDSL kernel expects float4_e2m1fn_x2 dtype for FP4 weight tensors, but checkpoint weights from safetensors are loaded as uint8. uint8 and float4_e2m1fn_x2 have the same byte representation, so .view() is safe. Fixed in: Nvfp4Linear.finalize_weights(); Nvfp4SharedExpert.finalize_weights(); Nvfp4MoE._ensure_stacked() (both stacked and legacy paths).	2026-06-01 00:18:34 +00:00
biondizzle	172448514c	fix: fold weight_scale_2 into global_scale_b for NVFP4 GEMM. Critical bug fix: weight_scale_2 (the second-level NVFP4 scale) was being dropped entirely in the production pipeline. The dequant formula is lut[w] * weight_scale * weight_scale_2, so weight_scale_2 must be folded into the GEMM's global_scale_b parameter. Fixes in: Nvfp4Linear (ws2 field, folded in finalize_weights()); Nvfp4MoE (l1_ws2/l2_ws2 lists, folded in _ensure_stacked()); Nvfp4SharedExpert (l1_ws2/l2_ws2 lists, folded in finalize_weights()); single_shot_inference.py (pass weight_scale_2 through all loading paths). Also fix missing o_a_prod key fallback in attention output.	2026-06-01 00:10:50 +00:00
biondizzle	563df02aef	fix: import SF_VEC_SIZE from quantize in gemm_runner (was NameError)	2026-06-01 00:04:48 +00:00
biondizzle	be476b2ce2	router: catch CuTeDSL warmup failures fast, don't let MLIR errors slow down init	2026-06-01 00:00:07 +00:00
biondizzle	56dff8d185	fix: W_gate is (H, E) but F.linear expects (E, H), transpose before linear	2026-05-31 23:55:16 +00:00
biondizzle	5396a04c28	router: broaden except to catch all CuTeDSL errors, fall through to cuBLAS+activation_topk path	2026-05-31 23:54:16 +00:00
biondizzle	3b5b9f487c	fix: compute num_tma_load_bytes inside cute.compile context	2026-05-31 23:53:13 +00:00
biondizzle	1bc0da0f35	fix: properly scope swap code inside else/guard blocks, replace continue with if guard	2026-05-31 23:51:43 +00:00
biondizzle	d0d765e1f2	fix: replace break statements with flag-based loops in router kernel (CuTeDSL restriction)	2026-05-31 23:50:39 +00:00
biondizzle	210391e571	fix: PersistentTileSchedulerParams constructor takes (problem_shape, cluster_shape) not from_shape	2026-05-31 23:49:12 +00:00
biondizzle	824d054ad7	fix: inside cute.compile args are already CuTe tensors, no conversion needed	2026-05-31 23:47:33 +00:00
biondizzle	6375e54396	fix: use from_dlpack + mark_layout_dynamic instead of non-existent to_cuTe_tensor in router	2026-05-31 23:46:35 +00:00
biondizzle	cb2ca8591f	fix: add @cute.jit to router compiled function	2026-05-31 23:44:53 +00:00
biondizzle	d5d2b7b4b8	fix: defer router MMA/TMA setup into cute.compile context (matches MoE pattern)	2026-05-31 23:44:00 +00:00
biondizzle	157f1c5258	fix: use OperandMajorMode from nvgpu (not deprecated tcgen05) and mma_tiler_mn in router kernel	2026-05-31 23:39:50 +00:00
biondizzle	1dbc57e2cd	fix: use mma_tiler_mn in _create_tiled_mma (attribute exists at init time)	2026-05-31 23:36:01 +00:00
biondizzle	d05dd50bf5	fix: OperandMajorMode.K not MAJOR_K (correct CuTeDSL API)	2026-05-31 23:34:54 +00:00
biondizzle	c5adbbfde6	FMHA sink: don't double-scale sink bias. The sink bias from the checkpoint is already in the scaled domain (added to QK*scale in the reference softmax). The kernel's running_max is max(QK*scale), so the sink should be compared directly without multiplying by scale again.	2026-05-31 23:12:20 +00:00
biondizzle	4adee1207f	FMHA: zero-init my_p_vals to fix N<128 padding NaN. When N<128, padded KV positions have my_p_vals[col] uninitialized for col >= kv_len. The PV GEMM then computes garbage_P × zero_V, which can produce NaN on tensor cores (0 × NaN = NaN). Fix: zero-initialize my_p_vals so padded positions contribute 0.	2026-05-31 23:11:12 +00:00
biondizzle	13be3ad443	FMHA sink bias in kernel + single_shot production rewrite. FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh): added sink_bias field to FmhaTmaMultiRowMultiTileParams; after the KV tile loop, the sink logit is included in the online softmax rescale: new_max = max(running_max, sink_bias * scale), rescale existing O_unnorm and running_sum, then running_sum += exp(sink_bias * scale - new_max); no PV contribution from sink (D5c: single softmax). C API: fmha_multitile_decode_launch now takes sink_bias_ptr. Python: fmha_multitile_decode_raw accepts attn_sink tensor. single_shot_inference.py: full rewrite to use production kernel stack; mHC uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp); projections use Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b; FMHA is 6-warp TMA multi-tile with sink bias (no SDPA fallback); MoE uses Nvfp4MoE + Nvfp4SharedExpert (no reference fallback); router uses production dense/hash dispatch; compressor/indexer use reference dequant (not yet on tensor cores); NO try/except fallbacks on production paths.	2026-05-31 23:10:13 +00:00
biondizzle	92200367f3	FMHA kernel fix: N_orig vs N_padded, correct softmax masking for seq_len < 128. ROOT CAUSE: fmha_multitile_op.py padded N to 128 for TMA alignment but then passed the PADDED N to the kernel as s_k (logical KV length). This told the kernel all 128 entries were valid, so softmax ran over zeros, diluting the result (e.g. 1 valid entry → softmax weight 1/128). FIX: Pass N_orig (true sequence length) as s_k for softmax masking, and N_padded (physical size) only for TMA descriptor creation. The kernel's existing col < kv_len guard correctly excludes padded entries from row_max and exp_sum calculations. Files changed: fmha_multitile_capi.cu (accept N_orig + N_padded, use N_orig for params.s_k and N_padded for TMA descriptors); fmha_multitile_op.py (pass N_orig and N_padded separately); single_shot_inference.py (removed SDPA fallback, kernel now correct).	2026-05-31 22:52:39 +00:00
biondizzle	2a886fe0f2	Add --no-thinking mode to skip thinking tokens and use second-best	2026-05-31 19:24:21 +00:00
biondizzle	7d9e70c5d5	Fix remaining mHC API references: layer_compare.py, layer.py comment	2026-05-31 18:38:34 +00:00
biondizzle	7b123d159f	CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax. Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection): 1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb]. Was applying Sinkhorn to post values and 2*sigmoid to comb values; this caused residual to grow unbounded (no doubly-stochastic constraint). 2. comb (B_l) must be TRANSPOSED in post_block. HF: comb.transpose(-1,-2) @ hidden_streams; was using B_l @ X_l without transpose. 3. Sinkhorn must start from softmax(logits) + eps, not exp(logits). HF: softmax → col norm → (iters-1) alternating; was using exp → alternating (different convergence behavior). 4. Missing hc_eps on pre (A_l). HF: sigmoid(...) + hc_eps; was missing the eps guard. 5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout; matches checkpoint naming and HF model. 6. Fixed fallback mHC initialization to use new API.	2026-05-31 18:38:12 +00:00
biondizzle	1c18c16c68	Fix production rope.py: FP32 arithmetic for forward_rope_partial + inverse_rope_bf16	2026-05-31 09:17:36 +00:00
biondizzle	df6220abaf	E5: Fold batch loop into native kernel grid (blockIdx.z) The 6-warp multi-tile kernel already supports batch natively via dim3 grid(1, n_h, batch). Removed Python for-loop for 4D input. Single kernel launch per layer for batched decode instead of batch_size launches. T>1 prefill still uses per-batch dispatch (E8 future work).	2026-05-30 21:21:02 +00:00
biondizzle	9d88769f5f	Wire indexer compute_index_scores_topk + fix compressor imports - indexer/__init__.py: compute_index_scores_topk now calls run_indexer_score_topk with proper tensor reshaping - compressor/__init__.py: added torch import, fixed csa_compress_tail and hca_compress_tail imports for flush.py - Full flush pipeline now importable end-to-end	2026-05-30 21:19:06 +00:00
biondizzle	daf84524ac	E2/E3: compressor bridge, indexer bridge, flush pipeline wiring - compress_tail.py: PyTorch reference CSA/HCA compression (token-level softmax over m/m' entries, paper eq. 11-12) - compressor/__init__.py: csa_compress_and_store, hca_compress_and_store bridges (compression deferred to flush pipeline) - indexer/__init__.py: compute_index_scores_topk bridge (NotImplemented) - Fixed attention.py: removed extra positions arg to write_swa	2026-05-30 21:16:54 +00:00
biondizzle	d3b772196d	E3: Implement DSV4Model — full model class - Token embedding → N×TransformerLayer → RMSNorm → lm_head - decode_step: single token decode with mHC state management - forward: prefill path (T tokens) - Cache handle acquisition per layer - mHC state initialization from embedding - Weight loading TODO (deferred to loader/)	2026-05-30 21:15:57 +00:00
biondizzle	b0cdd5af74	fix: extern declarations for gather_swa functions in gather_kv.cu	2026-05-30 21:14:15 +00:00
biondizzle	016d722abc	fix: single PYBIND11_MODULE for combined gather .so Both gather_kv.cu and gather_swa.cu are compiled into one .so. Only gather_kv.cu defines the PYBIND11_MODULE; gather_swa.cu just provides the function implementations.	2026-05-30 21:13:24 +00:00
biondizzle	8fb9d89658	fix: correct gather.py kernel_dir path	2026-05-30 21:12:09 +00:00
biondizzle	300dddedc0	E1-E4: gather kernels, handle wiring, rope, sync removal, e2e test E1: LayerCacheHandle now exposes gather_compressed_kv, gather_all_compressed_kv, gather_swa_kv, num_query_heads, head_dim. Gather kernels in dsv4/kernels/cuda/gather_swa.cu + gather_kv.cu. Python wrapper in dsv4/kernels/cache/gather.py. E2: tests/e2e/test_one_layer.py — SWA path smoke test. E3: Compressor/indexer __init__.py bridges (NotImplementedError stubs for CSA/HCA compress_and_store, compute_index_scores_topk). E4: Removed torch.cuda.synchronize() from fmha_multitile_op.py fast path. Error checking via C API return code instead. Also: forward_rope_partial in ops/rope.py (GPT-J interleaved, last 64 dims).	2026-05-30 21:10:26 +00:00
biondizzle	faf92b30ad	E1: Wire LayerCacheHandle gather methods + CUDA gather kernels - gather_compressed_kv: CSA top-k gather via existing gather_kv.cu - gather_all_compressed_kv: HCA dense gather via new gather_all_compressed_kernel - gather_swa_kv: SWA ring buffer gather via new gather_swa_kernel - Added gather_swa.cu with both SWA + all-compressed gather kernels - Added gather.py Python wrapper (torch.utils.cpp_extension JIT) - Updated handle.py: added schema field, num_query_heads/head_dim properties - Updated manager.py: passes schema + num_query_heads to handle All gather kernels: FP8→BF16 dequant + BF16 RoPE concat in single launch. Output: dense BF16 tensors ready for FMHA consumption.	2026-05-30 21:09:21 +00:00
biondizzle	4b9eed02e1	Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files - Deleted fmha.py (CuTeDSL slow path), FmhaKernel, Python KV merge - Deleted fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh - Moved fmha_qk_verify.cuh to tests/unit/qk_verify_kernel.cuh - Deleted decode_sparse.py, decode_swa.py, kernels/decode/ - Deleted 46 test_d.py probes, test_smem_, test_cotiled_, test_tmem_, test_smem_p_, test_ultra_minimal, test_fmha_pv16, test_working_softmax_maybe - Deleted root scratch: debug_linear.py, test_mapping.py, run_router_tests.py - Moved archive/ to archived_plans/code_archive/ - Rewrote production.py: single fast path via 6-warp multi-tile kernel - Added STATUS.md, audit_attention_live.md - Moved NEXT_PRIORITIES.md to archived_plans/	2026-05-30 21:08:12 +00:00
biondizzle	95725f1df0	P8: Delete 6 redundant .cuh variants + multihead CAPI/op Kept: fmha_6warp_tma_multirow_multitile.cuh (production kernel) Deleted: fmha_6warp.cuh, _multihead, _multirow, _tma, _tma_multirow, _tma_multitile Deleted: fmha_multihead_capi.cu, fmha_multihead_op.py production.py: Removed _dsv4_attention_fast_decode, unified dispatch to _dsv4_attention_multitile for all fast-path cases.	2026-05-30 17:21:15 +00:00
biondizzle	9d483b1c54	P8: Unified dispatch — multi-tile kernel handles all N production.py: Single fast path using multi-tile kernel for all N. Eliminates the separate _dsv4_attention_fast_decode path.	2026-05-30 17:19:09 +00:00
biondizzle	c0379a0f86	P6: Remove broken TMA store, use direct GMEM write from SMEM. cp.async.bulk.tensor store (SMEM→GMEM) is NOT available on SM100. The CUTLASS SM100 epilogue uses st.global directly. The one-way epilogue pipeline is now: 1. TMEM → regs (tcgen05.ld, warp-collective); 2. epilogue_op in regs (normalize, FP4 hook via ENABLE_FP4_EPILOGUE); 3. regs → SMEM (row-major, sO_epi); 4. SMEM → GMEM (direct write). This is the same pattern as the MoE kernel but with st.global instead of TMA store. Multi-CTA (D2) will use st.global with flat_divide coords. Removed: tma_o from FmhaParams, fmha_multihead_decode_tma_launch, sMbarStore from SMEM, broken TMA store PTX from fmha_tma.cuh.	2026-05-30 17:11:17 +00:00
biondizzle	f97359fbfc	P6: TMA store uses mbarrier completion (same as load) TMA store: cp.async.bulk.tensor.2d.global.shared::cluster.mbarrier::complete_tx::bytes Uses mbarrier for completion, not bulk_group. Restored sMbarStore to SMEM.	2026-05-30 17:07:24 +00:00
biondizzle	2de300e281	P6: Try shared::cluster instead of shared::cta for TMA store	2026-05-30 17:05:27 +00:00
biondizzle	829a5f93ce	P6: Fix TMA store PTX — remove .tile modifier, fix wait_group syntax	2026-05-30 17:04:38 +00:00
biondizzle	fd7c0cb773	P6: Fix TMA store — use bulk_group (commit+wait) not mbarrier TMA store uses cp.async.bulk.tensor.2d.global.shared::cta.tile.bulk_group NOT mbarrier::complete_tx::bytes. Completion tracked via: - cp.async.bulk.commit_group (after issuing stores) - cp.async.bulk.wait_group.read 0 (wait for all groups) Removed sMbarStore from SMEM allocations (no longer needed).	2026-05-30 16:57:35 +00:00
biondizzle	212fc85627	P6: One-way TMEM→regs→SMEM→TMA store epilogue - fmha_6warp_multihead.cuh: Rewritten epilogue with proper Blackwell pipeline 1. TMEM → regs (tcgen05.ld, warp-collective) 2. epilogue_op in regs (normalize, FP4 hook via ENABLE_FP4_EPILOGUE) 3. regs → SMEM row-major (sO_epi, for TMA tile format) 4. TMA store SMEM → GMEM (async, enables multi-CTA) Fallback to direct GMEM write when tma_o is nullptr. Added FmhaParams.tma_o field and ENABLE_FP4_EPILOGUE template param. - fmha_6warp_tma_multirow_multitile.cuh: Same epilogue pattern for multi-tile. Writes normalized output to sO_epi_rowmajor + TMA store (or direct GMEM). Added tma_o to FmhaTmaMultiRowMultiTileParams. - fmha_tma.cuh: Added tma_store_2d and tma_store_wait for async GMEM writes. - fmha_multihead_capi.cu: Added fmha_multihead_decode_tma_launch with per-(head,batch) TMA descriptors. Updated SMEM size calculation for sO_epi + sMbarStore. - fmha_multitile_capi.cu: Added tma_o=nullptr (backward compatible), updated SMEM size.	2026-05-30 16:56:07 +00:00
biondizzle	897a70a491	P5: minimal Python multi-tile test	2026-05-30 10:43:26 +00:00
biondizzle	a2627359fb	P5: fix TMA desc creation — write to HOST then cudaMemcpy to device	2026-05-30 10:40:01 +00:00
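
Note on 92200367f3 (N_orig vs N_padded): the dilution described in that commit can be reproduced with a minimal pure-Python sketch. This is not the kernel code. The helper name `masked_softmax`, the toy logits, and the choice of a valid logit of 0.0 (which makes the diluted weight exactly 1/128) are assumptions made for illustration only.

```python
import math

def masked_softmax(scores, kv_len):
    """Softmax over the first kv_len entries; entries at col >= kv_len get weight 0.

    Hypothetical helper mirroring the `col < kv_len` guard named in the commit.
    """
    valid = scores[:kv_len]
    row_max = max(valid)
    exps = [math.exp(s - row_max) for s in valid]
    exp_sum = sum(exps)
    return [e / exp_sum for e in exps] + [0.0] * (len(scores) - kv_len)

N_ORIG, N_PADDED = 1, 128                    # one real KV entry, padded to 128
scores = [0.0] + [0.0] * (N_PADDED - N_ORIG) # assumed logits; padding is zero

diluted = masked_softmax(scores, N_PADDED)   # bug: padded length used as kv_len
correct = masked_softmax(scores, N_ORIG)     # fix: true length used as kv_len

print(diluted[0])  # 0.0078125, i.e. 1/128
print(correct[0])  # 1.0
```

The physical buffer length stays at 128 in both calls; only the logical length passed as `kv_len` changes, which is the separation between TMA descriptor size and softmax masking length that the commit describes.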

518 Commits
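
Note on 172448514c (weight_scale_2 folding): the dequant formula quoted in that commit, lut[w] * weight_scale * weight_scale_2, implies that a per-tensor weight_scale_2 can be pulled out of the dot product and applied once as a global output scale. The sketch below checks this on toy data. It is plain Python, not the CuTeDSL GEMM; the function names, the unpacked 4-bit codes, the single shared weight_scale, and the idea that global_scale_b holds only weight_scale_2 are all simplifying assumptions. The lookup table is the standard E2M1 value set.

```python
# E2M1 (FP4) values: codes 0-7 are non-negative, codes 8-15 are their negatives.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def dequant(codes, weight_scale, weight_scale_2):
    """Reference dequant: lut[w] * weight_scale * weight_scale_2 per element."""
    return [E2M1_LUT[w] * weight_scale * weight_scale_2 for w in codes]

def dot_with_global_scale(codes, weight_scale, activations, global_scale_b):
    """Hypothetical GEMM stand-in: block scale inside, global scale applied once."""
    acc = sum(E2M1_LUT[w] * weight_scale * a for w, a in zip(codes, activations))
    return acc * global_scale_b

codes = [1, 7, 10, 4]                 # assumed 4-bit weight codes (already unpacked)
activations = [1.0, 2.0, -1.0, 0.5]   # assumed activation values
weight_scale = 0.25                   # assumed block-level scale
weight_scale_2 = 0.5                  # assumed second-level (per-tensor) scale

reference = sum(w * a for w, a in
                zip(dequant(codes, weight_scale, weight_scale_2), activations))
folded = dot_with_global_scale(codes, weight_scale, activations, weight_scale_2)
dropped = dot_with_global_scale(codes, weight_scale, activations, 1.0)

print(reference, folded, dropped)  # 1.8125 1.8125 3.625
```

With weight_scale_2 dropped, the result is off by exactly that factor, which matches the symptom the commit reports.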