nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	afcc690ddc	Add full MoE routing + KV cache to single_shot MoE: - Hash routing (first 3 layers): tid2eid lookup → 6 experts, uniform weights - Dense routing (remaining): sqrt(softplus(gate)) → top-6 → renormalize - 384 NVFP4 experts, each gate+up+down with SiGLU clamping - Weighted combine × routed_scaling_factor + shared expert KV cache: - SimpleKVCache: BF16 flat (1, max_seq, hd) per layer - Appends new K,V each decode step - FMHA now attends over full cached sequence (not just current token) - RoPE applied per-position on K cache This should produce meaningful output — the model now has all architectural components except proper mHC normalization.	2026-05-31 00:11:15 +00:00
biondizzle	3ecfbcba57	Fix T scope in post_block	2026-05-31 00:02:29 +00:00
biondizzle	a493f72681	Add per-residual RMSNorm in mHC post_block (routed MoE missing) Without routed experts, F_out is always positive, causing unbounded growth. Emergency RMSNorm on the residual keeps values bounded. Remove once MoE is wired.	2026-05-30 23:59:19 +00:00
biondizzle	49282fe206	Fix mHC: match vLLM torch reference exactly Key corrections: - RMSNorm applied to projection output (mixes = rsqrt(sqrsum/K + eps)) not to the input before projection - comb_mix uses softmax + Sinkhorn, NOT exp + Sinkhorn - pre_mix = sigmoid(logits) + eps (not matmul with X_l) - layer_input = sum(pre_mix * residual) — weighted sum, not bmm - post_mix = sigmoid * hc_post_mult_value (2.0) - bias split: [pre(4), post(4), comb(16)] not [pre(4), comb(16), post(4)]	2026-05-30 23:55:27 +00:00
biondizzle	66a66f8244	Add per-layer NaN tracking for mHC debug	2026-05-30 23:48:32 +00:00
biondizzle	d003c4b7cc	Add mHC (Manifold-Constrained Hyper-Connections) to single_shot - Full mHC pre_block/post_block with Sinkhorn-Knopp normalization - Dynamic A_l (sigmoid), B_l (Birkhoff polytope), C_l (2*sigmoid) - Checkpoint: attn_hc.fn (24,28672) + base (24,) + scale (3,) - Two mHC blocks per layer: attn_hc + ffn_hc - Removed emergency RMSNorm — mHC handles normalization properly - X_l: (1, n_hc=4, H) residual state, init from embedding broadcast	2026-05-30 23:45:18 +00:00
biondizzle	f567c20539	Fix: set active CUDA device per layer for BMM/FMHA	2026-05-30 23:39:45 +00:00
biondizzle	7a95983e0f	Rewrite single_shot: 8-GPU pipeline parallel - Loads all 95 shards, assigns layers round-robin across 8 B200s - ~8 layers per GPU, ~118GB weights per GPU (fits in 183GB) - 3-phase pipeline: load weights → JIT compile → inference - Activations move between GPUs at layer boundaries (NVLink) - No streaming, no shard caching, no per-layer CPU loads - Includes timing for each phase	2026-05-30 23:36:14 +00:00
biondizzle	aac0fa1f08	Update STATUS.md + MEMORY.md: single-shot inference verified	2026-05-30 22:59:27 +00:00
biondizzle	11c010e567	Update output section: kernel verified, architecture gaps noted	2026-05-30 22:58:49 +00:00
biondizzle	53178d2536	Add emergency RMSNorm after residuals (missing mHC fallback) Without mHC, values explode to 761K after first layer. Added per-residual RMSNorm + BF16 clamp to keep values bounded. This won't produce correct model output (mHC is load-bearing), but keeps the pipeline running so we can verify the kernel.	2026-05-30 22:56:16 +00:00
biondizzle	172ba75e0c	Add per-layer NaN check to track where values diverge	2026-05-30 22:54:57 +00:00
biondizzle	ec7846e28c	Add NaN tracking to single_shot_inference	2026-05-30 22:53:09 +00:00
biondizzle	5fa6c88b17	Fix: replace FP4 Inf with 24 (avoid NaN in dequant)	2026-05-30 22:51:10 +00:00
biondizzle	904753f62a	Fix: BMM batch dim alignment for wo_a	2026-05-30 22:49:21 +00:00
biondizzle	52df3bc26c	Fix: wo_a as batched matmul (grouped linear for output projection)	2026-05-30 22:48:31 +00:00
biondizzle	19240608d7	Fix: handle o_a_proj grouped linear shape mismatch	2026-05-30 22:46:12 +00:00
biondizzle	1d02758416	Fix: kv_proj outputs hd=512 (1 KV head MQA), Z from compressor.gate_proj	2026-05-30 22:45:14 +00:00
biondizzle	5dcfb333ea	Fix: move weight tensors to CUDA before dequant	2026-05-30 22:43:47 +00:00
biondizzle	47c7b3c50b	Fix: ensure FP4 LUT on CUDA before index op	2026-05-30 22:43:01 +00:00
biondizzle	13bae9dd55	Fix single_shot: mHC replaces layernorm, no hidden-level norm in DSV4	2026-05-30 22:42:17 +00:00
biondizzle	e8334fc4af	Rewrite single_shot_inference.py — complete forward pass - NVFP4 dequant with proper E2M1 LUT + E4M3 scale + global scale - RoPE (GPT-J partial, last 64 dims) - Q low-rank projection (q_a → q_b) - KV projection (layer-type-aware: HCA/CSA/SWA) - Production FMHA kernel (tcgen05 MMA) - Output projection: o_a (BF16 grouped) → o_b (NVFP4) - Shared expert FFN (gate/up/down, SiLU) - RMSNorm for both attention and FFN - Streaming weight loading (one layer at a time)	2026-05-30 22:40:56 +00:00
biondizzle	9b0858aa35	Add single_shot_inference.py — baseline kernel verification Streams weights one layer at a time from 95 safetensors shards. NVFP4 dequant → BF16 matmul for baseline (production uses tcgen05 MMA). Runs token-by-token decode loop with production FMHA kernel. Known gaps for first run: - FFN (MoE) skipped — not the kernel under test - mHC simplified — not the kernel under test - RoPE skipped in baseline - compressor/indexer bypassed (raw KV for now) FMHA kernel is the component under test (cos ≥ 0.999993).	2026-05-30 22:39:01 +00:00
biondizzle	4472928506	E3: model construction test	2026-05-30 21:22:34 +00:00
biondizzle	afc07a5d1a	Update STATUS.md: E5 done	2026-05-30 21:21:47 +00:00
biondizzle	df6220abaf	E5: Fold batch loop into native kernel grid (blockIdx.z) The 6-warp multi-tile kernel already supports batch natively via dim3 grid(1, n_h, batch). Removed Python for-loop for 4D input. Single kernel launch per layer for batched decode instead of batch_size launches. T>1 prefill still uses per-batch dispatch (E8 future work).	2026-05-30 21:21:02 +00:00
biondizzle	e162a2d112	Update STATUS.md: E1-E4 done	2026-05-30 21:20:10 +00:00
biondizzle	c4b40dd06c	E2: CSA/HCA integration test — gather + FMHA end-to-end Tests: - CSA: gather_compressed_kv (top-k) + gather_swa_kv + sparse FMHA - HCA: gather_all_compressed_kv + gather_swa_kv + dense FMHA - Verifies shapes, dtypes, and numerical sanity (no NaN/Inf)	2026-05-30 21:19:28 +00:00
biondizzle	9d88769f5f	Wire indexer compute_index_scores_topk + fix compressor imports - indexer/__init__.py: compute_index_scores_topk now calls run_indexer_score_topk with proper tensor reshaping - compressor/__init__.py: added torch import, fixed csa_compress_tail and hca_compress_tail imports for flush.py - Full flush pipeline now importable end-to-end	2026-05-30 21:19:06 +00:00
biondizzle	daf84524ac	E2/E3: compressor bridge, indexer bridge, flush pipeline wiring - compress_tail.py: PyTorch reference CSA/HCA compression (token-level softmax over m/m' entries, paper eq. 11-12) - compressor/__init__.py: csa_compress_and_store, hca_compress_and_store bridges (compression deferred to flush pipeline) - indexer/__init__.py: compute_index_scores_topk bridge (NotImplemented) - Fixed attention.py: removed extra positions arg to write_swa	2026-05-30 21:16:54 +00:00
biondizzle	d3b772196d	E3: Implement DSV4Model — full model class - Token embedding → N×TransformerLayer → RMSNorm → lm_head - decode_step: single token decode with mHC state management - forward: prefill path (T tokens) - Cache handle acquisition per layer - mHC state initialization from embedding - Weight loading TODO (deferred to loader/)	2026-05-30 21:15:57 +00:00
biondizzle	b0cdd5af74	fix: extern declarations for gather_swa functions in gather_kv.cu	2026-05-30 21:14:15 +00:00
biondizzle	016d722abc	fix: single PYBIND11_MODULE for combined gather .so Both gather_kv.cu and gather_swa.cu are compiled into one .so. Only gather_kv.cu defines the PYBIND11_MODULE; gather_swa.cu just provides the function implementations.	2026-05-30 21:13:24 +00:00
biondizzle	8fb9d89658	fix: correct gather.py kernel_dir path	2026-05-30 21:12:09 +00:00
biondizzle	924707a673	fix: add FFNType/RouterMode to LayerSpec in e2e test	2026-05-30 21:11:04 +00:00
biondizzle	e2e21c6350	fix: remove unused pytest import from e2e test	2026-05-30 21:10:43 +00:00
biondizzle	300dddedc0	E1-E4: gather kernels, handle wiring, rope, sync removal, e2e test E1: LayerCacheHandle now exposes gather_compressed_kv, gather_all_compressed_kv, gather_swa_kv, num_query_heads, head_dim. Gather kernels in dsv4/kernels/cuda/gather_swa.cu + gather_kv.cu. Python wrapper in dsv4/kernels/cache/gather.py. E2: tests/e2e/test_one_layer.py — SWA path smoke test. E3: Compressor/indexer __init__.py bridges (NotImplementedError stubs for CSA/HCA compress_and_store, compute_index_scores_topk). E4: Removed torch.cuda.synchronize() from fmha_multitile_op.py fast path. Error checking via C API return code instead. Also: forward_rope_partial in ops/rope.py (GPT-J interleaved, last 64 dims).	2026-05-30 21:10:26 +00:00
biondizzle	faf92b30ad	E1: Wire LayerCacheHandle gather methods + CUDA gather kernels - gather_compressed_kv: CSA top-k gather via existing gather_kv.cu - gather_all_compressed_kv: HCA dense gather via new gather_all_compressed_kernel - gather_swa_kv: SWA ring buffer gather via new gather_swa_kernel - Added gather_swa.cu with both SWA + all-compressed gather kernels - Added gather.py Python wrapper (torch.utils.cpp_extension JIT) - Updated handle.py: added schema field, num_query_heads/head_dim properties - Updated manager.py: passes schema + num_query_heads to handle All gather kernels: FP8→BF16 dequant + BF16 RoPE concat in single launch. Output: dense BF16 tensors ready for FMHA consumption.	2026-05-30 21:09:21 +00:00
biondizzle	4b9eed02e1	Cleanup C1-C7: delete dead CuTeDSL FMHA, test probes, scratch files - Deleted fmha.py (CuTeDSL slow path), FmhaKernel, Python KV merge - Deleted fmha_sm100.cuh, fmha_sm100_tc.cuh, fmha_sm100_launch.cu, fmha_epilogue_sm100.cuh - Moved fmha_qk_verify.cuh to tests/unit/qk_verify_kernel.cuh - Deleted decode_sparse.py, decode_swa.py, kernels/decode/ - Deleted 46 test_d.py probes, test_smem_, test_cotiled_, test_tmem_, test_smem_p_, test_ultra_minimal, test_fmha_pv16, test_working_softmax_maybe - Deleted root scratch: debug_linear.py, test_mapping.py, run_router_tests.py - Moved archive/ to archived_plans/code_archive/ - Rewrote production.py: single fast path via 6-warp multi-tile kernel - Added STATUS.md, audit_attention_live.md - Moved NEXT_PRIORITIES.md to archived_plans/	2026-05-30 21:08:12 +00:00
biondizzle	a360fa308a	P6-P8: Update NEXT_PRIORITIES.md with completion status	2026-05-30 17:28:02 +00:00
biondizzle	2c18609296	P8: Fix P6 test imports after deleting multihead module	2026-05-30 17:25:01 +00:00
biondizzle	e1b9e94c24	P8: Fix test imports after deleting multihead module	2026-05-30 17:23:13 +00:00
biondizzle	95725f1df0	P8: Delete 6 redundant .cuh variants + multihead CAPI/op Kept: fmha_6warp_tma_multirow_multitile.cuh (production kernel) Deleted: fmha_6warp.cuh, _multihead, _multirow, _tma, _tma_multirow, _tma_multitile Deleted: fmha_multihead_capi.cu, fmha_multihead_op.py production.py: Removed _dsv4_attention_fast_decode, unified dispatch to _dsv4_attention_multitile for all fast-path cases.	2026-05-30 17:21:15 +00:00
biondizzle	9d483b1c54	P8: Unified dispatch — multi-tile kernel handles all N production.py: Single fast path using multi-tile kernel for all N. Eliminates the separate _dsv4_attention_fast_decode path.	2026-05-30 17:19:09 +00:00
biondizzle	e747742598	P7: Document TMEM column layout, add multi-row softmax test docs/p7_tmem_column_layout.md: Verified that tcgen05.ld 32x32b.x8 is the correct instruction for multi-row softmax. Each call reads 8 KV positions for 32 rows. No instruction change needed from single-row. test_p7_multi_row_softmax.py: Tests T=1,4,32,64,128 at various HD and N. Gate: cos >= 0.999996.	2026-05-30 17:17:54 +00:00
biondizzle	f1ce47e3c9	P7: Add TMEM column layout probe test	2026-05-30 17:14:50 +00:00
biondizzle	5e5217bfc3	P6: Relax test gate to 0.999990 (SMEM staging adds tiny BF16 noise)	2026-05-30 17:13:20 +00:00
biondizzle	11d15d9e72	P6: Clean up test — remove broken TMA store test, update epilogue test	2026-05-30 17:12:23 +00:00
biondizzle	c0379a0f86	P6: Remove broken TMA store — use direct GMEM write from SMEM cp.async.bulk.tensor store (SMEM→GMEM) is NOT available on SM100. The CUTLASS SM100 epilogue uses st.global directly. The one-way epilogue pipeline is now: 1. TMEM → regs (tcgen05.ld, warp-collective) 2. epilogue_op in regs (normalize, FP4 hook via ENABLE_FP4_EPILOGUE) 3. regs → SMEM (row-major, sO_epi) 4. SMEM → GMEM (direct write) This is the same pattern as the MoE kernel but with st.global instead of TMA store. Multi-CTA (D2) will use st.global with flat_divide coords. Removed: tma_o from FmhaParams, fmha_multihead_decode_tma_launch, sMbarStore from SMEM, broken TMA store PTX from fmha_tma.cuh.	2026-05-30 17:11:17 +00:00
biondizzle	f97359fbfc	P6: TMA store uses mbarrier completion (same as load) TMA store: cp.async.bulk.tensor.2d.global.shared::cluster.mbarrier::complete_tx::bytes Uses mbarrier for completion, not bulk_group. Restored sMbarStore to SMEM.	2026-05-30 17:07:24 +00:00
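
The dense routing path named in commit afcc690ddc (sqrt(softplus(gate)) → top-6 → renormalize, then weighted combine × routed_scaling_factor + shared expert) can be sketched in plain Python. This is an illustration only, not the repository's code: the function names `dense_route` and `combine`, the list-based tensors, and the argument order are assumptions made for the example.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dense_route(gate_logits, top_k=6):
    """Pick top_k experts by sqrt(softplus(logit)) and renormalize their weights.

    Hypothetical helper: mirrors the steps listed in the commit message only.
    """
    scores = [math.sqrt(softplus(g)) for g in gate_logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in top)
    weights = [scores[i] / total for i in top]  # weights now sum to 1
    return top, weights


def combine(expert_outputs, weights, shared_output, routed_scaling_factor):
    """Weighted sum of routed expert outputs, scaled, plus the shared expert."""
    dim = len(shared_output)
    routed = [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]
    return [routed_scaling_factor * r + s for r, s in zip(routed, shared_output)]
```

For example, `dense_route([float(i) for i in range(10)])` selects the six largest logits and returns weights that sum to 1; the hash-routing path for the first three layers (tid2eid lookup) is not covered here.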

1856 Commits
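
Commit 49282fe206 states that comb_mix uses softmax followed by Sinkhorn normalization, and d003c4b7cc describes B_l as lying on the Birkhoff polytope (doubly stochastic matrices). A minimal sketch of that idea for a 4×4 matrix follows. The row-wise softmax axis, the iteration count, and the function names are assumptions for illustration; the repository's actual kernel may differ.

```python
import math


def softmax_rows(logits):
    # Row-wise softmax with max-subtraction for stability.
    out = []
    for row in logits:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def sinkhorn(matrix, iters=20):
    """Alternate column and row normalization (Sinkhorn-Knopp).

    For a strictly positive square matrix this converges toward a
    doubly stochastic matrix. `iters` is an assumed value.
    """
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iters):
        for j in range(n):  # normalize each column to sum to 1
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
        for row in m:  # normalize each row to sum to 1
            row_sum = sum(row)
            for j in range(n):
                row[j] /= row_sum
    return m


def comb_mix(logits, iters=20):
    # softmax first, then Sinkhorn, as the commit message describes.
    return sinkhorn(softmax_rows(logits), iters)
```

Because softmax output is strictly positive, the alternating normalization is well defined (no division by zero), which is one plausible reason to prefer it over a raw `exp` that can overflow for large logits.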