nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	be476b2ce2	router: catch CuTeDSL warmup failures fast, don't let MLIR errors slow down init	2026-06-01 00:00:07 +00:00
biondizzle	56dff8d185	fix: W_gate is (H, E) but F.linear expects (E, H), transpose before linear	2026-05-31 23:55:16 +00:00
biondizzle	5396a04c28	router: broaden except to catch all CuTeDSL errors, fall through to cuBLAS+activation_topk path	2026-05-31 23:54:16 +00:00
biondizzle	3b5b9f487c	fix: compute num_tma_load_bytes inside cute.compile context	2026-05-31 23:53:13 +00:00
biondizzle	1bc0da0f35	fix: properly scope swap code inside else/guard blocks, replace continue with if guard	2026-05-31 23:51:43 +00:00
biondizzle	d0d765e1f2	fix: replace break statements with flag-based loops in router kernel (CuTeDSL restriction)	2026-05-31 23:50:39 +00:00
biondizzle	210391e571	fix: PersistentTileSchedulerParams constructor takes (problem_shape, cluster_shape) not from_shape	2026-05-31 23:49:12 +00:00
biondizzle	824d054ad7	fix: inside cute.compile args are already CuTe tensors, no conversion needed	2026-05-31 23:47:33 +00:00
biondizzle	6375e54396	fix: use from_dlpack + mark_layout_dynamic instead of non-existent to_cuTe_tensor in router	2026-05-31 23:46:35 +00:00
biondizzle	cb2ca8591f	fix: add @cute.jit to router compiled function	2026-05-31 23:44:53 +00:00
biondizzle	d5d2b7b4b8	fix: defer router MMA/TMA setup into cute.compile context (matches MoE pattern)	2026-05-31 23:44:00 +00:00
biondizzle	157f1c5258	fix: use OperandMajorMode from nvgpu (not deprecated tcgen05) and mma_tiler_mn in router kernel	2026-05-31 23:39:50 +00:00
biondizzle	1dbc57e2cd	fix: use mma_tiler_mn in _create_tiled_mma (attribute exists at init time)	2026-05-31 23:36:01 +00:00
biondizzle	d05dd50bf5	fix: OperandMajorMode.K not MAJOR_K (correct CuTeDSL API)	2026-05-31 23:34:54 +00:00
biondizzle	a6a8755439	single_shot: switch to head-packed FMHA dispatch (1 kernel launch vs 128)	2026-05-31 23:33:32 +00:00
biondizzle	80002f2efc	single_shot: production NVFP4 GEMM for ALL attention projections - Nvfp4Linear (CuTeDSL) for q_a, q_b, kv, o_b — NO more dequant+matmul - Production FMHA (6-warp TMA multi-tile) with per-head sink bias - Production MoE + Router + SharedExpert + mHC (unchanged) - wo_a still uses BF16 grouped BMM (checkpoint is BF16) - Compressor/Indexer still PyTorch ref (not yet on tensor cores) - Proper weight dimensions: q_a(7168->1536), q_b(1536->65536), kv(7168->512), o_b(16384->7168)	2026-05-31 23:28:16 +00:00
biondizzle	32efd5139d	Fix gate weight transpose: checkpoint is (E, H), Router expects (H, E)	2026-05-31 23:21:09 +00:00
biondizzle	e45c0ff51b	single_shot: use reference dequant for attn projections, focus on MoE+FMHA Nvfp4Linear causing CUDA context corruption (likely CuTeDSL JIT triggered by _ensure_initialized). Disable for now to validate the critical paths first: - Production FMHA with sink bias - Production MoE (Nvfp4MoE + Nvfp4SharedExpert) - Production Router (dense/hash) - Production mHC Attention projections use reference dequant+matmul for now. Will re-enable Nvfp4Linear after validating MoE path.	2026-05-31 23:20:04 +00:00
biondizzle	dfbffa1df1	single_shot: CUDA_LAUNCH_BLOCKING for debugging	2026-05-31 23:18:35 +00:00
biondizzle	a66fdf6049	single_shot: add sync to catch CUDA errors early	2026-05-31 23:17:46 +00:00
biondizzle	0b35c36d23	single_shot: memory-efficient MoE loading, lazy Nvfp4Linear init - MoE expert weights loaded per-expert to GPU (no huge CPU tensors) - Nvfp4Linear finalize_weights deferred (lazy on first forward) - Shared expert weights loaded directly to GPU - Added GPU cache cleanup at start - Fixed shared expert finalize_weights (now lazy)	2026-05-31 23:16:45 +00:00
biondizzle	050b5ee449	Fix n_h reference before assignment in single_shot	2026-05-31 23:14:24 +00:00
biondizzle	c5adbbfde6	FMHA sink: don't double-scale sink bias The sink bias from the checkpoint is already in the scaled domain (added to QKscale in the reference softmax). The kernel's running_max is max(QKscale), so the sink should be compared directly without multiplying by scale again.	2026-05-31 23:12:20 +00:00
biondizzle	4adee1207f	FMHA: zero-init my_p_vals to fix N<128 padding NaN When N<128, padded KV positions have my_p_vals[col] uninitialized for col >= kv_len. The PV GEMM then computes garbage_P × zero_V, which can produce NaN on tensor cores (0 × NaN = NaN). Fix: zero-initialize my_p_vals so padded positions contribute 0.	2026-05-31 23:11:12 +00:00
biondizzle	13be3ad443	FMHA sink bias in kernel + single_shot production rewrite FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh): - Added sink_bias field to FmhaTmaMultiRowMultiTileParams - After KV tile loop, sink logit is included in online softmax rescale: new_max = max(running_max, sink_bias * scale) rescale existing O_unnorm and running_sum running_sum += exp(sink_bias * scale - new_max) No PV contribution from sink (D5c: single softmax) - C API: fmha_multitile_decode_launch now takes sink_bias_ptr - Python: fmha_multitile_decode_raw accepts attn_sink tensor single_shot_inference.py: - Full rewrite to use production kernel stack - mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp) - Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b - FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback) - MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback) - Router: production dense/hash dispatch - Compressor/Indexer: reference dequant (not yet on tensor cores) - NO try/except fallbacks on production paths	2026-05-31 23:10:13 +00:00
biondizzle	23e88638aa	single_shot: memory-efficient MoE loading (CPU stacking, one-shot GPU transfer) Build stacked (E, N, K) tensors incrementally on CPU, then move to GPU in one shot. Avoids holding 384 individual expert weight+scale tensors on GPU simultaneously (~3x memory savings per layer).	2026-05-31 22:55:11 +00:00
biondizzle	92200367f3	FMHA kernel fix: N_orig vs N_padded — correct softmax masking for seq_len < 128 ROOT CAUSE: fmha_multitile_op.py padded N to 128 for TMA alignment but then passed the PADDED N to the kernel as s_k (logical KV length). This told the kernel all 128 entries were valid, so softmax ran over zeros, diluting the result (e.g. 1 valid entry → softmax weight 1/128). FIX: Pass N_orig (true sequence length) as s_k for softmax masking, and N_padded (physical size) only for TMA descriptor creation. The kernel's existing col < kv_len guard correctly excludes padded entries from row_max and exp_sum calculations. Files changed: - fmha_multitile_capi.cu: accept N_orig + N_padded, use N_orig for params.s_k and N_padded for TMA descriptors - fmha_multitile_op.py: pass N_orig and N_padded separately - single_shot_inference.py: removed SDPA fallback (kernel now correct)	2026-05-31 22:52:39 +00:00
biondizzle	d40821c843	single_shot: fix memory (no double-loading MoE weights), FMHA short-seq fallback - Don't cache MoE/SE expert weights in layer_w (handled by runners) This saves ~10.6GB/layer × 61 = ~647GB of double-loaded GPU memory - Add FMHA fallback for seq_len < 128 (known kernel limitation: zero-padding dilutes softmax). TODO: fix kernel to mask padded entries. - Free all_w and empty GPU caches after building runners	2026-05-31 22:49:15 +00:00
biondizzle	91568e12d4	single_shot_inference.py: production kernel stack version - FMHA: 6-warp TMA multi-tile kernel via dsv4_attention - MoE: Nvfp4MoE (CuTeDSL NVFP4 grouped GEMM, fused SwiGLU) - Shared expert: Nvfp4SharedExpert (CuTeDSL NVFP4 single-group GEMM) - Router: production dense/hash router kernels - Compressor: CSA/HCA token-level softmax - Indexer: score+topk - mHC: Sinkhorn-Knopp, B_l transposed, [pre,post,comb] - No PyTorch SDPA, no F.linear for kernel paths - Falls back to dequant BF16 only if production kernels fail - FP32 RoPE cache (BF16 destroys cos²+sin²=1)	2026-05-31 22:45:44 +00:00
biondizzle	fb96c34b89	rename: single_shot_inference.py → single_shot_PYTORCH_REFERENCE.py	2026-05-31 22:42:06 +00:00
biondizzle	79d1a83348	Add NEXT_STEPS.md: post v0.1 issues, kernel migration plan, lessons learned	2026-05-31 22:30:34 +00:00
biondizzle	acc20dffd7	CRITICAL FIX: don't fold input_scale into NVFP4 weight dequant input_scale is the activation quantization scale (for FP8 inputs). Since we use BF16 activations, the weight dequant is simply: lut[weight] * weight_scale * weight_scale_2 Folding input_scale in produced weights ~4000x too small, causing all attention and FFN outputs to be effectively zero. v0.1-e2e-working	2026-05-31 22:03:55 +00:00
biondizzle	4e64acbb64	fix MoE gate BF16/NVFP4 handling, add attention diagnostics	2026-05-31 21:57:47 +00:00
biondizzle	0d2b5ceb93	fix positions device mismatch: move to rope cache device in forward_attention	2026-05-31 21:54:56 +00:00
biondizzle	2676476013	fix mHC pre_block bmm dtype mismatch: A is FP32, X is BF16	2026-05-31 21:51:59 +00:00
biondizzle	eb08cd06d1	Rewrite single_shot_inference.py: correct weight keys, NVFP4 two-level scale, compressor+indexer connected - Fixed weight key format: model.layers.{li}.self_attn.* (was layers.{li}.attn.) - Added NVFP4 two-level scale: weight_scale weight_scale_2 * input_scale - Proper CSA compressor: overlapping Ca/Cb streams, token-level softmax - Proper HCA compressor: non-overlapping, single stream - Indexer: NVFP4 q_b_proj + weights_proj + own compressor at index_head_dim - Compressed KV (dim=hd) concatenated with SWA KV for attention - Correct MoE key format: gate_proj/up_proj/down_proj - Correct mHC key format: attn_hc.{fn,base,scale} and ffn_hc.{fn,base,scale} - No more disconnected compressor — full E2E pipeline	2026-05-31 21:48:59 +00:00
biondizzle	4988e77179	probe key format	2026-05-31 21:42:52 +00:00
biondizzle	ba915dbd53	add probe_shapes script	2026-05-31 21:41:31 +00:00
biondizzle	c54dd15550	find hc keys	2026-05-31 21:38:43 +00:00
biondizzle	52b4971711	Full E2E single-shot: compressor, indexer, correct checkpoint keys (layers.{li}.attn/ffn) - Fixed checkpoint key prefix: layers.{li}.attn.* and layers.{li}.ffn.* (was incorrectly model.layers.{li}.self_attn.* and .mlp.*) - Added Compressor (CSA ratio=4 overlapping, HCA ratio=128) - Added Indexer (CSA top-k selection) - Compressor wkv/wgate are BF16 (NOT NVFP4 — no .scale) - MoE gate is BF16 (not NVFP4) - Added KV cache with SWA ring buffer + compressed entries - Attention sinks as logit bias (paper D5c) - YaRN RoPE with factor=16 - Proper mHC with Sinkhorn-Knopp - HcHead for final mHC readout - Still TODO: proper compressed KV attention (currently SWA-only)	2026-05-31 21:36:17 +00:00
biondizzle	cec17fee7d	fixed prefix	2026-05-31 21:26:04 +00:00
biondizzle	696f3261ab	focused key dump	2026-05-31 21:25:31 +00:00
biondizzle	b7c9bb1262	dump all keys	2026-05-31 21:24:58 +00:00
biondizzle	54e2a3684a	filter expert keys	2026-05-31 21:24:35 +00:00
biondizzle	bafabda01f	add checkpoint key dump script	2026-05-31 21:24:14 +00:00
biondizzle	23f1cf4065	Fix HcHead: use FP32 for RMSNorm + linear (matches HF reference)	2026-05-31 21:13:21 +00:00
biondizzle	274ea13251	Fix critical bug: add hc_head for final mHC readout (was using stream 0) The model uses DeepseekV4HyperHead to project from the 4-stream mHC residual to the final hidden state. Just taking stream 0 (X[:,0,:]) is WRONG — the hc_head learns how to combine the 4 streams. Also: - Remove --no-thinking mode (this is a reasoning model, it MUST think) - Increase default max_tokens from 512 to 4096 - Load hc_head weights (fn, base, scale) from checkpoint	2026-05-31 21:13:02 +00:00
biondizzle	baee36e728	Fix dtype mismatch in validate_layer: cast flat to float before F.linear	2026-05-31 20:23:18 +00:00
biondizzle	46c4ef2cf5	Add per-layer validation test (tests/validate_layer.py) Compares forward_layer output with step-by-step PyTorch reference to identify where residual blowup originates. Uses our own NVFP4 dequant — no HF dependency.	2026-05-31 20:22:13 +00:00
biondizzle	abe4210367	Add compact per-layer residual trace (GROWTH_DIAG), disable verbose ATTN_DIAG	2026-05-31 20:21:03 +00:00

1 2 3 4 5 ...

1985 Commits