nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	aafe2eee12	CRITICAL FIX: FP4 LUT was 4x too large! E2M1 magnitudes are [0, 0.5, 1, 1.5, 2, 3, 4, 6] NOT [0, 2, 3, 4, 6, 8, 12, 24]. The old LUT was 4x the correct values, causing every NVFP4 dequantized weight to be 4x too large. This compounded across 61 layers, causing the residual stream to explode and producing gibberish output. This is the root cause of the residual growth and incoherent generation.	2026-05-31 04:16:13 +00:00
biondizzle	b8c8da91fe	fix: restore RoPE functions that were lost during mHC refactor	2026-05-31 04:10:51 +00:00
biondizzle	3f04a72af4	refactor: use production mHCLayer from dsv4.layers.mhc Replace custom mHCBlock with wrapper around the tested production mHCLayer class. This eliminates any bugs in my custom implementation and uses the same code path that the model was designed for. Weight mapping: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res base[0:4]=S_pre, base[4:8]=S_post, base[8:24]=S_res scale[0]=alpha_pre, scale[1]=alpha_post, scale[2]=alpha_res	2026-05-31 04:06:58 +00:00
biondizzle	b519108cab	fix: restore kv_cache.append that was accidentally removed	2026-05-31 03:56:58 +00:00
biondizzle	22a89b5a45	add attention sinks to SDPA path (paper D5c)	2026-05-31 03:52:59 +00:00
biondizzle	1905f19b8d	fix: define q_input before USE_SDPA branch	2026-05-31 03:45:09 +00:00
biondizzle	cd073ad867	use PyTorch SDPA for correctness (no sink bias in FMHA kernel yet)	2026-05-31 03:42:03 +00:00
biondizzle	171a9e0d10	disable diagnostics for clean production run	2026-05-31 03:32:17 +00:00
biondizzle	3f9b441428	diag: fix n_layers reference in forward_layer, add late-layer diags	2026-05-31 03:28:53 +00:00
biondizzle	5b834a0599	diag: add late-layer diagnostics, fix ffn ctx variable	2026-05-31 03:25:55 +00:00
biondizzle	690c0a1121	CRITICAL FIX: mHC base/scale ordering was wrong Checkpoint order is [pre, post, res] not [pre, res, post]: - base[0:4] = S_pre, base[4:8] = S_post, base[8:24] = S_res - scale[0] = alpha_pre, scale[1] = alpha_post, scale[2] = alpha_res - W_stacked rows: [W_pre(4), W_post(4), W_res(16)] - Projection split: A_raw=proj[:,0:4], C_raw=proj[:,4:8], B_raw=proj[:,8:24] This was causing B_l to be near-identity and C_l to be near-2.0, leading to exponential residual stream growth.	2026-05-31 03:16:07 +00:00
biondizzle	c3a2656c48	diag: add FFN and pre_block diagnostics	2026-05-31 03:12:52 +00:00
biondizzle	79ba7e6636	diag: add mHC diagnostics for first 3 layers	2026-05-31 03:10:05 +00:00
biondizzle	a262492e51	fix: FMHA K/V tensor shape (was permuting cache), add q_a_norm and kv_norm	2026-05-31 03:04:53 +00:00
biondizzle	3f12bbc374	fix: move positions tensor to correct GPU for RoPE	2026-05-31 02:54:47 +00:00
biondizzle	0c3d168c60	single_shot: stream weights per-layer from CPU, fix KV/RoPE logic	2026-05-31 02:53:40 +00:00
biondizzle	61160ace13	fix: expert_weights/ids scoping in hash routing path	2026-05-31 02:50:32 +00:00
biondizzle	d772885d7e	single_shot_inference: proper mHC+RMSNorm+inverse RoPE pipeline Major rewrite of single_shot_inference.py: - Replace broken mHC (gentle normalization hack) with proper Sinkhorn-Knopp - Add RMSNorm before each sub-block (attention + FFN) - Add inverse RoPE on attention output (paper §2.3.3) - Fix KV cache: RoPE applied before caching, K=V in DSV4 MQA - Fix MoE: proper dense routing with e_bias, SwiGLU clamping - Proper weight mapping: fn→W_stacked, base→S_pre/S_res/S_post, scale→alphas - Add identity mHC fallback when weights missing - No emergency normalization, no bandaids	2026-05-31 02:45:52 +00:00
biondizzle	523b0e47b1	Add gentle RMSNorm: only clamps when values exceed unit norm	2026-05-31 00:31:34 +00:00
biondizzle	dcbb74841a	Remove emergency RMSNorm from mHC post_block — MoE provides balance now	2026-05-31 00:27:48 +00:00
biondizzle	1de241ccfe	Fix: add all_tokens tracking for decode loop	2026-05-31 00:22:08 +00:00
biondizzle	b1dd59293a	Add prefill: process prompt tokens to fill KV cache before decoding	2026-05-31 00:18:55 +00:00
biondizzle	178fb5483a	Fix KV cache: use index 0 (one-layer cache per layer instance)	2026-05-31 00:14:58 +00:00
biondizzle	afcc690ddc	Add full MoE routing + KV cache to single_shot MoE: - Hash routing (first 3 layers): tid2eid lookup → 6 experts, uniform weights - Dense routing (remaining): sqrt(softplus(gate)) → top-6 → renormalize - 384 NVFP4 experts, each gate+up+down with SwiGLU clamping - Weighted combine × routed_scaling_factor + shared expert KV cache: - SimpleKVCache: BF16 flat (1, max_seq, hd) per layer - Appends new K,V each decode step - FMHA now attends over full cached sequence (not just current token) - RoPE applied per-position on K cache This should produce meaningful output — the model now has all architectural components except proper mHC normalization.	2026-05-31 00:11:15 +00:00
biondizzle	3ecfbcba57	Fix T scope in post_block	2026-05-31 00:02:29 +00:00
biondizzle	a493f72681	Add per-residual RMSNorm in mHC post_block (routed MoE missing) Without routed experts, F_out is always positive, causing unbounded growth. Emergency RMSNorm on the residual keeps values bounded. Remove once MoE is wired.	2026-05-30 23:59:19 +00:00
biondizzle	49282fe206	Fix mHC: match vLLM torch reference exactly Key corrections: - RMSNorm applied to projection output (mixes = rsqrt(sqrsum/K + eps)) not to the input before projection - comb_mix uses softmax + Sinkhorn, NOT exp + Sinkhorn - pre_mix = sigmoid(logits) + eps (not matmul with X_l) - layer_input = sum(pre_mix * residual) — weighted sum, not bmm - post_mix = sigmoid * hc_post_mult_value (2.0) - bias split: [pre(4), post(4), comb(16)] not [pre(4), comb(16), post(4)]	2026-05-30 23:55:27 +00:00
biondizzle	66a66f8244	Add per-layer NaN tracking for mHC debug	2026-05-30 23:48:32 +00:00
biondizzle	d003c4b7cc	Add mHC (Manifold-Constrained Hyper-Connections) to single_shot - Full mHC pre_block/post_block with Sinkhorn-Knopp normalization - Dynamic A_l (sigmoid), B_l (Birkhoff polytope), C_l (2*sigmoid) - Checkpoint: attn_hc.fn (24,28672) + base (24,) + scale (3,) - Two mHC blocks per layer: attn_hc + ffn_hc - Removed emergency RMSNorm — mHC handles normalization properly - X_l: (1, n_hc=4, H) residual state, init from embedding broadcast	2026-05-30 23:45:18 +00:00
biondizzle	f567c20539	Fix: set active CUDA device per layer for BMM/FMHA	2026-05-30 23:39:45 +00:00
biondizzle	7a95983e0f	Rewrite single_shot: 8-GPU pipeline parallel - Loads all 95 shards, assigns layers round-robin across 8 B200s - ~8 layers per GPU, ~118GB weights per GPU (fits in 183GB) - 3-phase pipeline: load weights → JIT compile → inference - Activations move between GPUs at layer boundaries (NVLink) - No streaming, no shard caching, no per-layer CPU loads - Includes timing for each phase	2026-05-30 23:36:14 +00:00
biondizzle	aac0fa1f08	Update STATUS.md + MEMORY.md: single-shot inference verified	2026-05-30 22:59:27 +00:00
biondizzle	11c010e567	Update output section: kernel verified, architecture gaps noted	2026-05-30 22:58:49 +00:00
biondizzle	53178d2536	Add emergency RMSNorm after residuals (missing mHC fallback) Without mHC, values explode to 761K after first layer. Added per-residual RMSNorm + BF16 clamp to keep values bounded. This won't produce correct model output (mHC is load-bearing), but keeps the pipeline running so we can verify the kernel.	2026-05-30 22:56:16 +00:00
biondizzle	172ba75e0c	Add per-layer NaN check to track where values diverge	2026-05-30 22:54:57 +00:00
biondizzle	ec7846e28c	Add NaN tracking to single_shot_inference	2026-05-30 22:53:09 +00:00
biondizzle	5fa6c88b17	Fix: replace FP4 Inf with 24 (avoid NaN in dequant)	2026-05-30 22:51:10 +00:00
biondizzle	904753f62a	Fix: BMM batch dim alignment for wo_a	2026-05-30 22:49:21 +00:00
biondizzle	52df3bc26c	Fix: wo_a as batched matmul (grouped linear for output projection)	2026-05-30 22:48:31 +00:00
biondizzle	19240608d7	Fix: handle o_a_proj grouped linear shape mismatch	2026-05-30 22:46:12 +00:00
biondizzle	1d02758416	Fix: kv_proj outputs hd=512 (1 KV head MQA), Z from compressor.gate_proj	2026-05-30 22:45:14 +00:00
biondizzle	5dcfb333ea	Fix: move weight tensors to CUDA before dequant	2026-05-30 22:43:47 +00:00
biondizzle	47c7b3c50b	Fix: ensure FP4 LUT on CUDA before index op	2026-05-30 22:43:01 +00:00
biondizzle	13bae9dd55	Fix single_shot: mHC replaces layernorm, no hidden-level norm in DSV4	2026-05-30 22:42:17 +00:00
biondizzle	e8334fc4af	Rewrite single_shot_inference.py — complete forward pass - NVFP4 dequant with proper E2M1 LUT + E4M3 scale + global scale - RoPE (GPT-J partial, last 64 dims) - Q low-rank projection (q_a → q_b) - KV projection (layer-type-aware: HCA/CSA/SWA) - Production FMHA kernel (tcgen05 MMA) - Output projection: o_a (BF16 grouped) → o_b (NVFP4) - Shared expert FFN (gate/up/down, SiLU) - RMSNorm for both attention and FFN - Streaming weight loading (one layer at a time)	2026-05-30 22:40:56 +00:00
biondizzle	9b0858aa35	Add single_shot_inference.py — baseline kernel verification Streams weights one layer at a time from 95 safetensors shards. NVFP4 dequant → BF16 matmul for baseline (production uses tcgen05 MMA). Runs token-by-token decode loop with production FMHA kernel. Known gaps for first run: - FFN (MoE) skipped — not the kernel under test - mHC simplified — not the kernel under test - RoPE skipped in baseline - compressor/indexer bypassed (raw KV for now) FMHA kernel is the component under test (cos ≥ 0.999993).	2026-05-30 22:39:01 +00:00
biondizzle	4472928506	E3: model construction test	2026-05-30 21:22:34 +00:00
biondizzle	afc07a5d1a	Update STATUS.md: E5 done	2026-05-30 21:21:47 +00:00
biondizzle	df6220abaf	E5: Fold batch loop into native kernel grid (blockIdx.z) The 6-warp multi-tile kernel already supports batch natively via dim3 grid(1, n_h, batch). Removed Python for-loop for 4D input. Single kernel launch per layer for batched decode instead of batch_size launches. T>1 prefill still uses per-batch dispatch (E8 future work).	2026-05-30 21:21:02 +00:00
biondizzle	e162a2d112	Update STATUS.md: E1-E4 done	2026-05-30 21:20:10 +00:00
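
Note on commit aafe2eee12: the E2M1 magnitudes it lists, [0, 0.5, 1, 1.5, 2, 3, 4, 6], can be derived from the 4-bit layout rather than typed in by hand. The sketch below is illustrative only and is not taken from the repository. It assumes the usual E2M1 layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit), and the names `e2m1_magnitude`, `E2M1_LUT`, and `decode_fp4` are hypothetical. The E4M3 block scale and global scale mentioned in e8334fc4af are not modeled here.

```python
def e2m1_magnitude(code: int) -> float:
    """Magnitude of a 3-bit E2M1 code (sign bit already stripped).

    Assumed layout: bits [2:1] = exponent (bias 1), bit [0] = mantissa.
    """
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the value is 0 or 0.5.
        return 0.5 * man
    # Normal: implicit leading 1, half-step mantissa, exponent bias of 1.
    return (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# Hypothetical lookup table built from the bit layout.
E2M1_LUT = [e2m1_magnitude(c) for c in range(8)]


def decode_fp4(nibble: int) -> float:
    """Decode one 4-bit value: top bit is the sign, low 3 bits index the LUT."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_LUT[nibble & 0b111]


if __name__ == "__main__":
    print(E2M1_LUT)
```

Generating the table this way gives a cheap unit test against any hard-coded LUT before dequantized weights reach the model.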

1879 Commits