nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	59c75ca4e9	fix: cast attn_out back to BF16 after sink correction	2026-05-31 06:07:06 +00:00
biondizzle	e5245ea34e	fix: V tensor must be (B, n_h, hd, N) for FMHA — was transposed wrong	2026-05-31 06:03:13 +00:00
biondizzle	91abf0f921	FMHA + analytic sink bias correction using LSE. Instead of SDPA with a virtual sink position, use the production FMHA kernel and apply the sink bias as a post-hoc correction on the output. The correction is: O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink)). This simulates the attention sink (paper D5c) without modifying the FMHA kernel. The sink absorbs some attention mass: it adds exp(sink) to the softmax denominator, which scales the output down.	2026-05-31 05:58:01 +00:00
biondizzle	fac269c938	fix verify_attention: proper multi-head SDPA + GQA	2026-05-31 05:55:10 +00:00
biondizzle	2333fc8b4b	fix verify_attention.py: proper nvfp4_linear calls	2026-05-31 05:53:49 +00:00
biondizzle	c09f68c867	add verify_attention.py: single-layer attention component test	2026-05-31 05:51:36 +00:00
biondizzle	04dd7545b3	switch to production FMHA for full run	2026-05-31 04:51:16 +00:00
biondizzle	738088cf49	revert: K=V with RoPE + inverse RoPE is the correct DSV4 approach	2026-05-31 04:51:10 +00:00
biondizzle	781ee43521	try separate K (RoPE'd) and V (raw) — no inverse RoPE needed	2026-05-31 04:46:14 +00:00
biondizzle	889521009b	re-enable inverse RoPE (confirmed necessary — without it output is garbage)	2026-05-31 04:45:58 +00:00
biondizzle	92e465ca04	debug: disable inverse RoPE to check impact on output	2026-05-31 04:40:34 +00:00
biondizzle	c69dc51b3b	switch to SDPA with sinks (better residual control)	2026-05-31 04:38:41 +00:00
biondizzle	3ed8f3cc44	switch back to production FMHA kernel (with FP4 LUT fix)	2026-05-31 04:32:01 +00:00
biondizzle	ae79bd8fce	debug: add top-5 logit predictions	2026-05-31 04:25:01 +00:00
biondizzle	aafe2eee12	CRITICAL FIX: FP4 LUT was 4x too large! E2M1 magnitudes are [0, 0.5, 1, 1.5, 2, 3, 4, 6] NOT [0, 2, 3, 4, 6, 8, 12, 24]. The old LUT was 4x the correct values, causing every NVFP4 dequantized weight to be 4x too large. This compounded across 61 layers, causing the residual stream to explode and producing gibberish output. This is the root cause of the residual growth and incoherent generation.	2026-05-31 04:16:13 +00:00
biondizzle	b8c8da91fe	fix: restore RoPE functions that were lost during mHC refactor	2026-05-31 04:10:51 +00:00
biondizzle	3f04a72af4	refactor: use production mHCLayer from dsv4.layers.mhc. Replace the custom mHCBlock with a wrapper around the tested production mHCLayer class. This eliminates any bugs in my custom implementation and uses the same code path that the model was designed for. Weight mapping: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res; base[0:4]=S_pre, base[4:8]=S_post, base[8:24]=S_res; scale[0]=alpha_pre, scale[1]=alpha_post, scale[2]=alpha_res.	2026-05-31 04:06:58 +00:00
biondizzle	b519108cab	fix: restore kv_cache.append that was accidentally removed	2026-05-31 03:56:58 +00:00
biondizzle	22a89b5a45	add attention sinks to SDPA path (paper D5c)	2026-05-31 03:52:59 +00:00
biondizzle	1905f19b8d	fix: define q_input before USE_SDPA branch	2026-05-31 03:45:09 +00:00
biondizzle	cd073ad867	use PyTorch SDPA for correctness (no sink bias in FMHA kernel yet)	2026-05-31 03:42:03 +00:00
biondizzle	171a9e0d10	disable diagnostics for clean production run	2026-05-31 03:32:17 +00:00
biondizzle	3f9b441428	diag: fix n_layers reference in forward_layer, add late-layer diags	2026-05-31 03:28:53 +00:00
biondizzle	5b834a0599	diag: add late-layer diagnostics, fix ffn ctx variable	2026-05-31 03:25:55 +00:00
biondizzle	690c0a1121	CRITICAL FIX: mHC base/scale ordering was wrong. Checkpoint order is [pre, post, res], not [pre, res, post]: base[0:4] = S_pre, base[4:8] = S_post, base[8:24] = S_res; scale[0] = alpha_pre, scale[1] = alpha_post, scale[2] = alpha_res; W_stacked rows: [W_pre(4), W_post(4), W_res(16)]; projection split: A_raw=proj[:,0:4], C_raw=proj[:,4:8], B_raw=proj[:,8:24]. This was causing B_l to be near-identity and C_l to be near-2.0, leading to exponential residual stream growth.	2026-05-31 03:16:07 +00:00
biondizzle	c3a2656c48	diag: add FFN and pre_block diagnostics	2026-05-31 03:12:52 +00:00
biondizzle	79ba7e6636	diag: add mHC diagnostics for first 3 layers	2026-05-31 03:10:05 +00:00
biondizzle	a262492e51	fix: FMHA K/V tensor shape (was permuting cache), add q_a_norm and kv_norm	2026-05-31 03:04:53 +00:00
biondizzle	3f12bbc374	fix: move positions tensor to correct GPU for RoPE	2026-05-31 02:54:47 +00:00
biondizzle	0c3d168c60	single_shot: stream weights per-layer from CPU, fix KV/RoPE logic	2026-05-31 02:53:40 +00:00
biondizzle	61160ace13	fix: expert_weights/ids scoping in hash routing path	2026-05-31 02:50:32 +00:00
biondizzle	d772885d7e	single_shot_inference: proper mHC+RMSNorm+inverse RoPE pipeline. Major rewrite of single_shot_inference.py: replace broken mHC (gentle normalization hack) with proper Sinkhorn-Knopp; add RMSNorm before each sub-block (attention + FFN); add inverse RoPE on attention output (paper §2.3.3); fix KV cache (RoPE applied before caching, K=V in DSV4 MQA); fix MoE (proper dense routing with e_bias, SwiGLU clamping); proper weight mapping (fn→W_stacked, base→S_pre/S_res/S_post, scale→alphas); add identity mHC fallback when weights are missing; no emergency normalization, no bandaids.	2026-05-31 02:45:52 +00:00
biondizzle	523b0e47b1	Add gentle RMSNorm: only clamps when values exceed unit norm	2026-05-31 00:31:34 +00:00
biondizzle	dcbb74841a	Remove emergency RMSNorm from mHC post_block — MoE provides balance now	2026-05-31 00:27:48 +00:00
biondizzle	1de241ccfe	Fix: add all_tokens tracking for decode loop	2026-05-31 00:22:08 +00:00
biondizzle	b1dd59293a	Add prefill: process prompt tokens to fill KV cache before decoding	2026-05-31 00:18:55 +00:00
biondizzle	178fb5483a	Fix KV cache: use index 0 (one-layer cache per layer instance)	2026-05-31 00:14:58 +00:00
biondizzle	afcc690ddc	Add full MoE routing + KV cache to single_shot. MoE: hash routing (first 3 layers): tid2eid lookup → 6 experts, uniform weights; dense routing (remaining): sqrt(softplus(gate)) → top-6 → renormalize; 384 NVFP4 experts, each gate+up+down with SwiGLU clamping; weighted combine × routed_scaling_factor + shared expert. KV cache: SimpleKVCache, BF16 flat (1, max_seq, hd) per layer; appends new K,V each decode step; FMHA now attends over the full cached sequence (not just the current token); RoPE applied per-position on K cache. This should produce meaningful output: the model now has all architectural components except proper mHC normalization.	2026-05-31 00:11:15 +00:00
biondizzle	3ecfbcba57	Fix T scope in post_block	2026-05-31 00:02:29 +00:00
biondizzle	a493f72681	Add per-residual RMSNorm in mHC post_block (routed MoE missing). Without routed experts, F_out is always positive, causing unbounded growth. Emergency RMSNorm on the residual keeps values bounded. Remove once MoE is wired.	2026-05-30 23:59:19 +00:00
biondizzle	49282fe206	Fix mHC: match vLLM torch reference exactly. Key corrections: RMSNorm applied to projection output (mixes = rsqrt(sqrsum/K + eps)), not to the input before projection; comb_mix uses softmax + Sinkhorn, NOT exp + Sinkhorn; pre_mix = sigmoid(logits) + eps (not matmul with X_l); layer_input = sum(pre_mix * residual), a weighted sum, not bmm; post_mix = sigmoid * hc_post_mult_value (2.0); bias split: [pre(4), post(4), comb(16)], not [pre(4), comb(16), post(4)].	2026-05-30 23:55:27 +00:00
biondizzle	66a66f8244	Add per-layer NaN tracking for mHC debug	2026-05-30 23:48:32 +00:00
biondizzle	d003c4b7cc	Add mHC (Manifold-Constrained Hyper-Connections) to single_shot. Full mHC pre_block/post_block with Sinkhorn-Knopp normalization; dynamic A_l (sigmoid), B_l (Birkhoff polytope), C_l (2*sigmoid); checkpoint: attn_hc.fn (24,28672) + base (24,) + scale (3,); two mHC blocks per layer: attn_hc + ffn_hc; removed emergency RMSNorm, since mHC handles normalization properly; X_l: (1, n_hc=4, H) residual state, init from embedding broadcast.	2026-05-30 23:45:18 +00:00
biondizzle	f567c20539	Fix: set active CUDA device per layer for BMM/FMHA	2026-05-30 23:39:45 +00:00
biondizzle	7a95983e0f	Rewrite single_shot: 8-GPU pipeline parallel. Loads all 95 shards and assigns layers round-robin across 8 B200s; ~8 layers per GPU, ~118GB weights per GPU (fits in 183GB); 3-phase pipeline: load weights → JIT compile → inference; activations move between GPUs at layer boundaries (NVLink); no streaming, no shard caching, no per-layer CPU loads; includes timing for each phase.	2026-05-30 23:36:14 +00:00
biondizzle	aac0fa1f08	Update STATUS.md + MEMORY.md: single-shot inference verified	2026-05-30 22:59:27 +00:00
biondizzle	11c010e567	Update output section: kernel verified, architecture gaps noted	2026-05-30 22:58:49 +00:00
biondizzle	53178d2536	Add emergency RMSNorm after residuals (missing mHC fallback). Without mHC, values explode to 761K after the first layer. Added per-residual RMSNorm + BF16 clamp to keep values bounded. This won't produce correct model output (mHC is load-bearing), but keeps the pipeline running so we can verify the kernel.	2026-05-30 22:56:16 +00:00
biondizzle	172ba75e0c	Add per-layer NaN check to track where values diverge	2026-05-30 22:54:57 +00:00
biondizzle	ec7846e28c	Add NaN tracking to single_shot_inference	2026-05-30 22:53:09 +00:00
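
Commit aafe2eee12 above gives the E2M1 magnitudes as [0, 0.5, 1, 1.5, 2, 3, 4, 6]. The sketch below shows one way to derive that table from the bit layout rather than hard-coding it. It assumes the usual E2M1 encoding (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit, subnormals when the exponent field is 0). `decode_e2m1` and `E2M1_MAGNITUDES` are illustrative names, not identifiers from this repository, and the NVFP4 per-block scale factors are out of scope here.

```python
def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code into a float (assumed layout: s | ee | m)."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0x8 else 1.0
    exponent = (code >> 1) & 0x3
    mantissa = code & 0x1
    if exponent == 0:
        # Subnormal range: no implicit leading 1, so the values are 0 and 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal range: implicit leading 1 and an exponent bias of 1.
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# Lookup table for the eight non-negative codes (sign bit clear).
E2M1_MAGNITUDES = [decode_e2m1(code) for code in range(8)]
```

A table built this way can be compared against a hand-written LUT in a unit test, which would flag a uniformly scaled table like the one the commit describes.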

1893 Commits
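
Commit 91abf0f921 above describes the sink correction O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink)). The pure-Python sketch below checks that identity on a single query. It assumes the sink behaves as one extra softmax logit whose value vector is zero, and that `lse` is the natural-log-sum-exp of the same scaled scores the kernel used. All function names are hypothetical and none of this is the repository's FMHA code.

```python
import math


def attention_with_lse(scores, values):
    """Plain softmax attention for one query; returns (output, lse)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, m + math.log(z)


def attention_with_sink(scores, values, sink):
    """Reference: softmax over the scores plus one sink logit with a zero value."""
    m = max(max(scores), sink)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights) + math.exp(sink - m)  # the sink only enlarges the denominator
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]


def apply_sink_correction(out, lse, sink):
    """Post-hoc correction: scale by exp(lse) / (exp(lse) + exp(sink))."""
    # Written as sigmoid(lse - sink) so exp(lse) is never formed directly.
    scale = 1.0 / (1.0 + math.exp(sink - lse))
    return [scale * o for o in out]


scores = [0.3, -1.2, 2.0]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
sink = 0.7

raw, lse = attention_with_lse(scores, values)
corrected = apply_sink_correction(raw, lse, sink)
reference = attention_with_sink(scores, values, sink)
```

In a batched setting the same scale would be computed per head and per query position, with `sink` broadcast per head. That layout is an assumption and is not confirmed by the log.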