nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	ca661d32e8	Empty system prompt for testing (was causing model to regurgitate AI assistant tokens)	2026-05-31 19:03:55 +00:00
biondizzle	b09b2cf511	Fix MoE routing: hash layers 0-2 (tid2eid), e_score_correction_bias for layers 3+ - Layers 0-2 use hash routing (tid2eid lookup, uniform weights) - Layers 3+ use noaux_tc (sqrt(softplus) + e_score_correction_bias for selection only) - Fixed e_bias key name: e_score_correction_bias (not e_bias) - Hash routing detection: check tid2eid present AND e_score_correction_bias absent	2026-05-31 18:52:38 +00:00
biondizzle	7b123d159f	CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection): 1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb] - Was applying Sinkhorn to post values and 2*sigmoid to comb values - This caused residual to grow unbounded (no doubly-stochastic constraint) 2. comb (B_l) must be TRANSPOSED in post_block - HF: comb.transpose(-1,-2) @ hidden_streams - Was using B_l @ X_l without transpose 3. Sinkhorn must start from softmax(logits) + eps, not exp(logits) - HF: softmax → col norm → (iters-1) alternating - Was using exp → alternating (different convergence behavior) 4. Missing hc_eps on pre (A_l) - HF: sigmoid(...) + hc_eps - Was missing the eps guard 5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout - Matches checkpoint naming and HF model 6. Fixed fallback mHC initialization to use new API	2026-05-31 18:38:12 +00:00
biondizzle	581c4170f9	Fix sink logits shape: (n_h, T, 1) for concatenation with (n_h, T, seq_len)	2026-05-31 11:57:23 +00:00
biondizzle	0f951a0b1a	Fix attention sinks: logit bias (HuggingFace reference), not dummy KV The HuggingFace reference treats attention sinks as a logit bias: 1. Compute raw Q*K scores 2. Concatenate sinks as a logit column 3. Softmax the combined logits 4. DROP the sink column (don't multiply by V) 5. Multiply by V Our old code added sinks as a dummy zero-KV entry, which diluted attention weights by adding an extra V=0 position to the softmax.	2026-05-31 11:53:43 +00:00
biondizzle	daed594902	CRITICAL FIX: Add missing q_b_norm (unweighted RMSNorm after q_b_proj) The HuggingFace reference (DeepseekV4ForCausalLM) applies an unweighted RMSNorm after q_b_proj, normalizing Q before attention. Without it, Q magnitudes are too large, causing attention scores to collapse to uniform (entropy ~3.2 with 24 positions) and the model to produce garbage. q_b_norm has no learnable parameters — just q / RMS(q). This explains the nearly-uniform attention weights we've been seeing.	2026-05-31 11:47:16 +00:00
biondizzle	dd50c355a6	Fix MHC_DIAG null check when SKIP_MHC is enabled	2026-05-31 11:37:32 +00:00
biondizzle	631e6ea3e4	Add --skip-mhc flag for simple residual diagnostic When enabled, bypasses mHC pre/post blocks and uses direct residual connections with 0.1 scaling. This helps isolate whether the mHC implementation is causing the garbage output.	2026-05-31 11:33:41 +00:00
biondizzle	d201a9334e	CRITICAL FIX: Add YaRN RoPE scaling (factor=16) The DSV4 Pro model uses rope_type='yarn' with factor=16. Our build_rope_cache was using standard RoPE with theta=10000, completely ignoring YaRN scaling. This produced wrong cos/sin values for all positions, causing incorrect attention scores and garbage output. YaRN modifies the RoPE frequencies: - High-frequency components: unchanged - Low-frequency components: scaled by 1/factor - Medium: smooth interpolation Config: factor=16, beta_fast=32, beta_slow=1, orig_max_pos=65536	2026-05-31 11:25:52 +00:00
biondizzle	88719f39b4	Add single-layer trace (Phase 2.6) for detailed debugging	2026-05-31 11:20:46 +00:00
biondizzle	8256e23aed	Fix mHCContext attribute access (not tuple unpacking) and enable attention diag	2026-05-31 11:10:37 +00:00
biondizzle	72c139a59f	Enable MHC_DIAG for diagnostic run	2026-05-31 11:07:23 +00:00
biondizzle	cd661c2e40	Add attention and Q/KV diagnostics (MHC_DIAG flag)	2026-05-31 11:07:17 +00:00
biondizzle	9584fcbc23	Fix top5_ids variable name in decode logging	2026-05-31 10:54:40 +00:00
biondizzle	a6d56d10ca	Add top-20 logging and thinking token detection in decode loop	2026-05-31 10:49:28 +00:00
biondizzle	d891ae7e96	Fix prompt format: use DeepSeek V4 chat tokens The model was trained with DeepSeek-specific chat tokens: <｜User｜> (128803), <｜Assistant｜> (128804), <|EOT|> (128805) Thinking: ﬁ (128821), ﬂ (128822) Previous manual assembly just concatenated raw text without these tokens, causing the model to not recognize user/assistant boundaries. Format: <BOS><｜User｜>system prompt\n\nuser prompt<｜Assistant｜>	2026-05-31 10:33:41 +00:00
biondizzle	f86742ef8e	Cache layer weights on GPU — eliminates per-token CPU→GPU transfer Previously, each prefill/decode token re-transferred ALL layer weights from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made prefill ~36s/token and caused the test to appear stuck. Now: one-time cache_all_layer_weights() loads all 61 layers to their target GPUs. Prefill should be ~1-2s/token instead of ~36s. Also added flush=True to print statements so progress is visible.	2026-05-31 10:28:25 +00:00
biondizzle	ce3d6069cc	CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post] All three mHC parameter tensors (fn, base, scale) share the same ordering as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)]. Previous code loaded base as [pre(4), post(4), res(16)] and scale as [alpha_pre, alpha_post, alpha_res] — swapping S_res and S_post, and alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to be computed with wrong bias values, allowing the residual to explode. Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums, C_l values) to verify doubly-stochastic constraint is satisfied.	2026-05-31 10:07:14 +00:00
biondizzle	9a43e9aa77	CRITICAL FIX: mHC fn weight row ordering was wrong fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw] in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the residual to explode (|X| 0.8 → 160K across 61 layers). Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post Wrong: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res	2026-05-31 10:02:57 +00:00
biondizzle	0346e479d4	Add system prompt, CLI args, inverse RoPE flag, minimal e2e test - System prompt added via chat template (reasoning model needs instructions) - MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens) - --no-inverse-rope flag to test without inverse RoPE on attn output - --skip-moe flag to debug with shared expert only - --max-tokens and --prompt CLI overrides - minimal_e2e_test(): processes 'The' through full model, checks logits, tracks per-layer residual stream evolution, reports NaN/Inf/spread - INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims, first 448 always un-RoPE'd, relative encoding may be intentional	2026-05-31 09:56:18 +00:00
biondizzle	33004dcbf4	Fix expert weight broadcasting (wt.item() for scalar multiply)	2026-05-31 09:22:27 +00:00
biondizzle	a2ee78b564	Fix RoPE shape bug (interleave needs separate even/odd assembly)	2026-05-31 09:15:59 +00:00
biondizzle	9d96c2fbbf	CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip BF16 cos/sin cache destroys cos²+sin²=1 identity (can be 0.996 in BF16). This causes ~3% error per RoPE→inverse RoPE round-trip, accumulating across 61 layers into garbage output. FP32 cache + FP32 arithmetic gives exact round-trip (diff < 1e-7). Also fixes: MoE expert loop indentation (was only running last expert).	2026-05-31 09:14:59 +00:00
biondizzle	db74a887ab	Add minimal e2e test + fix MoE expert loop bug (indentation)	2026-05-31 09:14:03 +00:00
biondizzle	e195d9d3a7	add SKIP_ROUTED_MOE debug flag, re-enable sinks	2026-05-31 07:02:38 +00:00
biondizzle	4f28673bec	debug: disable sinks in SDPA to check |X| impact	2026-05-31 06:51:58 +00:00
biondizzle	e3db90b56c	switch back to original prompt	2026-05-31 06:40:01 +00:00
biondizzle	d2cf5ccc32	CRITICAL FIX: use SDPA for short sequences (FMHA padding bug) FMHA pads N to next multiple of 128. For N<<128 (like 5 tokens), the 123 padded zero-K entries contribute exp(0)=1 to the softmax denominator, diluting real attention weights by ~128/5 = 25.6x. This caused the model to produce incoherent output for short prompts. Fix: use SDPA for seq_len < 120 (no padding), FMHA for longer sequences where the padding effect is negligible. Also: SDPA path includes attention sinks (paper D5c), FMHA path uses analytic sink correction via LSE.	2026-05-31 06:39:23 +00:00
biondizzle	5f98855141	test with simpler prompt	2026-05-31 06:28:45 +00:00
biondizzle	152af7295a	debug: compare FMHA vs SDPA output at layer 0	2026-05-31 06:16:58 +00:00
biondizzle	59c75ca4e9	fix: cast attn_out back to BF16 after sink correction	2026-05-31 06:07:06 +00:00
biondizzle	e5245ea34e	fix: V tensor must be (B, n_h, hd, N) for FMHA — was transposed wrong	2026-05-31 06:03:13 +00:00
biondizzle	91abf0f921	FMHA + analytic sink bias correction using LSE Instead of SDPA with virtual sink position, use the production FMHA kernel and apply the sink bias as a post-hoc correction on the output. The correction is: O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink)) This simulates the attention sink (paper D5c) without modifying the FMHA kernel. The sink absorbs some attention mass, reducing the normalization constant and scaling down the output.	2026-05-31 05:58:01 +00:00
biondizzle	04dd7545b3	switch to production FMHA for full run	2026-05-31 04:51:16 +00:00
biondizzle	738088cf49	revert: K=V with RoPE + inverse RoPE is the correct DSV4 approach	2026-05-31 04:51:10 +00:00
biondizzle	781ee43521	try separate K (RoPE'd) and V (raw) — no inverse RoPE needed	2026-05-31 04:46:14 +00:00
biondizzle	889521009b	re-enable inverse RoPE (confirmed necessary — without it output is garbage)	2026-05-31 04:45:58 +00:00
biondizzle	92e465ca04	debug: disable inverse RoPE to check impact on output	2026-05-31 04:40:34 +00:00
biondizzle	c69dc51b3b	switch to SDPA with sinks (better residual control)	2026-05-31 04:38:41 +00:00
biondizzle	3ed8f3cc44	switch back to production FMHA kernel (with FP4 LUT fix)	2026-05-31 04:32:01 +00:00
biondizzle	ae79bd8fce	debug: add top-5 logit predictions	2026-05-31 04:25:01 +00:00
biondizzle	aafe2eee12	CRITICAL FIX: FP4 LUT was 4x too large! E2M1 magnitudes are [0, 0.5, 1, 1.5, 2, 3, 4, 6] NOT [0, 2, 3, 4, 6, 8, 12, 24]. The old LUT was 4x the correct values, causing every NVFP4 dequantized weight to be 4x too large. This compounded across 61 layers, causing the residual stream to explode and producing gibberish output. This is the root cause of the residual growth and incoherent generation.	2026-05-31 04:16:13 +00:00
biondizzle	b8c8da91fe	fix: restore RoPE functions that were lost during mHC refactor	2026-05-31 04:10:51 +00:00
biondizzle	3f04a72af4	refactor: use production mHCLayer from dsv4.layers.mhc Replace custom mHCBlock with wrapper around the tested production mHCLayer class. This eliminates any bugs in my custom implementation and uses the same code path that the model was designed for. Weight mapping: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res base[0:4]=S_pre, base[4:8]=S_post, base[8:24]=S_res scale[0]=alpha_pre, scale[1]=alpha_post, scale[2]=alpha_res	2026-05-31 04:06:58 +00:00
biondizzle	b519108cab	fix: restore kv_cache.append that was accidentally removed	2026-05-31 03:56:58 +00:00
biondizzle	22a89b5a45	add attention sinks to SDPA path (paper D5c)	2026-05-31 03:52:59 +00:00
biondizzle	1905f19b8d	fix: define q_input before USE_SDPA branch	2026-05-31 03:45:09 +00:00
biondizzle	cd073ad867	use PyTorch SDPA for correctness (no sink bias in FMHA kernel yet)	2026-05-31 03:42:03 +00:00
biondizzle	171a9e0d10	disable diagnostics for clean production run	2026-05-31 03:32:17 +00:00
biondizzle	3f9b441428	diag: fix n_layers reference in forward_layer, add late-layer diags	2026-05-31 03:28:53 +00:00
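
Commit aafe2eee12 gives the E2M1 magnitudes as [0, 0.5, 1, 1.5, 2, 3, 4, 6]. The sketch below shows how that table follows from the bit layout. It assumes the usual FP4 E2M1 encoding (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit). The names `e2m1_magnitude`, `E2M1_LUT`, and `decode_fp4` are hypothetical and not taken from the repository, and block scale factors are left out.

```python
def e2m1_magnitude(code3: int) -> float:
    """Magnitude of a 3-bit E2M1 code (2 exponent bits, 1 mantissa bit)."""
    exp = (code3 >> 1) & 0b11
    man = code3 & 0b1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the only nonzero value is 0.5.
        return 0.5 * man
    # Normal: (1 + man/2) * 2^(exp - bias), with bias = 1 (assumed).
    return (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# Lookup table indexed by the low 3 bits of the 4-bit code.
E2M1_LUT = [e2m1_magnitude(c) for c in range(8)]


def decode_fp4(nibble: int) -> float:
    """Decode one 4-bit value; bit 3 is the sign, bits 0-2 index the LUT."""
    sign = -1.0 if (nibble >> 3) & 1 else 1.0
    return sign * E2M1_LUT[nibble & 0b111]


print(E2M1_LUT)
```

A loader can compare its own table against `E2M1_LUT` at startup, which is a cheap guard against the kind of mis-scaled table that commit describes.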
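
Commits 0f951a0b1a and 91abf0f921 describe two ways to apply an attention sink: concatenating the sink as an extra logit column that is dropped after the softmax, and rescaling the sink-free output by exp(lse) / (exp(lse) + exp(sink)). The sketch below checks that the two agree for a single query. It uses pure Python and scalar values per position for brevity; the function names are hypothetical, and the real code operates on batched tensors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def attend_with_sink_column(scores, values, sink):
    """Append the sink logit, softmax, drop the sink column, then apply V."""
    probs = softmax(scores + [sink])[:-1]  # the sink column never touches V
    return sum(p * v for p, v in zip(probs, values))


def attend_with_lse_correction(scores, values, sink):
    """Plain attention output, rescaled after the fact using the log-sum-exp."""
    o_raw = sum(p * v for p, v in zip(softmax(scores), values))
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    # exp(lse) / (exp(lse) + exp(sink)), written to avoid overflow.
    scale = 1.0 / (1.0 + math.exp(sink - lse))
    return o_raw * scale


scores = [0.3, -1.2, 2.0, 0.5]
values = [1.0, -2.0, 0.5, 4.0]
sink = 0.7
a = attend_with_sink_column(scores, values, sink)
b = attend_with_lse_correction(scores, values, sink)
print(abs(a - b) < 1e-12)
```

The second form only needs the kernel to return its log-sum-exp alongside the output, which is why it can be applied without changing the attention kernel itself.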
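
Commit d2cf5ccc32 attributes the short-prompt failure to zero-K padding entries, each adding exp(0) = 1 to the softmax denominator. The sketch below reproduces that arithmetic under stated assumptions: padded scores are exactly 0, padding rounds up to a multiple of 128, and a length that is already a multiple is left unpadded. The helper names are hypothetical. The 25.6x figure corresponds to the case where the real logits are also near 0; larger real logits reduce the dilution.

```python
import math


def padded_length(n: int, block: int = 128) -> int:
    """Round n up to the next multiple of block (exact multiples unchanged)."""
    return ((n + block - 1) // block) * block


def real_mass(scores, block: int = 128) -> float:
    """Fraction of softmax mass left on real positions after zero-score padding."""
    n_pad = padded_length(len(scores), block) - len(scores)
    real = sum(math.exp(s) for s in scores)
    pad = n_pad * math.exp(0.0)  # each padded entry contributes exactly 1
    return real / (real + pad)


# Five real tokens with logits near zero, padded to 128 positions.
mass = real_mass([0.0] * 5)
print(mass, 1.0 / mass)
```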


86 Commits