nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	ca661d32e8	Empty system prompt for testing (was causing model to regurgitate AI assistant tokens)	2026-05-31 19:03:55 +00:00
biondizzle	b09b2cf511	Fix MoE routing: hash layers 0-2 (tid2eid), e_score_correction_bias for layers 3+ - Layers 0-2 use hash routing (tid2eid lookup, uniform weights) - Layers 3+ use noaux_tc (sqrt(softplus) + e_score_correction_bias for selection only) - Fixed e_bias key name: e_score_correction_bias (not e_bias) - Hash routing detection: check tid2eid present AND e_score_correction_bias absent	2026-05-31 18:52:38 +00:00
biondizzle	7d9e70c5d5	Fix remaining mHC API references: layer_compare.py, layer.py comment	2026-05-31 18:38:34 +00:00
biondizzle	7b123d159f	CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection): 1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb] - Was applying Sinkhorn to post values and 2*sigmoid to comb values - This caused residual to grow unbounded (no doubly-stochastic constraint) 2. comb (B_l) must be TRANSPOSED in post_block - HF: comb.transpose(-1,-2) @ hidden_streams - Was using B_l @ X_l without transpose 3. Sinkhorn must start from softmax(logits) + eps, not exp(logits) - HF: softmax → col norm → (iters-1) alternating - Was using exp → alternating (different convergence behavior) 4. Missing hc_eps on pre (A_l) - HF: sigmoid(...) + hc_eps - Was missing the eps guard 5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout - Matches checkpoint naming and HF model 6. Fixed fallback mHC initialization to use new API	2026-05-31 18:38:12 +00:00
biondizzle	f6c02f808f	Add layer-by-layer comparison test for debugging	2026-05-31 12:48:43 +00:00
biondizzle	6ad577bd18	Add HuggingFace reference comparison test	2026-05-31 12:05:19 +00:00
biondizzle	581c4170f9	Fix sink logits shape: (n_h, T, 1) for concatenation with (n_h, T, seq_len)	2026-05-31 11:57:23 +00:00
biondizzle	0f951a0b1a	Fix attention sinks: logit bias (HuggingFace reference), not dummy KV The HuggingFace reference treats attention sinks as a logit bias: 1. Compute raw Q*K scores 2. Concatenate sinks as a logit column 3. Softmax the combined logits 4. DROP the sink column (don't multiply by V) 5. Multiply by V Our old code added sinks as a dummy zero-KV entry, which diluted attention weights by adding an extra V=0 position to the softmax.	2026-05-31 11:53:43 +00:00
biondizzle	daed594902	CRITICAL FIX: Add missing q_b_norm (unweighted RMSNorm after q_b_proj) The HuggingFace reference (DeepseekV4ForCausalLM) applies an unweighted RMSNorm after q_b_proj, normalizing Q before attention. Without it, Q magnitudes are too large, causing attention scores to collapse to uniform (entropy ~3.2 with 24 positions) and the model to produce garbage. q_b_norm has no learnable parameters — just q / RMS(q). This explains the nearly-uniform attention weights we've been seeing.	2026-05-31 11:47:16 +00:00
biondizzle	dd50c355a6	Fix MHC_DIAG null check when SKIP_MHC is enabled	2026-05-31 11:37:32 +00:00
biondizzle	631e6ea3e4	Add --skip-mhc flag for simple residual diagnostic When enabled, bypasses mHC pre/post blocks and uses direct residual connections with 0.1 scaling. This helps isolate whether the mHC implementation is causing the garbage output.	2026-05-31 11:33:41 +00:00
biondizzle	d201a9334e	CRITICAL FIX: Add YaRN RoPE scaling (factor=16) The DSV4 Pro model uses rope_type='yarn' with factor=16. Our build_rope_cache was using standard RoPE with theta=10000, completely ignoring YaRN scaling. This produced wrong cos/sin values for all positions, causing incorrect attention scores and garbage output. YaRN modifies the RoPE frequencies: - High-frequency components: unchanged - Low-frequency components: scaled by 1/factor - Medium: smooth interpolation Config: factor=16, beta_fast=32, beta_slow=1, orig_max_pos=65536	2026-05-31 11:25:52 +00:00
biondizzle	88719f39b4	Add single-layer trace (Phase 2.6) for detailed debugging	2026-05-31 11:20:46 +00:00
biondizzle	8256e23aed	Fix mHCContext attribute access (not tuple unpacking) and enable attention diag	2026-05-31 11:10:37 +00:00
biondizzle	72c139a59f	Enable MHC_DIAG for diagnostic run	2026-05-31 11:07:23 +00:00
biondizzle	cd661c2e40	Add attention and Q/KV diagnostics (MHC_DIAG flag)	2026-05-31 11:07:17 +00:00
biondizzle	9584fcbc23	Fix top5_ids variable name in decode logging	2026-05-31 10:54:40 +00:00
biondizzle	a6d56d10ca	Add top-20 logging and thinking token detection in decode loop	2026-05-31 10:49:28 +00:00
biondizzle	d891ae7e96	Fix prompt format: use DeepSeek V4 chat tokens The model was trained with DeepSeek-specific chat tokens: <｜User｜> (128803), <｜Assistant｜> (128804), <|EOT|> (128805) Thinking: ﬁ (128821), ﬂ (128822) Previous manual assembly just concatenated raw text without these tokens, causing the model to not recognize user/assistant boundaries. Format: <BOS><｜User｜>system prompt\n\nuser prompt<｜Assistant｜>	2026-05-31 10:33:41 +00:00
biondizzle	f86742ef8e	Cache layer weights on GPU — eliminates per-token CPU→GPU transfer Previously, each prefill/decode token re-transferred ALL layer weights from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made prefill ~36s/token and caused the test to appear stuck. Now: one-time cache_all_layer_weights() loads all 61 layers to their target GPUs. Prefill should be ~1-2s/token instead of ~36s. Also added flush=True to print statements so progress is visible.	2026-05-31 10:28:25 +00:00
biondizzle	ce3d6069cc	CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post] All three mHC parameter tensors (fn, base, scale) share the same ordering as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)]. Previous code loaded base as [pre(4), post(4), res(16)] and scale as [alpha_pre, alpha_post, alpha_res] — swapping S_res and S_post, and alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to be computed with wrong bias values, allowing the residual to explode. Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums, C_l values) to verify doubly-stochastic constraint is satisfied.	2026-05-31 10:07:14 +00:00
biondizzle	9a43e9aa77	CRITICAL FIX: mHC fn weight row ordering was wrong fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw] in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the residual to explode (|X| 0.8 → 160K across 61 layers). Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post Wrong: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res	2026-05-31 10:02:57 +00:00
biondizzle	0346e479d4	Add system prompt, CLI args, inverse RoPE flag, minimal e2e test - System prompt added via chat template (reasoning model needs instructions) - MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens) - --no-inverse-rope flag to test without inverse RoPE on attn output - --skip-moe flag to debug with shared expert only - --max-tokens and --prompt CLI overrides - minimal_e2e_test(): processes 'The' through full model, checks logits, tracks per-layer residual stream evolution, reports NaN/Inf/spread - INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims, first 448 always un-RoPE'd, relative encoding may be intentional	2026-05-31 09:56:18 +00:00
biondizzle	429fc3db40	Fix expert weight indexing for 1D tensor	2026-05-31 09:23:10 +00:00
biondizzle	33004dcbf4	Fix expert weight broadcasting (wt.item() for scalar multiply)	2026-05-31 09:22:27 +00:00
biondizzle	1434b35971	Add residual diagnostic test — per-layer magnitude tracking	2026-05-31 09:21:41 +00:00
biondizzle	1c18c16c68	Fix production rope.py: FP32 arithmetic for forward_rope_partial + inverse_rope_bf16	2026-05-31 09:17:36 +00:00
biondizzle	970869d017	Fix mHCBlock import + relax RoPE round-trip threshold (BF16 noise expected)	2026-05-31 09:17:07 +00:00
biondizzle	a2ee78b564	Fix RoPE shape bug (interleave needs separate even/odd assembly)	2026-05-31 09:15:59 +00:00
biondizzle	9d96c2fbbf	CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip BF16 cos/sin cache destroys cos²+sin²=1 identity (can be 0.996 in BF16). This causes ~3% error per RoPE→inverse RoPE round-trip, accumulating across 61 layers into garbage output. FP32 cache + FP32 arithmetic gives exact round-trip (diff < 1e-7). Also fixes: MoE expert loop indentation (was only running last expert).	2026-05-31 09:14:59 +00:00
biondizzle	db74a887ab	Add minimal e2e test + fix MoE expert loop bug (indentation)	2026-05-31 09:14:03 +00:00
biondizzle	e195d9d3a7	add SKIP_ROUTED_MOE debug flag, re-enable sinks	2026-05-31 07:02:38 +00:00
biondizzle	4f28673bec	debug: disable sinks in SDPA to check |X| impact	2026-05-31 06:51:58 +00:00
biondizzle	e3db90b56c	switch back to original prompt	2026-05-31 06:40:01 +00:00
biondizzle	d2cf5ccc32	CRITICAL FIX: use SDPA for short sequences (FMHA padding bug) FMHA pads N to next multiple of 128. For N<<128 (like 5 tokens), the 123 padded zero-K entries contribute exp(0)=1 to the softmax denominator, diluting real attention weights by ~128/5 = 25.6x. This caused the model to produce incoherent output for short prompts. Fix: use SDPA for seq_len < 120 (no padding), FMHA for longer sequences where the padding effect is negligible. Also: SDPA path includes attention sinks (paper D5c), FMHA path uses analytic sink correction via LSE.	2026-05-31 06:39:23 +00:00
biondizzle	5f98855141	test with simpler prompt	2026-05-31 06:28:45 +00:00
biondizzle	152af7295a	debug: compare FMHA vs SDPA output at layer 0	2026-05-31 06:16:58 +00:00
biondizzle	59c75ca4e9	fix: cast attn_out back to BF16 after sink correction	2026-05-31 06:07:06 +00:00
biondizzle	e5245ea34e	fix: V tensor must be (B, n_h, hd, N) for FMHA — was transposed wrong	2026-05-31 06:03:13 +00:00
biondizzle	91abf0f921	FMHA + analytic sink bias correction using LSE Instead of SDPA with virtual sink position, use the production FMHA kernel and apply the sink bias as a post-hoc correction on the output. The correction is: O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink)) This simulates the attention sink (paper D5c) without modifying the FMHA kernel. The sink absorbs some attention mass, reducing the normalization constant and scaling down the output.	2026-05-31 05:58:01 +00:00
biondizzle	fac269c938	fix verify_attention: proper multi-head SDPA + GQA	2026-05-31 05:55:10 +00:00
biondizzle	2333fc8b4b	fix verify_attention.py: proper nvfp4_linear calls	2026-05-31 05:53:49 +00:00
biondizzle	c09f68c867	add verify_attention.py: single-layer attention component test	2026-05-31 05:51:36 +00:00
biondizzle	04dd7545b3	switch to production FMHA for full run	2026-05-31 04:51:16 +00:00
biondizzle	738088cf49	revert: K=V with RoPE + inverse RoPE is the correct DSV4 approach	2026-05-31 04:51:10 +00:00
biondizzle	781ee43521	try separate K (RoPE'd) and V (raw) — no inverse RoPE needed	2026-05-31 04:46:14 +00:00
biondizzle	889521009b	re-enable inverse RoPE (confirmed necessary — without it output is garbage)	2026-05-31 04:45:58 +00:00
biondizzle	92e465ca04	debug: disable inverse RoPE to check impact on output	2026-05-31 04:40:34 +00:00
biondizzle	c69dc51b3b	switch to SDPA with sinks (better residual control)	2026-05-31 04:38:41 +00:00
biondizzle	3ed8f3cc44	switch back to production FMHA kernel (with FP4 LUT fix)	2026-05-31 04:32:01 +00:00
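
Commits 0f951a0b1a and 91abf0f921 describe two treatments of attention sinks: appending the sink as an extra logit column that is dropped after the softmax, and rescaling the sink-free output by exp(lse) / (exp(lse) + exp(sink)). The sketch below checks, for a single query, that the two give the same result. It is a pure-Python illustration with made-up scores and scalar values; the function names are hypothetical and are not taken from this repository.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attend_with_sink_column(scores, values, sink):
    """Sink as a logit bias: concatenate, softmax, drop the sink column, multiply by V."""
    probs = softmax(scores + [sink])
    kept = probs[:-1]  # the sink absorbs probability mass but has no V row
    return sum(p * v for p, v in zip(kept, values))

def attend_with_lse_correction(scores, values, sink):
    """Sink-free attention, then a post-hoc rescale using the log-sum-exp of the scores."""
    raw = sum(p * v for p, v in zip(softmax(scores), values))
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    # Algebraically equal to exp(lse) / (exp(lse) + exp(sink)), written to avoid overflow.
    scale = 1.0 / (1.0 + math.exp(sink - lse))
    return raw * scale

# Illustrative numbers only (assumptions, not model data).
scores = [1.2, -0.3, 0.7, 2.1]
values = [0.5, -1.0, 2.0, 0.25]
sink = 0.9

a = attend_with_sink_column(scores, values, sink)
b = attend_with_lse_correction(scores, values, sink)
print(a, b)
```

Both paths shrink the output by the same factor, because the sink only enlarges the softmax denominator. This is why the correction can be applied after a kernel that returns the LSE, without changing the kernel itself.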

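Commit 7b123d159f states that the Sinkhorn step must start from softmax(logits) + eps, then apply a column normalization followed by (iters-1) alternating normalizations. A minimal pure-Python sketch of that ordering on a 4x4 matrix follows. The iteration count, the eps value, the row-then-column order inside the loop, and the function name are all assumptions made for illustration; the repository and HuggingFace code may differ in those details.

```python
import math

def sinkhorn_from_softmax(logits, iters=20, eps=1e-6):
    """Approximately doubly-stochastic matrix: softmax rows + eps, col norm, then alternate."""
    # Row-wise softmax, plus a small eps so every entry stays strictly positive.
    mat = []
    for row in logits:
        m = max(row)
        es = [math.exp(x - m) for x in row]
        total = sum(es)
        mat.append([e / total + eps for e in es])

    n_rows, n_cols = len(mat), len(mat[0])

    def col_norm(mt):
        sums = [sum(mt[i][j] for i in range(n_rows)) for j in range(n_cols)]
        return [[mt[i][j] / sums[j] for j in range(n_cols)] for i in range(n_rows)]

    def row_norm(mt):
        return [[x / sum(row) for x in row] for row in mt]

    mat = col_norm(mat)
    for _ in range(iters - 1):  # assumed order inside each iteration: rows, then columns
        mat = row_norm(mat)
        mat = col_norm(mat)
    return mat

# Made-up logits with a small spread (an assumption, not checkpoint data).
logits = [
    [0.3, -0.2, 0.1, 0.4],
    [-0.5, 0.2, 0.0, 0.3],
    [0.1, 0.1, -0.4, 0.2],
    [0.4, -0.3, 0.5, -0.1],
]
comb = sinkhorn_from_softmax(logits)
row_sums = [sum(row) for row in comb]
col_sums = [sum(comb[i][j] for i in range(4)) for j in range(4)]
print(row_sums, col_sums)
```

With both row and column sums close to 1, repeated multiplication by such a matrix cannot amplify the residual streams, which is the constraint the commit says was lost when the parameter ordering was wrong.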

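Commit 9d96c2fbbf attributes round-trip error to a BF16 cos/sin cache breaking the identity cos² + sin² = 1. The sketch below emulates BF16 rounding of the cached values in pure Python and measures how far the identity drifts. It only models the storage precision of the cache (the arithmetic itself runs in double precision), and `to_bf16` is a hypothetical helper written for this example, not code from `rope.py`.

```python
import math
import struct

def to_bf16(x):
    """Round a float to the nearest bfloat16 value (round-to-nearest-even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding) & 0xFFFF0000  # keep sign, exponent, and top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def round_trip_scale(theta):
    """Rotate (1, 0) by theta and back using BF16-rounded cos/sin; return the x component."""
    c, s = to_bf16(math.cos(theta)), to_bf16(math.sin(theta))
    x, y = 1.0, 0.0
    xr, yr = x * c - y * s, x * s + y * c  # forward rotation
    xb = xr * c + yr * s                   # inverse rotation, x component
    return xb                              # equals c*c + s*s for the input (1, 0)

# Scan a range of angles and record the largest deviation from 1.
worst = 0.0
for k in range(1, 2000):
    worst = max(worst, abs(round_trip_scale(k * 0.01) - 1.0))
print(worst)
```

A forward rotation followed by its inverse scales the vector by exactly c² + s², so any deviation in the cached values shows up directly as a gain error on every round trip.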
1930 Commits