Commit Graph

1930 Commits

Author SHA1 Message Date
ca661d32e8 Empty system prompt for testing (was causing model to regurgitate AI assistant tokens) 2026-05-31 19:03:55 +00:00
b09b2cf511 Fix MoE routing: hash layers 0-2 (tid2eid), e_score_correction_bias for layers 3+
- Layers 0-2 use hash routing (tid2eid lookup, uniform weights)
- Layers 3+ use noaux_tc (sqrt(softplus) + e_score_correction_bias for selection only)
- Fixed e_bias key name: e_score_correction_bias (not e_bias)
- Hash routing detection: check tid2eid present AND e_score_correction_bias absent
2026-05-31 18:52:38 +00:00
7d9e70c5d5 Fix remaining mHC API references: layer_compare.py, layer.py comment 2026-05-31 18:38:34 +00:00
7b123d159f CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax
Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection):
1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb]
   - Was applying Sinkhorn to post values and 2*sigmoid to comb values
   - This caused residual to grow unbounded (no doubly-stochastic constraint)
2. comb (B_l) must be TRANSPOSED in post_block
   - HF: comb.transpose(-1,-2) @ hidden_streams
   - Was using B_l @ X_l without transpose
3. Sinkhorn must start from softmax(logits) + eps, not exp(logits)
   - HF: softmax → col norm → (iters-1) alternating
   - Was using exp → alternating (different convergence behavior)
4. Missing hc_eps on pre (A_l)
   - HF: sigmoid(...) + hc_eps
   - Was missing the eps guard
5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout
   - Matches checkpoint naming and HF model
6. Fixed fallback mHC initialization to use new API
2026-05-31 18:38:12 +00:00
f6c02f808f Add layer-by-layer comparison test for debugging 2026-05-31 12:48:43 +00:00
6ad577bd18 Add HuggingFace reference comparison test 2026-05-31 12:05:19 +00:00
581c4170f9 Fix sink logits shape: (n_h, T, 1) for concatenation with (n_h, T, seq_len) 2026-05-31 11:57:23 +00:00
0f951a0b1a Fix attention sinks: logit bias (HuggingFace reference), not dummy KV
The HuggingFace reference treats attention sinks as a logit bias:
1. Compute raw Q*K scores
2. Concatenate sinks as a logit column
3. Softmax the combined logits
4. DROP the sink column (don't multiply by V)
5. Multiply by V

Our old code added sinks as a dummy zero-KV entry, which diluted
attention weights by adding an extra V=0 position to the softmax.
2026-05-31 11:53:43 +00:00
daed594902 CRITICAL FIX: Add missing q_b_norm (unweighted RMSNorm after q_b_proj)
The HuggingFace reference (DeepseekV4ForCausalLM) applies an unweighted
RMSNorm after q_b_proj, normalizing Q before attention. Without it, Q
magnitudes are too large, causing attention scores to collapse to uniform
(entropy ~3.2 with 24 positions) and the model to produce garbage.

q_b_norm has no learnable parameters — just q / RMS(q).
This explains the nearly-uniform attention weights we've been seeing.
2026-05-31 11:47:16 +00:00
dd50c355a6 Fix MHC_DIAG null check when SKIP_MHC is enabled 2026-05-31 11:37:32 +00:00
631e6ea3e4 Add --skip-mhc flag for simple residual diagnostic
When enabled, bypasses mHC pre/post blocks and uses direct residual
connections with 0.1 scaling. This helps isolate whether the mHC
implementation is causing the garbage output.
2026-05-31 11:33:41 +00:00
d201a9334e CRITICAL FIX: Add YaRN RoPE scaling (factor=16)
The DSV4 Pro model uses rope_type='yarn' with factor=16. Our
build_rope_cache was using standard RoPE with theta=10000, completely
ignoring YaRN scaling. This produced wrong cos/sin values for all
positions, causing incorrect attention scores and garbage output.

YaRN modifies the RoPE frequencies:
- High-frequency components: unchanged
- Low-frequency components: scaled by 1/factor
- Medium: smooth interpolation

Config: factor=16, beta_fast=32, beta_slow=1, orig_max_pos=65536
2026-05-31 11:25:52 +00:00
88719f39b4 Add single-layer trace (Phase 2.6) for detailed debugging 2026-05-31 11:20:46 +00:00
8256e23aed Fix mHCContext attribute access (not tuple unpacking) and enable attention diag 2026-05-31 11:10:37 +00:00
72c139a59f Enable MHC_DIAG for diagnostic run 2026-05-31 11:07:23 +00:00
cd661c2e40 Add attention and Q/KV diagnostics (MHC_DIAG flag) 2026-05-31 11:07:17 +00:00
9584fcbc23 Fix top5_ids variable name in decode logging 2026-05-31 10:54:40 +00:00
a6d56d10ca Add top-20 logging and thinking token detection in decode loop 2026-05-31 10:49:28 +00:00
d891ae7e96 Fix prompt format: use DeepSeek V4 chat tokens
The model was trained with DeepSeek-specific chat tokens:
  <|User|> (128803), <|Assistant|> (128804), <|EOT|> (128805)
  Thinking: fi (128821), fl (128822)

Previous manual assembly just concatenated raw text without these tokens,
causing the model to not recognize user/assistant boundaries.

Format: <BOS><|User|>system prompt\n\nuser prompt<|Assistant|>
2026-05-31 10:33:41 +00:00
f86742ef8e Cache layer weights on GPU — eliminates per-token CPU→GPU transfer
Previously, each prefill/decode token re-transferred ALL layer weights
from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made
prefill ~36s/token and caused the test to appear stuck.

Now: one-time cache_all_layer_weights() loads all 61 layers to their
target GPUs. Prefill should be ~1-2s/token instead of ~36s.

Also added flush=True to print statements so progress is visible.
2026-05-31 10:28:25 +00:00
ce3d6069cc CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post]
All three mHC parameter tensors (fn, base, scale) share the same ordering
as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)].

Previous code loaded base as [pre(4), post(4), res(16)] and scale as
[alpha_pre, alpha_post, alpha_res] — swapping S_res and S_post, and
alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to
be computed with wrong bias values, allowing the residual to explode.

Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums,
C_l values) to verify doubly-stochastic constraint is satisfied.
2026-05-31 10:07:14 +00:00
9a43e9aa77 CRITICAL FIX: mHC fn weight row ordering was wrong
fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw]
in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which
shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the
Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the
residual to explode (|X| 0.8 → 160K across 61 layers).

Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post
Wrong:   fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
2026-05-31 10:02:57 +00:00
0346e479d4 Add system prompt, CLI args, inverse RoPE flag, minimal e2e test
- System prompt added via chat template (reasoning model needs instructions)
- MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens)
- --no-inverse-rope flag to test without inverse RoPE on attn output
- --skip-moe flag to debug with shared expert only
- --max-tokens and --prompt CLI overrides
- minimal_e2e_test(): processes 'The' through full model, checks logits,
  tracks per-layer residual stream evolution, reports NaN/Inf/spread
- INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims,
  first 448 always un-RoPE'd, relative encoding may be intentional
2026-05-31 09:56:18 +00:00
429fc3db40 Fix expert weight indexing for 1D tensor 2026-05-31 09:23:10 +00:00
33004dcbf4 Fix expert weight broadcasting (wt.item() for scalar multiply) 2026-05-31 09:22:27 +00:00
1434b35971 Add residual diagnostic test — per-layer magnitude tracking 2026-05-31 09:21:41 +00:00
1c18c16c68 Fix production rope.py: FP32 arithmetic for forward_rope_partial + inverse_rope_bf16 2026-05-31 09:17:36 +00:00
970869d017 Fix mHCBlock import + relax RoPE round-trip threshold (BF16 noise expected) 2026-05-31 09:17:07 +00:00
a2ee78b564 Fix RoPE shape bug (interleave needs separate even/odd assembly) 2026-05-31 09:15:59 +00:00
9d96c2fbbf CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip
BF16 cos/sin cache destroys cos²+sin²=1 identity (can be 0.996 in BF16).
This causes ~3% error per RoPE→inverse RoPE round-trip, accumulating
across 61 layers into garbage output. FP32 cache + FP32 arithmetic
gives exact round-trip (diff < 1e-7).

Also fixes: MoE expert loop indentation (was only running last expert).
2026-05-31 09:14:59 +00:00
db74a887ab Add minimal e2e test + fix MoE expert loop bug (indentation) 2026-05-31 09:14:03 +00:00
e195d9d3a7 add SKIP_ROUTED_MOE debug flag, re-enable sinks 2026-05-31 07:02:38 +00:00
4f28673bec debug: disable sinks in SDPA to check |X| impact 2026-05-31 06:51:58 +00:00
e3db90b56c switch back to original prompt 2026-05-31 06:40:01 +00:00
d2cf5ccc32 CRITICAL FIX: use SDPA for short sequences (FMHA padding bug)
FMHA pads N to next multiple of 128. For N<<128 (like 5 tokens),
the 123 padded zero-K entries contribute exp(0)=1 to the softmax
denominator, diluting real attention weights by ~128/5 = 25.6x.

This caused the model to produce incoherent output for short prompts.

Fix: use SDPA for seq_len < 120 (no padding), FMHA for longer
sequences where the padding effect is negligible.

Also: SDPA path includes attention sinks (paper D5c), FMHA path
uses analytic sink correction via LSE.
2026-05-31 06:39:23 +00:00
5f98855141 test with simpler prompt 2026-05-31 06:28:45 +00:00
152af7295a debug: compare FMHA vs SDPA output at layer 0 2026-05-31 06:16:58 +00:00
59c75ca4e9 fix: cast attn_out back to BF16 after sink correction 2026-05-31 06:07:06 +00:00
e5245ea34e fix: V tensor must be (B, n_h, hd, N) for FMHA — was transposed wrong 2026-05-31 06:03:13 +00:00
91abf0f921 FMHA + analytic sink bias correction using LSE
Instead of SDPA with virtual sink position, use the production FMHA
kernel and apply the sink bias as a post-hoc correction on the output.

The correction is: O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink))

This simulates the attention sink (paper D5c) without modifying the
FMHA kernel. The sink absorbs some attention mass, reducing the
normalization constant and scaling down the output.
2026-05-31 05:58:01 +00:00
fac269c938 fix verify_attention: proper multi-head SDPA + GQA 2026-05-31 05:55:10 +00:00
2333fc8b4b fix verify_attention.py: proper nvfp4_linear calls 2026-05-31 05:53:49 +00:00
c09f68c867 add verify_attention.py: single-layer attention component test 2026-05-31 05:51:36 +00:00
04dd7545b3 switch to production FMHA for full run 2026-05-31 04:51:16 +00:00
738088cf49 revert: K=V with RoPE + inverse RoPE is the correct DSV4 approach 2026-05-31 04:51:10 +00:00
781ee43521 try separate K (RoPE'd) and V (raw) — no inverse RoPE needed 2026-05-31 04:46:14 +00:00
889521009b re-enable inverse RoPE (confirmed necessary — without it output is garbage) 2026-05-31 04:45:58 +00:00
92e465ca04 debug: disable inverse RoPE to check impact on output 2026-05-31 04:40:34 +00:00
c69dc51b3b switch to SDPA with sinks (better residual control) 2026-05-31 04:38:41 +00:00
3ed8f3cc44 switch back to production FMHA kernel (with FP4 LUT fix) 2026-05-31 04:32:01 +00:00