Commit Graph

86 Commits

Author SHA1 Message Date
ca661d32e8 Empty system prompt for testing (was causing model to regurgitate AI assistant tokens) 2026-05-31 19:03:55 +00:00
b09b2cf511 Fix MoE routing: hash layers 0-2 (tid2eid), e_score_correction_bias for layers 3+
- Layers 0-2 use hash routing (tid2eid lookup, uniform weights)
- Layers 3+ use noaux_tc (sqrt(softplus) + e_score_correction_bias for selection only)
- Fixed e_bias key name: e_score_correction_bias (not e_bias)
- Hash routing detection: check tid2eid present AND e_score_correction_bias absent
2026-05-31 18:52:38 +00:00
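The two routing modes described in this commit can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names are hypothetical, "uniform weights" is assumed to mean 1/k per selected expert, and renormalizing the selected scores is also an assumption.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def uses_hash_routing(layer_keys):
    # Detection rule from the commit: tid2eid present AND bias absent.
    # Assumption: keys are given as bare names without a layer prefix.
    return "tid2eid" in layer_keys and "e_score_correction_bias" not in layer_keys

def route_hash(token_id, tid2eid):
    # Hash routing: experts come from a per-token lookup table.
    # Assumption: "uniform weights" means 1/k per selected expert.
    experts = tid2eid[token_id]
    return [(e, 1.0 / len(experts)) for e in experts]

def route_noaux_tc(logits, bias, top_k):
    # Scores are sqrt(softplus(logit)).
    scores = [math.sqrt(softplus(z)) for z in logits]
    # The correction bias influences WHICH experts are picked...
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    # ...but the mixing weights use the unbiased scores (renormalized here,
    # which is an assumption of this sketch).
    total = sum(scores[e] for e in chosen)
    return [(e, scores[e] / total) for e in chosen]
```

A large bias can pull an expert into the top-k set without inflating its mixing weight, which is the "selection only" behavior the commit describes.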
7b123d159f CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax
Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection):
1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb]
   - Was applying Sinkhorn to post values and 2*sigmoid to comb values
   - This caused residual to grow unbounded (no doubly-stochastic constraint)
2. comb (B_l) must be TRANSPOSED in post_block
   - HF: comb.transpose(-1,-2) @ hidden_streams
   - Was using B_l @ X_l without transpose
3. Sinkhorn must start from softmax(logits) + eps, not exp(logits)
   - HF: softmax → col norm → (iters-1) alternating
   - Was using exp → alternating (different convergence behavior)
4. Missing hc_eps on pre (A_l)
   - HF: sigmoid(...) + hc_eps
   - Was missing the eps guard
5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout
   - Matches checkpoint naming and HF model
6. Fixed fallback mHC initialization to use new API
2026-05-31 18:38:12 +00:00
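The Sinkhorn start-up order from item 3 (softmax, then a column normalization, then iters-1 alternating passes) can be sketched in plain Python as below. The function name, the default `iters` and `eps` values, and the exact placement of `eps` are assumptions for illustration; the HuggingFace reference may differ in those details.

```python
import math

def sinkhorn_from_softmax(logits, iters=20, eps=1e-6):
    # Step 1: row-wise softmax plus eps (instead of a raw exp(logits)).
    m = []
    for row in logits:
        mx = max(row)
        ex = [math.exp(v - mx) for v in row]
        total = sum(ex)
        m.append([e / total + eps for e in ex])

    def col_norm(mat):
        sums = [sum(r[j] for r in mat) for j in range(len(mat[0]))]
        return [[r[j] / sums[j] for j in range(len(r))] for r in mat]

    def row_norm(mat):
        return [[v / sum(r) for v in r] for r in mat]

    # Step 2: one column normalization.
    m = col_norm(m)
    # Step 3: (iters - 1) alternating row/column passes.
    for _ in range(iters - 1):
        m = col_norm(row_norm(m))
    return m
```

With enough iterations the result is approximately doubly stochastic (all row and column sums close to 1), which is the constraint that keeps the residual from growing.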
581c4170f9 Fix sink logits shape: (n_h, T, 1) for concatenation with (n_h, T, seq_len) 2026-05-31 11:57:23 +00:00
0f951a0b1a Fix attention sinks: logit bias (HuggingFace reference), not dummy KV
The HuggingFace reference treats attention sinks as a logit bias:
1. Compute raw Q*K scores
2. Concatenate sinks as a logit column
3. Softmax the combined logits
4. DROP the sink column (it has no V row)
5. Multiply the remaining weights by V

Our old code added sinks as a dummy zero-KV entry, which diluted
attention weights by adding an extra V=0 position to the softmax.
2026-05-31 11:53:43 +00:00
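The five steps above can be sketched for a single query and a single head as follows. This is a plain-Python illustration with hypothetical names, not the repository's tensor code.

```python
import math

def attend_with_sink(scores, values, sink_logit):
    # scores: raw Q*K logits for one query, one entry per key position.
    # values: one list of floats per key position.
    # Step 2: append the sink as an extra logit column.
    logits = list(scores) + [sink_logit]
    # Step 3: softmax over real positions plus the sink.
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    total = sum(exps)
    # Step 4: drop the sink column; the kept weights now sum to less than 1.
    probs = [e / total for e in exps[:-1]]
    # Step 5: weight the values with the remaining probabilities.
    dim = len(values[0])
    out = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
    return out, probs
```

For example, with two keys at logit 0 and a sink at logit 0, each real key receives weight 1/3 and the sink absorbs the remaining 1/3.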
daed594902 CRITICAL FIX: Add missing q_b_norm (unweighted RMSNorm after q_b_proj)
The HuggingFace reference (DeepseekV4ForCausalLM) applies an unweighted
RMSNorm after q_b_proj, normalizing Q before attention. Without it, Q
magnitudes are too large, causing attention scores to collapse to uniform
(entropy ~3.2 nats with 24 positions, where the uniform maximum is
ln 24 ≈ 3.18) and the model to produce garbage.

q_b_norm has no learnable parameters — just q / RMS(q).
This explains the nearly-uniform attention weights we've been seeing.
2026-05-31 11:47:16 +00:00
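A parameter-free RMSNorm of the kind described here can be sketched in a few lines. The function name and the `eps` value are assumptions; the commit only states that the operation is q / RMS(q).

```python
import math

def rms_norm_unweighted(q, eps=1e-6):
    # RMS(q) = sqrt(mean(q_i^2)); eps guards against division by zero
    # (its value here is an assumption, not taken from the model config).
    rms = math.sqrt(sum(x * x for x in q) / len(q) + eps)
    # No learnable gain: the output is simply q scaled to unit RMS.
    return [x / rms for x in q]
```

Whatever the input magnitude, the output has RMS close to 1, so the Q*K logits no longer scale with the raw size of the q_b_proj output.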
dd50c355a6 Fix MHC_DIAG null check when SKIP_MHC is enabled 2026-05-31 11:37:32 +00:00
631e6ea3e4 Add --skip-mhc flag for simple residual diagnostic
When enabled, bypasses mHC pre/post blocks and uses direct residual
connections with 0.1 scaling. This helps isolate whether the mHC
implementation is causing the garbage output.
2026-05-31 11:33:41 +00:00
d201a9334e CRITICAL FIX: Add YaRN RoPE scaling (factor=16)
The DSV4 Pro model uses rope_type='yarn' with factor=16. Our
build_rope_cache was using standard RoPE with theta=10000, completely
ignoring YaRN scaling. This produced wrong cos/sin values for all
positions, causing incorrect attention scores and garbage output.

YaRN modifies the RoPE frequencies:
- High-frequency components: unchanged
- Low-frequency components: scaled by 1/factor
- Medium: smooth interpolation

Config: factor=16, beta_fast=32, beta_slow=1, orig_max_pos=65536
2026-05-31 11:25:52 +00:00
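The frequency blending described above can be sketched as follows, following the commonly used YaRN formulation. Assumptions in this sketch: a rotary dimension of 64 (taken from the partial-RoPE note elsewhere in this log), a base of 10000 (the model's actual theta is not stated here), and a hypothetical function name. The separate YaRN attention scaling factor is not covered.

```python
import math

def yarn_inv_freq(dim=64, base=10000.0, factor=16.0,
                  beta_fast=32.0, beta_slow=1.0, orig_max_pos=65536):
    def correction_dim(num_rotations):
        # Dimension index whose wavelength completes `num_rotations`
        # full turns over the original context length.
        return (dim * math.log(orig_max_pos / (num_rotations * 2 * math.pi))
                / (2 * math.log(base)))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # avoid a zero-width ramp

    inv_freq = []
    for i in range(dim // 2):
        plain = base ** (-2.0 * i / dim)        # standard RoPE frequency
        # ramp = 0: keep the frequency; ramp = 1: divide it by `factor`.
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        inv_freq.append(plain * (1.0 - ramp) + (plain / factor) * ramp)
    return inv_freq
```

With these assumed values the highest-frequency pairs are left untouched, while the lowest-frequency pairs move toward 1/16 of their original frequency, with a linear blend in between.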
88719f39b4 Add single-layer trace (Phase 2.6) for detailed debugging 2026-05-31 11:20:46 +00:00
8256e23aed Fix mHCContext attribute access (not tuple unpacking) and enable attention diag 2026-05-31 11:10:37 +00:00
72c139a59f Enable MHC_DIAG for diagnostic run 2026-05-31 11:07:23 +00:00
cd661c2e40 Add attention and Q/KV diagnostics (MHC_DIAG flag) 2026-05-31 11:07:17 +00:00
9584fcbc23 Fix top5_ids variable name in decode logging 2026-05-31 10:54:40 +00:00
a6d56d10ca Add top-20 logging and thinking token detection in decode loop 2026-05-31 10:49:28 +00:00
d891ae7e96 Fix prompt format: use DeepSeek V4 chat tokens
The model was trained with DeepSeek-specific chat tokens:
  <|User|> (128803), <|Assistant|> (128804), <|EOT|> (128805)
  Thinking: fi (128821), fl (128822)

Previous manual assembly just concatenated raw text without these tokens,
causing the model to not recognize user/assistant boundaries.

Format: <BOS><|User|>system prompt\n\nuser prompt<|Assistant|>
2026-05-31 10:33:41 +00:00
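The format line above amounts to a simple string template, sketched below. The `build_prompt` name is hypothetical and the `"<BOS>"` default is a placeholder, since the actual BOS token text is not given in this log.

```python
USER = "<|User|>"            # token id 128803 per the commit message
ASSISTANT = "<|Assistant|>"  # token id 128804 per the commit message

def build_prompt(system_prompt, user_prompt, bos="<BOS>"):
    # Format: <BOS><|User|>system prompt\n\nuser prompt<|Assistant|>
    # `bos` is a placeholder; substitute the tokenizer's real BOS string.
    return f"{bos}{USER}{system_prompt}\n\n{user_prompt}{ASSISTANT}"
```

The special tokens must be mapped to their single token ids by the tokenizer rather than being split into ordinary text pieces.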
f86742ef8e Cache layer weights on GPU — eliminates per-token CPU→GPU transfer
Previously, each prefill/decode token re-transferred ALL layer weights
from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made
prefill ~36s/token and caused the test to appear stuck.

Now: one-time cache_all_layer_weights() loads all 61 layers to their
target GPUs. Prefill should be ~1-2s/token instead of ~36s.

Also added flush=True to print statements so progress is visible.
2026-05-31 10:28:25 +00:00
ce3d6069cc CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post]
All three mHC parameter tensors (fn, base, scale) share the same ordering
as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)].

Previous code loaded base as [pre(4), post(4), res(16)] and scale as
[alpha_pre, alpha_post, alpha_res] — swapping S_res and S_post, and
alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to
be computed with wrong bias values, allowing the residual to explode.

Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums,
C_l values) to verify doubly-stochastic constraint is satisfied.
2026-05-31 10:07:14 +00:00
9a43e9aa77 CRITICAL FIX: mHC fn weight row ordering was wrong
fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw]
in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which
shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the
Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the
residual to explode (|X| 0.8 → 160K across 61 layers).

Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post
Wrong:   fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
2026-05-31 10:02:57 +00:00
0346e479d4 Add system prompt, CLI args, inverse RoPE flag, minimal e2e test
- System prompt added via chat template (reasoning model needs instructions)
- MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens)
- --no-inverse-rope flag to test without inverse RoPE on attn output
- --skip-moe flag to debug with shared expert only
- --max-tokens and --prompt CLI overrides
- minimal_e2e_test(): processes 'The' through full model, checks logits,
  tracks per-layer residual stream evolution, reports NaN/Inf/spread
- INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims,
  first 448 always un-RoPE'd, relative encoding may be intentional
2026-05-31 09:56:18 +00:00
33004dcbf4 Fix expert weight broadcasting (wt.item() for scalar multiply) 2026-05-31 09:22:27 +00:00
a2ee78b564 Fix RoPE shape bug (interleave needs separate even/odd assembly) 2026-05-31 09:15:59 +00:00
9d96c2fbbf CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip
BF16 cos/sin cache destroys cos²+sin²=1 identity (can be 0.996 in BF16).
This causes ~3% error per RoPE→inverse RoPE round-trip, accumulating
across 61 layers into garbage output. FP32 cache + FP32 arithmetic
gives exact round-trip (diff < 1e-7).

Also fixes: MoE expert loop indentation (was only running last expert).
2026-05-31 09:14:59 +00:00
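The loss of the cos²+sin²=1 identity can be reproduced without a GPU by emulating BF16 rounding on the float32 bit pattern. This is an illustrative sketch with hypothetical helper names; the arithmetic itself runs in Python floats, so only the cache precision is being modeled.

```python
import math
import struct

def to_bf16(x):
    # Round a value to BF16 (round-to-nearest-even on the top 16 bits
    # of its float32 representation) and return it as a Python float.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def identity_error(theta, quantize):
    # |cos^2 + sin^2 - 1| after quantizing the cached cos/sin values.
    c, s = quantize(math.cos(theta)), quantize(math.sin(theta))
    return abs(c * c + s * s - 1.0)

def roundtrip(x, y, theta, quantize):
    # Rotate (x, y) by theta, then apply the inverse rotation using the
    # same quantized cos/sin; the result is (x, y) * (c^2 + s^2).
    c, s = quantize(math.cos(theta)), quantize(math.sin(theta))
    xr, yr = x * c - y * s, x * s + y * c
    return xr * c + yr * s, -xr * s + yr * c
```

At theta = pi/3, for instance, sin rounds up to 0.8671875 in BF16, so the round-trip rescales the vector by about 1.002 instead of exactly 1. Individual angles err in either direction, which is how values such as 0.996 arise.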
db74a887ab Add minimal e2e test + fix MoE expert loop bug (indentation) 2026-05-31 09:14:03 +00:00
e195d9d3a7 add SKIP_ROUTED_MOE debug flag, re-enable sinks 2026-05-31 07:02:38 +00:00
4f28673bec debug: disable sinks in SDPA to check |X| impact 2026-05-31 06:51:58 +00:00
e3db90b56c switch back to original prompt 2026-05-31 06:40:01 +00:00
d2cf5ccc32 CRITICAL FIX: use SDPA for short sequences (FMHA padding bug)
FMHA pads N to next multiple of 128. For N<<128 (like 5 tokens),
the 123 padded zero-K entries contribute exp(0)=1 to the softmax
denominator, diluting real attention weights by ~128/5 = 25.6x.

This caused the model to produce incoherent output for short prompts.

Fix: use SDPA for seq_len < 120 (no padding), FMHA for longer
sequences where the padding effect is negligible.

Also: SDPA path includes attention sinks (paper D5c), FMHA path
uses analytic sink correction via LSE.
2026-05-31 06:39:23 +00:00
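The dilution arithmetic can be checked with a small softmax sketch. The helper names are hypothetical, and the model of the kernel is deliberately simplified: padded positions are treated as unmasked logits of exactly 0, as the commit message describes.

```python
import math

def padded_len(n, block=128):
    # Next multiple of `block` at or above n.
    return ((n + block - 1) // block) * block

def real_attention_mass(scores, block=128):
    # Fraction of softmax probability that lands on the real positions
    # when padded positions contribute logits of 0 (exp(0) = 1 each).
    pad = padded_len(len(scores), block) - len(scores)
    logits = list(scores) + [0.0] * pad
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    return sum(exps[:len(scores)]) / sum(exps)
```

For 5 real tokens whose scores are all 0, the real positions keep only 5/128 of the probability mass, which is the 25.6x dilution quoted above. With larger real scores the dilution is smaller but still present.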
5f98855141 test with simpler prompt 2026-05-31 06:28:45 +00:00
152af7295a debug: compare FMHA vs SDPA output at layer 0 2026-05-31 06:16:58 +00:00
59c75ca4e9 fix: cast attn_out back to BF16 after sink correction 2026-05-31 06:07:06 +00:00
e5245ea34e fix: V tensor must be (B, n_h, hd, N) for FMHA — was transposed wrong 2026-05-31 06:03:13 +00:00
91abf0f921 FMHA + analytic sink bias correction using LSE
Instead of SDPA with virtual sink position, use the production FMHA
kernel and apply the sink bias as a post-hoc correction on the output.

The correction is: O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink))

This simulates the attention sink (paper D5c) without modifying the
FMHA kernel. The sink absorbs some attention mass, reducing the
normalization constant and scaling down the output.
2026-05-31 05:58:01 +00:00
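The correction factor can be checked against a direct softmax that includes the sink column. The sketch below uses hypothetical helper names and plain Python lists for one query row; it rewrites exp(lse) / (exp(lse) + exp(sink)) as a sigmoid so that large logits do not overflow.

```python
import math

def logsumexp(scores):
    mx = max(scores)
    return mx + math.log(sum(math.exp(s - mx) for s in scores))

def sink_scale(scores, sink):
    # exp(lse) / (exp(lse) + exp(sink)) == 1 / (1 + exp(sink - lse)).
    return 1.0 / (1.0 + math.exp(sink - logsumexp(scores)))

def real_mass_direct(scores, sink):
    # Reference: softmax over [scores, sink], then sum the non-sink weights.
    logits = list(scores) + [sink]
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    return sum(exps[:-1]) / sum(exps)
```

Because every real position's weight shrinks by the same factor, multiplying the kernel output O_raw by `sink_scale` gives the same result as running the softmax with the sink column and then dropping it.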
04dd7545b3 switch to production FMHA for full run 2026-05-31 04:51:16 +00:00
738088cf49 revert: K=V with RoPE + inverse RoPE is the correct DSV4 approach 2026-05-31 04:51:10 +00:00
781ee43521 try separate K (RoPE'd) and V (raw) — no inverse RoPE needed 2026-05-31 04:46:14 +00:00
889521009b re-enable inverse RoPE (confirmed necessary — without it output is garbage) 2026-05-31 04:45:58 +00:00
92e465ca04 debug: disable inverse RoPE to check impact on output 2026-05-31 04:40:34 +00:00
c69dc51b3b switch to SDPA with sinks (better residual control) 2026-05-31 04:38:41 +00:00
3ed8f3cc44 switch back to production FMHA kernel (with FP4 LUT fix) 2026-05-31 04:32:01 +00:00
ae79bd8fce debug: add top-5 logit predictions 2026-05-31 04:25:01 +00:00
aafe2eee12 CRITICAL FIX: FP4 LUT was 4x too large!
E2M1 magnitudes are [0, 0.5, 1, 1.5, 2, 3, 4, 6] NOT [0, 2, 3, 4, 6, 8, 12, 24].
The old LUT entries were 2.67x to 4x the correct values, so every nonzero
NVFP4 dequantized weight came out too large. This compounded across 61
layers, causing the residual stream to explode and produce gibberish output.

This is the root cause of the residual growth and incoherent generation.
2026-05-31 04:16:13 +00:00
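The correct magnitudes follow directly from the E2M1 bit layout (1 sign bit, 2 exponent bits, 1 mantissa bit), as the sketch below shows. Function names are hypothetical, and the per-block scale factors that NVFP4 applies on top of these values are not covered.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def e2m1_from_bits(code):
    # Derive the magnitude from the low 3 bits: EE M.
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        # Subnormal: no implicit leading 1, so the value is 0 or 0.5.
        return 0.5 * man
    # Normal: (1 + m/2) * 2^(exp - 1), exponent bias of 1.
    return (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

def decode_e2m1(code):
    # Bit 3 is the sign; the low 3 bits index the magnitude table.
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0b0111]
```

Deriving the table from the bit fields rather than typing it by hand gives a cheap unit test against the kind of LUT error this commit fixes.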
b8c8da91fe fix: restore RoPE functions that were lost during mHC refactor 2026-05-31 04:10:51 +00:00
3f04a72af4 refactor: use production mHCLayer from dsv4.layers.mhc
Replace custom mHCBlock with wrapper around the tested production
mHCLayer class. This eliminates any bugs in my custom implementation
and uses the same code path that the model was designed for.

Weight mapping: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
base[0:4]=S_pre, base[4:8]=S_post, base[8:24]=S_res
scale[0]=alpha_pre, scale[1]=alpha_post, scale[2]=alpha_res
2026-05-31 04:06:58 +00:00
b519108cab fix: restore kv_cache.append that was accidentally removed 2026-05-31 03:56:58 +00:00
22a89b5a45 add attention sinks to SDPA path (paper D5c) 2026-05-31 03:52:59 +00:00
1905f19b8d fix: define q_input before USE_SDPA branch 2026-05-31 03:45:09 +00:00
cd073ad867 use PyTorch SDPA for correctness (no sink bias in FMHA kernel yet) 2026-05-31 03:42:03 +00:00
171a9e0d10 disable diagnostics for clean production run 2026-05-31 03:32:17 +00:00
3f9b441428 diag: fix n_layers reference in forward_layer, add late-layer diags 2026-05-31 03:28:53 +00:00