Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection):
1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb]
- Was applying Sinkhorn to post values and 2*sigmoid to comb values
- This caused residual to grow unbounded (no doubly-stochastic constraint)
2. comb (B_l) must be TRANSPOSED in post_block
- HF: comb.transpose(-1,-2) @ hidden_streams
- Was using B_l @ X_l without transpose
3. Sinkhorn must start from softmax(logits) + eps, not exp(logits)
- HF: softmax → col norm → (iters-1) alternating
- Was using exp → alternating (different convergence behavior)
4. Missing hc_eps on pre (A_l)
- HF: sigmoid(...) + hc_eps
- Was missing the eps guard
5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout
- Matches checkpoint naming and HF model
6. Fixed fallback mHC initialization to use new API
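
The Sinkhorn start described in item 3 can be sketched in plain Python as below. This is a minimal illustration, not the HF code: the softmax axis (rows), the eps value, the iteration count, and the function names are all assumptions made for the example.

```python
import math

def softmax_rows(logits):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in logits:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def normalize_cols(mat):
    """Scale every column so that it sums to 1."""
    n_cols = len(mat[0])
    col_sums = [sum(row[j] for row in mat) for j in range(n_cols)]
    return [[row[j] / col_sums[j] for j in range(n_cols)] for row in mat]

def normalize_rows(mat):
    """Scale every row so that it sums to 1."""
    return [[v / sum(row) for v in row] for row in mat]

def sinkhorn(logits, iters=20, eps=1e-6):
    # Start from softmax(logits) + eps rather than exp(logits).
    mat = [[v + eps for v in row] for row in softmax_rows(logits)]
    # One column normalization, then (iters - 1) alternating row/col passes.
    mat = normalize_cols(mat)
    for _ in range(iters - 1):
        mat = normalize_rows(mat)
        mat = normalize_cols(mat)
    return mat

# Hypothetical 4x4 comb logits, chosen only for illustration.
logits = [
    [0.2, -0.1, 0.4, 0.0],
    [-0.3, 0.5, 0.1, -0.2],
    [0.0, 0.3, -0.4, 0.2],
    [0.1, -0.5, 0.0, 0.4],
]
B = sinkhorn(logits)
row_sums = [sum(row) for row in B]
col_sums = [sum(row[j] for row in B) for j in range(4)]
```

If the ordering bug from item 1 were present, the matrix fed to this routine would be the wrong slice, so the row and column sums of the actual comb matrix would not be constrained at all.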
The HuggingFace reference treats attention sinks as a logit bias:
1. Compute raw Q*K scores
2. Concatenate sinks as a logit column
3. Softmax the combined logits
4. Drop the sink column so it never contributes a V term
5. Multiply the remaining weights by V
Our old code added sinks as a dummy zero-KV entry, which diluted
attention weights by adding an extra V=0 position to the softmax.
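
The five steps above can be sketched for a single query as follows. The function name and the list-based layout are assumptions for illustration; the real code operates on batched tensors.

```python
import math

def attend_with_sink(scores, values, sink_logit):
    """scores: raw Q*K logits for one query; values: one V row per key."""
    # Steps 1-2: append the sink as one extra logit column.
    logits = list(scores) + [sink_logit]
    # Step 3: softmax over the combined logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    # Step 4: drop the sink column; the kept weights now sum to less than 1.
    weights = [e / total for e in exps[:-1]]
    # Step 5: multiply the remaining weights by V.
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, out

# Hypothetical numbers, chosen only for illustration.
scores = [1.0, 2.0, 0.5]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, out = attend_with_sink(scores, values, sink_logit=0.0)
```

The sink only enlarges the softmax denominator; it never adds a row to V.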
The HuggingFace reference (DeepseekV4ForCausalLM) applies an unweighted
RMSNorm after q_b_proj, normalizing Q before attention. Without it, Q
magnitudes are too large, causing attention scores to collapse to uniform
(entropy ~3.2 with 24 positions, close to the ln(24) ≈ 3.18 maximum) and
the model to produce garbage.
q_b_norm has no learnable parameters: it is just q / RMS(q).
This explains the nearly-uniform attention weights we've been seeing.
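
A minimal sketch of the unweighted norm, q / RMS(q), on a single vector. The eps guard and its value are assumptions; the source only states that there are no learnable parameters.

```python
import math

def rms_norm_unweighted(q, eps=1e-6):
    """Return q / RMS(q); eps (assumed value) guards against division by zero."""
    mean_sq = sum(x * x for x in q) / len(q)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms for x in q]

# Hypothetical oversized query vector.
q = [30.0, -40.0, 50.0, 10.0]
q_normed = rms_norm_unweighted(q)
rms_after = math.sqrt(sum(x * x for x in q_normed) / len(q_normed))
```

After normalization the vector has RMS close to 1 regardless of its input scale, and its direction is unchanged.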
When enabled, bypasses mHC pre/post blocks and uses direct residual
connections with 0.1 scaling. This helps isolate whether the mHC
implementation is causing the garbage output.
The DSV4 Pro model uses rope_type='yarn' with factor=16. Our
build_rope_cache was using standard RoPE with theta=10000, completely
ignoring YaRN scaling. This produced wrong cos/sin values for all
positions, causing incorrect attention scores and garbage output.
YaRN modifies the RoPE frequencies:
- High-frequency components: unchanged
- Low-frequency components: scaled by 1/factor
- Medium: smooth interpolation
Config: factor=16, beta_fast=32, beta_slow=1, orig_max_pos=65536
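
The frequency blend above can be sketched as follows. This follows the commonly used YaRN correction-range formulation and is not copied from the model code; dim=64 and base=10000 are assumptions for the example (the checkpoint's actual rope theta is not stated here), and any attention-scale adjustment is left out.

```python
import math

def yarn_correction_dim(num_rotations, dim, base, orig_max_pos):
    """Dimension index at which a frequency completes num_rotations turns."""
    return dim * math.log(orig_max_pos / (num_rotations * 2 * math.pi)) / (2 * math.log(base))

def yarn_inv_freq(dim=64, base=10000.0, factor=16.0,
                  beta_fast=32.0, beta_slow=1.0, orig_max_pos=65536):
    half = dim // 2
    low = max(math.floor(yarn_correction_dim(beta_fast, dim, base, orig_max_pos)), 0)
    high = min(math.ceil(yarn_correction_dim(beta_slow, dim, base, orig_max_pos)), dim - 1)
    if low == high:
        high += 0.001  # avoid a zero-width ramp
    inv_freq = []
    for i in range(half):
        plain = base ** (-2.0 * i / dim)  # standard RoPE inverse frequency
        # ramp = 0: keep the high frequency; ramp = 1: scale by 1/factor.
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        inv_freq.append(plain * (1.0 - ramp) + (plain / factor) * ramp)
    return inv_freq

inv_freq = yarn_inv_freq()
plain_freq = [10000.0 ** (-2.0 * i / 64) for i in range(32)]
```

With these example values the lowest-index (fastest) frequencies match standard RoPE exactly, while the slowest ones are pulled toward plain / factor.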
The model was trained with DeepSeek-specific chat tokens:
<|User|> (128803), <|Assistant|> (128804), <|EOT|> (128805)
Thinking: fi (128821), fl (128822)
Previous manual assembly just concatenated raw text without these tokens,
causing the model to not recognize user/assistant boundaries.
Format: <BOS><|User|>system prompt\n\nuser prompt<|Assistant|>
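
The format line above can be assembled as in this sketch. The helper name is hypothetical, and "<BOS>" is a placeholder for whatever BOS string the tokenizer actually uses.

```python
USER_TOKEN = "<|User|>"
ASSISTANT_TOKEN = "<|Assistant|>"

def build_prompt(system_prompt, user_prompt, bos="<BOS>"):
    """Assemble <BOS><|User|>system\\n\\nuser<|Assistant|> (bos is a placeholder)."""
    return f"{bos}{USER_TOKEN}{system_prompt}\n\n{user_prompt}{ASSISTANT_TOKEN}"

prompt = build_prompt("You are a helpful assistant.", "What is 2+2?")
```

The special tokens must be encoded as single token ids by the tokenizer rather than as raw text for the model to see the turn boundaries.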
Previously, each prefill/decode token re-transferred ALL layer weights
from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made
prefill ~36s/token and caused the test to appear stuck.
Now: one-time cache_all_layer_weights() loads all 61 layers to their
target GPUs. Prefill should be ~1-2s/token instead of ~36s.
Also added flush=True to print statements so progress is visible.
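
The transfer counts can be checked with a toy model of the two strategies. CountingLoader and both run functions are hypothetical stand-ins, not the real loader or cache_all_layer_weights().

```python
class CountingLoader:
    """Stand-in for the CPU-to-GPU transfer; it only counts calls."""
    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.transfers = 0

    def load(self, layer):
        self.transfers += 1
        return f"weights[{layer}]"

def run_uncached(loader, num_tokens):
    # Old behavior: every token reloads every layer.
    for _ in range(num_tokens):
        for layer in range(loader.num_layers):
            loader.load(layer)
    return loader.transfers

def run_cached(loader, num_tokens):
    # New behavior: load each layer once, then reuse it for all tokens.
    cache = {layer: loader.load(layer) for layer in range(loader.num_layers)}
    for _ in range(num_tokens):
        for layer in range(loader.num_layers):
            _ = cache[layer]
    return loader.transfers

uncached = run_uncached(CountingLoader(61), num_tokens=66)
cached = run_cached(CountingLoader(61), num_tokens=66)
```

Here uncached is 66 × 61 = 4026 transfers and cached is 61.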
All three mHC parameter tensors (fn, base, scale) share the same ordering
as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)].
Previous code loaded base as [pre(4), post(4), res(16)] and scale as
[alpha_pre, alpha_post, alpha_res] — swapping S_res and S_post, and
alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to
be computed with wrong bias values, allowing the residual to explode.
Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums,
C_l values) to verify doubly-stochastic constraint is satisfied.
fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw]
in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which
shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the
Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the
residual to explode (|X| 0.8 → 160K across 61 layers).
Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post
Wrong: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
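
The correct and wrong slicings can be compared with a small sketch. Row indices stand in for the real weight rows, and the function name is hypothetical.

```python
def split_mhc_rows(rows, n_pre=4, n_res=16, n_post=4):
    """Split 24 rows in [pre, res, post] order."""
    if len(rows) != n_pre + n_res + n_post:
        raise ValueError("unexpected number of rows")
    w_pre = rows[:n_pre]
    w_res = rows[n_pre:n_pre + n_res]
    w_post = rows[n_pre + n_res:]
    return w_pre, w_res, w_post

rows = list(range(24))  # row indices stand in for the weight rows
w_pre, w_res, w_post = split_mhc_rows(rows)

# The previous (wrong) layout, for comparison: [pre, post, res].
wrong_post = rows[4:8]
wrong_res = rows[8:24]
```

Under the wrong layout, W_post receives rows 4..7, which belong to W_res, and W_res starts 4 rows too late.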
- System prompt added via chat template (reasoning model needs instructions)
- MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens)
- --no-inverse-rope flag to test without inverse RoPE on attn output
- --skip-moe flag to debug with shared expert only
- --max-tokens and --prompt CLI overrides
- minimal_e2e_test(): processes 'The' through full model, checks logits,
tracks per-layer residual stream evolution, reports NaN/Inf/spread
- INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims,
first 448 always un-RoPE'd, relative encoding may be intentional
FMHA pads N to next multiple of 128. For N<<128 (like 5 tokens),
the 123 padded zero-K entries contribute exp(0)=1 to the softmax
denominator, diluting real attention weights by ~128/5 = 25.6x when the
real logits are also near zero.
This caused the model to produce incoherent output for short prompts.
Fix: use SDPA for seq_len < 120 (no padding), FMHA for longer
sequences where the padding effect is negligible.
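
The dilution arithmetic and the length-based dispatch can be reproduced with a short sketch. The function names and the "sdpa"/"fmha" labels are hypothetical, and real logits are set to zero to match the worst-case figure quoted above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dilution_factor(real_logits, block=128):
    """Ratio of an unpadded attention weight to the same weight after padding
    the key axis with zero logits up to the next multiple of block."""
    n = len(real_logits)
    padded_len = math.ceil(n / block) * block
    clean = softmax(real_logits)
    padded = softmax(list(real_logits) + [0.0] * (padded_len - n))
    return clean[0] / padded[0]

def pick_attention_path(seq_len, threshold=120):
    # Short sequences avoid the padded kernel entirely.
    return "sdpa" if seq_len < threshold else "fmha"

factor_5 = dilution_factor([0.0] * 5)
factor_120 = dilution_factor([0.0] * 120)
```

With 5 tokens the factor is 128/5 = 25.6; with 120 tokens it is only 128/120, about 1.07.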
Also: SDPA path includes attention sinks (paper D5c), FMHA path
uses analytic sink correction via LSE.
Instead of SDPA with virtual sink position, use the production FMHA
kernel and apply the sink bias as a post-hoc correction on the output.
The correction is: O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink))
This simulates the attention sink (paper D5c) without modifying the
FMHA kernel. The sink absorbs some attention mass: it adds exp(sink) to
the softmax denominator, which scales down the output.
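
The correction can be checked against an explicit sink column on a single query. This is a sketch with hypothetical names and numbers; it uses the identity exp(lse) / (exp(lse) + exp(sink)) = sigmoid(lse - sink), which avoids overflow for large lse.

```python
import math

def sink_correction(o_raw, lse, sink):
    """Rescale a sink-free attention output by sigmoid(lse - sink)."""
    scale = 1.0 / (1.0 + math.exp(sink - lse))
    return [x * scale for x in o_raw]

# Hypothetical single-query example.
scores = [1.0, 2.0, 0.5]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sink = 0.3

# What a sink-free kernel would return: softmax(scores) @ V and the LSE.
exps = [math.exp(s) for s in scores]
denom = sum(exps)
lse = math.log(denom)
o_raw = [sum(e / denom * v[d] for e, v in zip(exps, values)) for d in range(2)]
corrected = sink_correction(o_raw, lse, sink)

# Reference: softmax over [scores, sink], then drop the sink column.
denom_sink = denom + math.exp(sink)
explicit = [sum(e / denom_sink * v[d] for e, v in zip(exps, values)) for d in range(2)]
```

Both paths give the same output up to floating-point rounding, so the post-hoc rescale is equivalent to adding the sink as a logit column.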