input_scale is the activation quantization scale (for FP8 inputs).
Since we use BF16 activations, the weight dequant is simply:
lut[weight] * weight_scale * weight_scale_2
Folding input_scale into the dequant made the weights ~4000x too small,
driving all attention and FFN outputs to effectively zero.
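Sketch of the dequant formula above in plain Python. Assumptions: the
standard FP4 E2M1 value table, one weight_scale per block of 16 weights,
and a scalar weight_scale_2. The function name and the list-based layout
are illustrative, not the actual loader code.

```python
# FP4 E2M1 magnitudes; codes 8..15 are the same values with the sign bit set.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
LUT = E2M1 + [-v for v in E2M1]

def dequant_nvfp4(codes, weight_scale, weight_scale_2, block=16):
    """codes: 4-bit ints; weight_scale: one scale per block of `block` codes."""
    assert len(codes) == block * len(weight_scale)
    # input_scale is deliberately absent: activations stay in BF16.
    return [LUT[c] * weight_scale[i // block] * weight_scale_2
            for i, c in enumerate(codes)]

codes = [2] * 16 + [15] * 16   # 1.0 in block 0, -6.0 in block 1
w = dequant_nvfp4(codes, weight_scale=[0.5, 2.0], weight_scale_2=0.25)
```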
The model uses DeepseekV4HyperHead to project from the 4-stream mHC
residual to the final hidden state. Just taking stream 0 (X[:,0,:])
is WRONG — the hc_head learns how to combine the 4 streams.
Also:
- Remove --no-thinking mode (this is a reasoning model, it MUST think)
- Increase default max_tokens from 512 to 4096
- Load hc_head weights (fn, base, scale) from checkpoint
Compares forward_layer output with step-by-step PyTorch reference
to identify where residual blowup originates. Uses our own NVFP4
dequant — no HF dependency.
Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection):
1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb]
- Was applying Sinkhorn to post values and 2*sigmoid to comb values
- This caused residual to grow unbounded (no doubly-stochastic constraint)
2. comb (B_l) must be TRANSPOSED in post_block
- HF: comb.transpose(-1,-2) @ hidden_streams
- Was using B_l @ X_l without transpose
3. Sinkhorn must start from softmax(logits) + eps, not exp(logits)
- HF: softmax → col norm → (iters-1) alternating
- Was using exp → alternating (different convergence behavior)
4. Missing hc_eps on pre (A_l)
- HF: sigmoid(...) + hc_eps
- Was missing the eps guard
5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout
- Matches checkpoint naming and HF model
6. Fixed fallback mHC initialization to use new API
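Sketch of the Sinkhorn schedule from item 3, in plain Python. Assumptions:
softmax is taken per row, eps is added once right after it, and each of the
(iters-1) rounds is a row normalization followed by a column normalization.
The function name and defaults are illustrative; check them against the HF
code before relying on them.

```python
import math

def sinkhorn(logits, iters=20, eps=1e-6):
    """softmax(logits) + eps -> col norm -> (iters - 1) row/col rounds."""
    m = []
    for row in logits:
        mx = max(row)                      # subtract max for stability
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        m.append([v / s + eps for v in e])

    def col_norm(a):
        sums = [sum(r[j] for r in a) for j in range(len(a[0]))]
        return [[r[j] / sums[j] for j in range(len(r))] for r in a]

    def row_norm(a):
        return [[v / sum(r) for v in r] for r in a]

    m = col_norm(m)
    for _ in range(iters - 1):
        m = col_norm(row_norm(m))
    return m
```

The result should be close to doubly stochastic: column sums are exactly
normalized by the last step, row sums approach 1 as iters grows.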
The HuggingFace reference treats attention sinks as a logit bias:
1. Compute raw Q*K scores
2. Concatenate sinks as a logit column
3. Softmax the combined logits
4. DROP the sink column (it has no V entry)
5. Multiply the remaining weights by V
Our old code added sinks as a dummy zero-KV entry, which diluted
attention weights by adding an extra V=0 position to the softmax.
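Sketch of the five steps for a single query, in plain Python. The function
name and the one-sink-logit-per-query layout are assumptions for
illustration; the real code works on batched tensors.

```python
import math

def attend_with_sink(scores, sink_logit, values):
    """scores: raw Q*K logits for one query; values: one vector per key."""
    logits = scores + [sink_logit]        # sink joins as an extra logit column
    mx = max(logits)
    e = [math.exp(v - mx) for v in logits]
    z = sum(e)
    probs = [v / z for v in e][:-1]       # drop the sink column after softmax
    dim = len(values[0])
    out = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
    return probs, out

probs, out = attend_with_sink([0.0, 0.0], 0.0, [[3.0], [6.0]])
```

The kept weights no longer sum to 1: the sink absorbs part of the
probability mass without contributing a V row.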
The HuggingFace reference (DeepseekV4ForCausalLM) applies an unweighted
RMSNorm after q_b_proj, normalizing Q before attention. Without it, Q
magnitudes are too large, causing attention scores to collapse to uniform
(entropy ~3.2 with 24 positions; the uniform maximum is ln 24 ≈ 3.18)
and the model to produce garbage.
q_b_norm has no learnable parameters — just q / RMS(q).
This explains the nearly-uniform attention weights we've been seeing.
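Sketch of the unweighted norm in plain Python. The eps term and the
function name are assumptions; the commit only specifies q / RMS(q).

```python
import math

def rms_norm_unweighted(q, eps=1e-6):
    """Scale q to unit RMS; there is no learnable gain."""
    rms = math.sqrt(sum(v * v for v in q) / len(q) + eps)
    return [v / rms for v in q]

q_normed = rms_norm_unweighted([300.0, -400.0])
```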
When enabled, bypasses mHC pre/post blocks and uses direct residual
connections with 0.1 scaling. This helps isolate whether the mHC
implementation is causing the garbage output.
The DSV4 Pro model uses rope_type='yarn' with factor=16. Our
build_rope_cache was using standard RoPE with theta=10000, completely
ignoring YaRN scaling. This produced wrong cos/sin values for all
positions, causing incorrect attention scores and garbage output.
YaRN modifies the RoPE frequencies:
- High-frequency components: unchanged
- Low-frequency components: scaled by 1/factor
- Medium: smooth interpolation
Config: factor=16, beta_fast=32, beta_slow=1, orig_max_pos=65536
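Sketch of the frequency blend in plain Python, following the common YaRN
formulation (linear ramp between the beta_fast and beta_slow correction
dims). Assumptions: rotary dim 64, base theta=10000, and the function name;
the YaRN attention-scale factor is left out.

```python
import math

def yarn_inv_freq(dim=64, base=10000.0, factor=16.0,
                  beta_fast=32.0, beta_slow=1.0, orig_max_pos=65536):
    def correction_dim(num_rotations):
        # Dimension index whose wavelength fits num_rotations into orig_max_pos.
        return (dim * math.log(orig_max_pos / (num_rotations * 2 * math.pi))
                / (2 * math.log(base)))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001                      # avoid division by zero in the ramp

    inv_freq = []
    for i in range(dim // 2):
        plain = 1.0 / (base ** (2 * i / dim))              # standard RoPE
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        # ramp=0 keeps the frequency, ramp=1 divides it by factor.
        inv_freq.append(plain * (1.0 - ramp) + (plain / factor) * ramp)
    return inv_freq
```

Under these assumed values the ramp never reaches 1 within the 32
frequency pairs, so the lowest frequencies are mostly, not fully, scaled
by 1/factor.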
The model was trained with DeepSeek-specific chat tokens:
<|User|> (128803), <|Assistant|> (128804), <|EOT|> (128805)
Thinking: fi (128821), fl (128822)
Previous manual assembly just concatenated raw text without these tokens,
causing the model to not recognize user/assistant boundaries.
Format: <BOS><|User|>system prompt\n\nuser prompt<|Assistant|>
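Sketch of the prompt assembly as a string template. build_prompt and the
literal "<BOS>" placeholder are hypothetical; in practice the BOS string
comes from the tokenizer, and the role markers must encode to the single
special-token ids listed above rather than to raw text pieces.

```python
USER, ASSISTANT = "<|User|>", "<|Assistant|>"

def build_prompt(system_prompt, user_prompt, bos="<BOS>"):
    """<BOS><|User|>system prompt, blank line, user prompt<|Assistant|>"""
    return f"{bos}{USER}{system_prompt}\n\n{user_prompt}{ASSISTANT}"

prompt = build_prompt("You are a helpful assistant.", "What is 2+2?")
```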
Previously, each prefill/decode token re-transferred ALL layer weights
from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made
prefill ~36s/token and caused the test to appear stuck.
Now: one-time cache_all_layer_weights() loads all 61 layers to their
target GPUs. Prefill should be ~1-2s/token instead of ~36s.
Also added flush=True to print statements so progress is visible.
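Sketch of the caching idea with a transfer counter, in plain Python. This
is a lazy variant (load on first use) rather than the eager
cache_all_layer_weights() described above; make_cached_loader and the
stand-in load function are hypothetical.

```python
def make_cached_loader(load_layer):
    """Wrap a per-layer load so each layer is transferred at most once."""
    cache = {}
    def get(layer_idx):
        if layer_idx not in cache:
            cache[layer_idx] = load_layer(layer_idx)
        return cache[layer_idx]
    return get

transfers = []   # records every simulated CPU-to-GPU transfer
get_weights = make_cached_loader(lambda i: transfers.append(i) or f"weights-{i}")

for _token in range(66):
    for layer in range(61):
        get_weights(layer)
```

With the cache, the 66 × 61 layer visits trigger 61 transfers instead of
4026.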
All three mHC parameter tensors (fn, base, scale) share the same ordering
as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)].
Previous code loaded base as [pre(4), post(4), res(16)] and scale as
[alpha_pre, alpha_post, alpha_res] — swapping S_res and S_post, and
alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to
be computed with wrong bias values, allowing the residual to explode.
Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums,
C_l values) to verify doubly-stochastic constraint is satisfied.
fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw]
in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which
shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the
Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the
residual to explode (|X| 0.8 → 160K across 61 layers).
Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post
Wrong: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
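Sketch of the corrected row split in plain Python. split_mhc_rows is a
hypothetical helper; the same [pre, res, post] slicing is assumed to apply
to fn, base and scale as stated above.

```python
def split_mhc_rows(rows, n_streams=4):
    """Split 24 rows into pre(4), res(16), post(4) for 4 streams."""
    n = n_streams
    assert len(rows) == 2 * n + n * n
    pre = rows[:n]                  # fn[0:4]   -> W_pre
    res = rows[n:n + n * n]         # fn[4:20]  -> W_res
    post = rows[n + n * n:]         # fn[20:24] -> W_post
    return pre, res, post

pre, res, post = split_mhc_rows(list(range(24)))
```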
- System prompt added via chat template (reasoning model needs instructions)
- MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens)
- --no-inverse-rope flag to test without inverse RoPE on attn output
- --skip-moe flag to debug with shared expert only
- --max-tokens and --prompt CLI overrides
- minimal_e2e_test(): processes 'The' through full model, checks logits,
tracks per-layer residual stream evolution, reports NaN/Inf/spread
- INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims,
first 448 always un-RoPE'd, relative encoding may be intentional