1955 Commits

Author SHA1 Message Date
79d1a83348 Add NEXT_STEPS.md: post v0.1 issues, kernel migration plan, lessons learned 2026-05-31 22:30:34 +00:00
acc20dffd7 CRITICAL FIX: don't fold input_scale into NVFP4 weight dequant
input_scale is the activation quantization scale (for FP8 inputs).
Since we use BF16 activations, the weight dequant is simply:
  lut[weight] * weight_scale * weight_scale_2

Folding input_scale in produced weights ~4000x too small,
causing all attention and FFN outputs to be effectively zero.
Tag: v0.1-e2e-working
2026-05-31 22:03:55 +00:00
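
A minimal sketch of the dequant formula from this commit, `lut[weight] * weight_scale * weight_scale_2`. The lookup table holds the standard E2M1 value set. The function name, the one-scale-per-16-values block size, and the unpacked 4-bit code layout are assumptions for illustration, not the repository's actual code.

```python
# E2M1 magnitudes for the low 3 bits; bit 3 is the sign bit.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
LUT = E2M1 + [-v for v in E2M1]


def dequant_nvfp4(codes, weight_scale, weight_scale_2, block=16):
    """Dequantize unpacked 4-bit codes (hypothetical helper).

    codes          -- list of ints in [0, 15], one per weight
    weight_scale   -- one scale per block of `block` weights (assumed layout)
    weight_scale_2 -- single global scale
    input_scale is deliberately NOT used: it belongs to FP8 activations.
    """
    return [
        LUT[c] * weight_scale[i // block] * weight_scale_2
        for i, c in enumerate(codes)
    ]


codes = [2, 10, 7] + [0] * 13 + [1]      # 17 weights, so two blocks
weights = dequant_nvfp4(codes, [2.0, 4.0], 0.5)
```

With BF16 activations the result is used directly in the matmul; multiplying by a small `input_scale` as well would shrink every weight by that factor, which is the failure this commit describes.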
4e64acbb64 fix MoE gate BF16/NVFP4 handling, add attention diagnostics 2026-05-31 21:57:47 +00:00
0d2b5ceb93 fix positions device mismatch: move to rope cache device in forward_attention 2026-05-31 21:54:56 +00:00
2676476013 fix mHC pre_block bmm dtype mismatch: A is FP32, X is BF16 2026-05-31 21:51:59 +00:00
eb08cd06d1 Rewrite single_shot_inference.py: correct weight keys, NVFP4 two-level scale, compressor+indexer connected
- Fixed weight key format: model.layers.{li}.self_attn.* (was layers.{li}.attn.*)
- Added NVFP4 two-level scale: weight_scale * weight_scale_2 * input_scale
- Proper CSA compressor: overlapping Ca/Cb streams, token-level softmax
- Proper HCA compressor: non-overlapping, single stream
- Indexer: NVFP4 q_b_proj + weights_proj + own compressor at index_head_dim
- Compressed KV (dim=hd) concatenated with SWA KV for attention
- Correct MoE key format: gate_proj/up_proj/down_proj
- Correct mHC key format: attn_hc.{fn,base,scale} and ffn_hc.{fn,base,scale}
- No more disconnected compressor — full E2E pipeline
2026-05-31 21:48:59 +00:00
4988e77179 probe key format 2026-05-31 21:42:52 +00:00
ba915dbd53 add probe_shapes script 2026-05-31 21:41:31 +00:00
c54dd15550 find hc keys 2026-05-31 21:38:43 +00:00
52b4971711 Full E2E single-shot: compressor, indexer, correct checkpoint keys (layers.{li}.attn/ffn)
- Fixed checkpoint key prefix: layers.{li}.attn.* and layers.{li}.ffn.*
  (was incorrectly model.layers.{li}.self_attn.* and .mlp.*)
- Added Compressor (CSA ratio=4 overlapping, HCA ratio=128)
- Added Indexer (CSA top-k selection)
- Compressor wkv/wgate are BF16 (NOT NVFP4 — no .scale)
- MoE gate is BF16 (not NVFP4)
- Added KV cache with SWA ring buffer + compressed entries
- Attention sinks as logit bias (paper D5c)
- YaRN RoPE with factor=16
- Proper mHC with Sinkhorn-Knopp
- HcHead for final mHC readout
- Still TODO: proper compressed KV attention (currently SWA-only)
2026-05-31 21:36:17 +00:00
cec17fee7d fixed prefix 2026-05-31 21:26:04 +00:00
696f3261ab focused key dump 2026-05-31 21:25:31 +00:00
b7c9bb1262 dump all keys 2026-05-31 21:24:58 +00:00
54e2a3684a filter expert keys 2026-05-31 21:24:35 +00:00
bafabda01f add checkpoint key dump script 2026-05-31 21:24:14 +00:00
23f1cf4065 Fix HcHead: use FP32 for RMSNorm + linear (matches HF reference) 2026-05-31 21:13:21 +00:00
274ea13251 Fix critical bug: add hc_head for final mHC readout (was using stream 0)
The model uses DeepseekV4HyperHead to project from the 4-stream mHC
residual to the final hidden state. Just taking stream 0 (X[:,0,:])
is WRONG — the hc_head learns how to combine the 4 streams.

Also:
- Remove --no-thinking mode (this is a reasoning model, it MUST think)
- Increase default max_tokens from 512 to 4096
- Load hc_head weights (fn, base, scale) from checkpoint
2026-05-31 21:13:02 +00:00
baee36e728 Fix dtype mismatch in validate_layer: cast flat to float before F.linear 2026-05-31 20:23:18 +00:00
46c4ef2cf5 Add per-layer validation test (tests/validate_layer.py)
Compares forward_layer output with step-by-step PyTorch reference
to identify where residual blowup originates. Uses our own NVFP4
dequant — no HF dependency.
2026-05-31 20:22:13 +00:00
abe4210367 Add compact per-layer residual trace (GROWTH_DIAG), disable verbose ATTN_DIAG 2026-05-31 20:21:03 +00:00
98fa410167 Add HF reference test script 2026-05-31 20:11:37 +00:00
a1b39adcaa Add attention entropy diag (ATTN_DIAG), KV cache diag, --no-thinking mode 2026-05-31 19:29:55 +00:00
2a886fe0f2 Add --no-thinking mode to skip thinking tokens and use second-best 2026-05-31 19:24:21 +00:00
41ef0ebd0f Add KV cache length diagnostic during decode 2026-05-31 19:17:24 +00:00
8baebf3c2e Restore --skip-mhc arg, empty system prompt for testing 2026-05-31 19:04:53 +00:00
ca661d32e8 Empty system prompt for testing (was causing model to regurgitate AI assistant tokens) 2026-05-31 19:03:55 +00:00
b09b2cf511 Fix MoE routing: hash layers 0-2 (tid2eid), e_score_correction_bias for layers 3+
- Layers 0-2 use hash routing (tid2eid lookup, uniform weights)
- Layers 3+ use noaux_tc (sqrt(softplus) + e_score_correction_bias for selection only)
- Fixed e_bias key name: e_score_correction_bias (not e_bias)
- Hash routing detection: check tid2eid present AND e_score_correction_bias absent
2026-05-31 18:52:38 +00:00
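
The routing rule in this commit can be sketched roughly as follows. The function name, the argument layout, the `1/k` reading of "uniform weights", and the renormalisation of the selected scores are assumptions; only the detection rule and the "bias for selection only" behaviour come from the commit message.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def route(token_id, logits, top_k, tid2eid=None, e_score_correction_bias=None):
    """Return (expert_ids, weights) for one token (hypothetical helper)."""
    # Hash routing: tid2eid present AND e_score_correction_bias absent.
    if tid2eid is not None and e_score_correction_bias is None:
        experts = list(tid2eid[token_id])
        return experts, [1.0 / len(experts)] * len(experts)

    # Learned routing: score = sqrt(softplus(logit)).
    scores = [math.sqrt(softplus(x)) for x in logits]
    bias = e_score_correction_bias or [0.0] * len(scores)
    # The bias only influences WHICH experts are picked ...
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    # ... the mixing weights come from the unbiased scores (normalised here).
    total = sum(scores[e] for e in chosen)
    return chosen, [scores[e] / total for e in chosen]


hash_experts, hash_weights = route(5, [], 2, tid2eid={5: [3, 7]})
experts, weights = route(0, [0.0, 1.0, 2.0], 2,
                         e_score_correction_bias=[10.0, 0.0, 0.0])
```

In the second call the large bias forces expert 0 into the selection, yet its weight stays below expert 2's because the weights ignore the bias.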
7d9e70c5d5 Fix remaining mHC API references: layer_compare.py, layer.py comment 2026-05-31 18:38:34 +00:00
7b123d159f CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax
Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection):
1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb]
   - Was applying Sinkhorn to post values and 2*sigmoid to comb values
   - This caused residual to grow unbounded (no doubly-stochastic constraint)
2. comb (B_l) must be TRANSPOSED in post_block
   - HF: comb.transpose(-1,-2) @ hidden_streams
   - Was using B_l @ X_l without transpose
3. Sinkhorn must start from softmax(logits) + eps, not exp(logits)
   - HF: softmax → col norm → (iters-1) alternating
   - Was using exp → alternating (different convergence behavior)
4. Missing hc_eps on pre (A_l)
   - HF: sigmoid(...) + hc_eps
   - Was missing the eps guard
5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout
   - Matches checkpoint naming and HF model
6. Fixed fallback mHC initialization to use new API
2026-05-31 18:38:12 +00:00
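
Item 3 above (softmax, then column normalisation, then `iters - 1` alternating passes) might look like the sketch below. It is pure Python on nested lists rather than the tensor code in the repository; the softmax axis (rows), the default `iters`, and the default `eps` are assumptions.

```python
import math


def sinkhorn(logits, iters=20, eps=1e-6):
    """Project a square matrix of logits towards a doubly-stochastic matrix."""
    # Start from softmax(logits) + eps, not exp(logits).
    m = []
    for row in logits:
        mx = max(row)
        ex = [math.exp(v - mx) for v in row]
        total = sum(ex)
        m.append([e / total + eps for e in ex])

    def col_norm(mat):
        n = len(mat[0])
        sums = [sum(r[j] for r in mat) for j in range(n)]
        return [[r[j] / sums[j] for j in range(n)] for r in mat]

    def row_norm(mat):
        return [[v / sum(r) for v in r] for r in mat]

    m = col_norm(m)                  # first pass: columns only
    for _ in range(iters - 1):       # remaining passes alternate rows, columns
        m = col_norm(row_norm(m))
    return m


comb = sinkhorn([[0.1, 0.2, 0.3, 0.4],
                 [0.4, 0.3, 0.2, 0.1],
                 [0.0, 0.5, 0.0, 0.5],
                 [1.0, 0.0, 0.0, 0.0]])
row_sums = [sum(r) for r in comb]
col_sums = [sum(r[j] for r in comb) for j in range(4)]
```

Checking that both `row_sums` and `col_sums` are close to 1 is the same doubly-stochastic check the per-layer diagnostics are meant to confirm.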
f6c02f808f Add layer-by-layer comparison test for debugging 2026-05-31 12:48:43 +00:00
6ad577bd18 Add HuggingFace reference comparison test 2026-05-31 12:05:19 +00:00
581c4170f9 Fix sink logits shape: (n_h, T, 1) for concatenation with (n_h, T, seq_len) 2026-05-31 11:57:23 +00:00
0f951a0b1a Fix attention sinks: logit bias (HuggingFace reference), not dummy KV
The HuggingFace reference treats attention sinks as a logit bias:
1. Compute raw Q*K scores
2. Concatenate sinks as a logit column
3. Softmax the combined logits
4. DROP the sink column (it is never multiplied by V)
5. Multiply the remaining weights by V

Our old code added sinks as a dummy zero-KV entry, which diluted
attention weights by adding an extra V=0 position to the softmax.
2026-05-31 11:53:43 +00:00
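
The five steps above can be sketched for a single head and a single query position. The helper name and the list-based layout are illustrative assumptions; the real code operates on batched tensors.

```python
import math


def attend_with_sink(scores, sink_logit, values):
    """scores: raw Q*K logits, one per key; values: one vector per key."""
    logits = scores + [sink_logit]            # 2. append the sink as a logit column
    mx = max(logits)
    ex = [math.exp(v - mx) for v in logits]
    total = sum(ex)
    probs = [e / total for e in ex]           # 3. softmax over keys + sink
    probs = probs[:-1]                        # 4. drop the sink column
    dim = len(values[0])
    out = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]
    return probs, out                         # 5. weights @ V


probs, out = attend_with_sink([0.0, 0.0], 0.0, [[3.0], [3.0]])
```

Because the sink takes part in the softmax but not in the weighted sum, the remaining weights add up to less than 1 (here two thirds).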
daed594902 CRITICAL FIX: Add missing q_b_norm (unweighted RMSNorm after q_b_proj)
The HuggingFace reference (DeepseekV4ForCausalLM) applies an unweighted
RMSNorm after q_b_proj, normalizing Q before attention. Without it, Q
magnitudes are too large, causing attention scores to collapse to uniform
(entropy ~3.2 with 24 positions) and the model to produce garbage.

q_b_norm has no learnable parameters — just q / RMS(q).
This explains the nearly-uniform attention weights we've been seeing.
2026-05-31 11:47:16 +00:00
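
A small sketch of the two quantities in this message: the parameter-free norm `q / RMS(q)` and the attention entropy used as a diagnostic. The `eps` value and both helper names are assumptions. Measured in nats, a perfectly uniform distribution over 24 positions has entropy ln(24), roughly 3.18, which is consistent with the ~3.2 reported above.

```python
import math


def unweighted_rms_norm(q, eps=1e-6):
    """RMSNorm with no learnable weight: q / sqrt(mean(q^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in q) / len(q) + eps)
    return [v / rms for v in q]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


q_normed = unweighted_rms_norm([3.0, 4.0])
uniform_entropy = entropy([1.0 / 24] * 24)
```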
dd50c355a6 Fix MHC_DIAG null check when SKIP_MHC is enabled 2026-05-31 11:37:32 +00:00
631e6ea3e4 Add --skip-mhc flag for simple residual diagnostic
When enabled, bypasses mHC pre/post blocks and uses direct residual
connections with 0.1 scaling. This helps isolate whether the mHC
implementation is causing the garbage output.
2026-05-31 11:33:41 +00:00
d201a9334e CRITICAL FIX: Add YaRN RoPE scaling (factor=16)
The DSV4 Pro model uses rope_type='yarn' with factor=16. Our
build_rope_cache was using standard RoPE with theta=10000, completely
ignoring YaRN scaling. This produced wrong cos/sin values for all
positions, causing incorrect attention scores and garbage output.

YaRN modifies the RoPE frequencies:
- High-frequency components: unchanged
- Low-frequency components: scaled by 1/factor
- Medium: smooth interpolation

Config: factor=16, beta_fast=32, beta_slow=1, orig_max_pos=65536
2026-05-31 11:25:52 +00:00
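
As a rough illustration of the frequency blend described above, the sketch below follows the commonly used YaRN recipe (a linear ramp between two correction dimensions). `factor`, `beta_fast`, `beta_slow` and `orig_max_pos` come from the config line above; `dim=64` and `base=10000` are assumptions, and the function is not the repository's `build_rope_cache`. YaRN's attention-scale factor is left out.

```python
import math


def yarn_inv_freq(dim=64, base=10000.0, factor=16.0,
                  beta_fast=32.0, beta_slow=1.0, orig_max_pos=65536):
    """Per-pair inverse frequencies with YaRN scaling (hypothetical helper)."""
    def correction_dim(num_rotations):
        # Pair index whose wavelength fits `num_rotations` times in the
        # original context window.
        return (dim * math.log(orig_max_pos / (num_rotations * 2 * math.pi))
                / (2 * math.log(base)))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001                    # avoid division by zero in the ramp

    inv_freq = []
    for i in range(dim // 2):
        plain = base ** (-2 * i / dim)   # standard RoPE frequency
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        # ramp == 0: unchanged (high frequency); ramp == 1: divided by factor.
        inv_freq.append(plain * (1 - ramp) + (plain / factor) * ramp)
    return inv_freq


freqs = yarn_inv_freq()
plain_freqs = [10000.0 ** (-2 * i / 64) for i in range(32)]
```

Comparing `freqs` with `plain_freqs` shows the leading (high-frequency) pairs untouched and the trailing pairs pulled towards `plain / factor`.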
88719f39b4 Add single-layer trace (Phase 2.6) for detailed debugging 2026-05-31 11:20:46 +00:00
8256e23aed Fix mHCContext attribute access (not tuple unpacking) and enable attention diag 2026-05-31 11:10:37 +00:00
72c139a59f Enable MHC_DIAG for diagnostic run 2026-05-31 11:07:23 +00:00
cd661c2e40 Add attention and Q/KV diagnostics (MHC_DIAG flag) 2026-05-31 11:07:17 +00:00
9584fcbc23 Fix top5_ids variable name in decode logging 2026-05-31 10:54:40 +00:00
a6d56d10ca Add top-20 logging and thinking token detection in decode loop 2026-05-31 10:49:28 +00:00
d891ae7e96 Fix prompt format: use DeepSeek V4 chat tokens
The model was trained with DeepSeek-specific chat tokens:
  <|User|> (128803), <|Assistant|> (128804), <|EOT|> (128805)
  Thinking: fi (128821), fl (128822)

Previous manual assembly just concatenated raw text without these tokens,
causing the model to not recognize user/assistant boundaries.

Format: <BOS><|User|>system prompt\n\nuser prompt<|Assistant|>
2026-05-31 10:33:41 +00:00
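
The format line above could be assembled along these lines. The two token ids are the ones quoted in this message; `encode` and `bos_id` are placeholders for the real tokenizer, and dropping the blank-line separator when the system prompt is empty is an assumption.

```python
USER_ID = 128803        # <|User|>
ASSISTANT_ID = 128804   # <|Assistant|>


def build_prompt_ids(encode, bos_id, system_prompt, user_prompt):
    """<BOS><|User|>system prompt\\n\\nuser prompt<|Assistant|> as token ids.

    encode -- hypothetical callable mapping plain text to a list of token ids
    bos_id -- hypothetical BOS id supplied by the caller's tokenizer
    """
    if system_prompt:
        text = f"{system_prompt}\n\n{user_prompt}"
    else:
        text = user_prompt
    return [bos_id, USER_ID] + list(encode(text)) + [ASSISTANT_ID]


# Toy encoder for demonstration only: one "token" per character.
toy_encode = lambda s: [ord(c) for c in s]
ids = build_prompt_ids(toy_encode, 0, "", "Hi")
```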
f86742ef8e Cache layer weights on GPU — eliminates per-token CPU→GPU transfer
Previously, each prefill/decode token re-transferred ALL layer weights
from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made
prefill ~36s/token and caused the test to appear stuck.

Now: one-time cache_all_layer_weights() loads all 61 layers to their
target GPUs. Prefill should be ~1-2s/token instead of ~36s.

Also added flush=True to print statements so progress is visible.
2026-05-31 10:28:25 +00:00
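
The caching idea can be sketched without any GPU code: load each layer once, then serve every later request from a dictionary. The class and the `load_layer` callback are hypothetical stand-ins for `cache_all_layer_weights()` and the actual host-to-device copy.

```python
class LayerWeightCache:
    """Loads each layer's weights once and reuses them for every token."""

    def __init__(self, load_layer):
        self._load_layer = load_layer   # hypothetical: performs the CPU->GPU copy
        self._cache = {}
        self.transfers = 0              # counts real loads, for comparison

    def cache_all(self, n_layers):
        for li in range(n_layers):
            self.get(li)

    def get(self, li):
        if li not in self._cache:
            self._cache[li] = self._load_layer(li)
            self.transfers += 1
        return self._cache[li]


N_TOKENS, N_LAYERS = 66, 61
cache = LayerWeightCache(load_layer=lambda li: f"weights-{li}")
cache.cache_all(N_LAYERS)
for _ in range(N_TOKENS):
    for li in range(N_LAYERS):
        cache.get(li)                   # served from the cache, no new transfer

uncached_transfers = N_TOKENS * N_LAYERS   # 66 x 61 = 4026, as in the message
```

After the run `cache.transfers` is 61, compared with 4026 transfers when every token reloads every layer.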
ce3d6069cc CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post]
All three mHC parameter tensors (fn, base, scale) share the same ordering
as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)].

Previous code loaded base as [pre(4), post(4), res(16)] and scale as
[alpha_pre, alpha_post, alpha_res] — swapping S_res and S_post, and
alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to
be computed with wrong bias values, allowing the residual to explode.

Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums,
C_l values) to verify doubly-stochastic constraint is satisfied.
2026-05-31 10:07:14 +00:00
9a43e9aa77 CRITICAL FIX: mHC fn weight row ordering was wrong
fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw]
in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which
shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the
Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the
residual to explode (|X| 0.8 → 160K across 61 layers).

Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post
Wrong:   fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
2026-05-31 10:02:57 +00:00
0346e479d4 Add system prompt, CLI args, inverse RoPE flag, minimal e2e test
- System prompt added via chat template (reasoning model needs instructions)
- MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens)
- --no-inverse-rope flag to test without inverse RoPE on attn output
- --skip-moe flag to debug with shared expert only
- --max-tokens and --prompt CLI overrides
- minimal_e2e_test(): processes 'The' through full model, checks logits,
  tracks per-layer residual stream evolution, reports NaN/Inf/spread
- INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims,
  first 448 always un-RoPE'd, relative encoding may be intentional
2026-05-31 09:56:18 +00:00
429fc3db40 Fix expert weight indexing for 1D tensor 2026-05-31 09:23:10 +00:00
33004dcbf4 Fix expert weight broadcasting (wt.item() for scalar multiply) 2026-05-31 09:22:27 +00:00