Commit Graph

1915 Commits

Author SHA1 Message Date
cd661c2e40 Add attention and Q/KV diagnostics (MHC_DIAG flag) 2026-05-31 11:07:17 +00:00
9584fcbc23 Fix top5_ids variable name in decode logging 2026-05-31 10:54:40 +00:00
a6d56d10ca Add top-20 logging and thinking token detection in decode loop 2026-05-31 10:49:28 +00:00
d891ae7e96 Fix prompt format: use DeepSeek V4 chat tokens
The model was trained with DeepSeek-specific chat tokens:
  <|User|> (128803), <|Assistant|> (128804), <|EOT|> (128805)
  Thinking: fi (128821), fl (128822)

Previous manual assembly just concatenated raw text without these tokens,
causing the model to not recognize user/assistant boundaries.

Format: <BOS><|User|>system prompt\n\nuser prompt<|Assistant|>
2026-05-31 10:33:41 +00:00
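
The format described in this commit can be sketched as plain string assembly. The token strings come from the commit message; the helper name `build_prompt` and the literal `"<BOS>"` placeholder are illustrative assumptions. A real pipeline would presumably emit the special-token IDs rather than their text.

```python
# Token strings as quoted in the commit message. "<BOS>" is a placeholder
# for whatever beginning-of-sequence token the tokenizer actually uses.
BOS = "<BOS>"
USER = "<|User|>"
ASSISTANT = "<|Assistant|>"


def build_prompt(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical helper: system and user text share the single User turn."""
    return f"{BOS}{USER}{system_prompt}\n\n{user_prompt}{ASSISTANT}"


prompt = build_prompt("You are a helpful assistant.", "What is 2 + 2?")
```

The string ends with the assistant token so that generation starts inside the assistant turn.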
f86742ef8e Cache layer weights on GPU — eliminates per-token CPU→GPU transfer
Previously, each prefill/decode token re-transferred ALL layer weights
from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made
prefill ~36s/token and caused the test to appear stuck.

Now: one-time cache_all_layer_weights() loads all 61 layers to their
target GPUs. Prefill should be ~1-2s/token instead of ~36s.

Also added flush=True to print statements so progress is visible.
2026-05-31 10:28:25 +00:00
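
A minimal sketch of the load-once idea, counting transfers instead of moving real tensors. `LayerWeightCache` and `fake_load` are hypothetical stand-ins, not the repository's `cache_all_layer_weights()`; only the 66-token and 61-layer figures come from the commit.

```python
N_TOKENS, N_LAYERS = 66, 61  # figures quoted in the commit message


def fake_load(layer_idx):
    """Stand-in for a CPU-to-GPU copy of one layer's weights."""
    return f"weights-for-layer-{layer_idx}"


class LayerWeightCache:
    """Loads each layer at most once and counts how many loads happened."""

    def __init__(self, load_fn):
        self._load_fn = load_fn
        self._cache = {}
        self.transfers = 0

    def get(self, layer_idx):
        if layer_idx not in self._cache:
            self._cache[layer_idx] = self._load_fn(layer_idx)
            self.transfers += 1
        return self._cache[layer_idx]


# Old behaviour: every token re-transfers every layer.
uncached_transfers = 0
for _ in range(N_TOKENS):
    for layer in range(N_LAYERS):
        fake_load(layer)
        uncached_transfers += 1

# New behaviour: the first token fills the cache, later tokens reuse it.
cache = LayerWeightCache(fake_load)
for _ in range(N_TOKENS):
    for layer in range(N_LAYERS):
        cache.get(layer)
```

With these counts the uncached loop performs 66 × 61 transfers while the cached loop performs one per layer.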
ce3d6069cc CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post]
All three mHC parameter tensors (fn, base, scale) share the same ordering
as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)].

Previous code loaded base as [pre(4), post(4), res(16)] and scale as
[alpha_pre, alpha_post, alpha_res] — swapping S_res and S_post, and
alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to
be computed with wrong bias values, allowing the residual to explode.

Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums,
C_l values) to verify doubly-stochastic constraint is satisfied.
2026-05-31 10:07:14 +00:00
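
The doubly-stochastic check mentioned above can be illustrated with a small Sinkhorn-Knopp sketch. It assumes the 16 `res` entries form a 4×4 matrix and that positivity comes from an exponential; the function names and the example logits are illustrative, not taken from the repository.

```python
import math


def sinkhorn_knopp(logits, n_iters=50):
    """Alternately normalise rows and columns of exp(logits)."""
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(n_iters):
        # Row step: every row sums to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: every column sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def max_sum_deviation(m):
    """Largest distance of any row or column sum from 1."""
    n = len(m)
    rows = [abs(sum(row) - 1.0) for row in m]
    cols = [abs(sum(m[i][j] for i in range(n)) - 1.0) for j in range(n)]
    return max(rows + cols)


logits = [
    [0.3, -0.2, 0.1, 0.0],
    [-0.5, 0.4, 0.2, -0.1],
    [0.0, 0.1, -0.3, 0.6],
    [0.2, -0.4, 0.5, 0.1],
]
B = sinkhorn_knopp(logits)
deviation = max_sum_deviation(B)
```

A diagnostic in the spirit of `MHC_DIAG` could log `deviation` per layer; a value far from zero would suggest the matrix was built from the wrong inputs or with too few iterations.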
9a43e9aa77 CRITICAL FIX: mHC fn weight row ordering was wrong
fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw]
in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which
shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the
Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the
residual to explode (|X| 0.8 → 160K across 61 layers).

Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post
Wrong:   fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
2026-05-31 10:02:57 +00:00
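
The two row layouts in this commit can be compared with plain slicing. Row indices stand in for the actual weight rows, and the function names are illustrative.

```python
# Stand-in for the 24 rows of fn: each entry is just its own row index.
fn_rows = list(range(24))


def split_correct(rows):
    """Layout stated as correct: [W_pre(4), W_res(16), W_post(4)]."""
    return {"pre": rows[0:4], "res": rows[4:20], "post": rows[20:24]}


def split_wrong(rows):
    """Layout stated as wrong: [W_pre(4), W_post(4), W_res(16)]."""
    return {"pre": rows[0:4], "post": rows[4:8], "res": rows[8:24]}


correct = split_correct(fn_rows)
wrong = split_wrong(fn_rows)

# How far the wrong layout shifts the start of the W_res block.
res_shift = wrong["res"][0] - correct["res"][0]
```

Under the wrong layout, `W_post` receives the first four rows that belong to `W_res`, and `W_res` starts four rows late, which matches the shift described in the commit.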
0346e479d4 Add system prompt, CLI args, inverse RoPE flag, minimal e2e test
- System prompt added via chat template (reasoning model needs instructions)
- MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens)
- --no-inverse-rope flag to test without inverse RoPE on attn output
- --skip-moe flag to debug with shared expert only
- --max-tokens and --prompt CLI overrides
- minimal_e2e_test(): processes 'The' through full model, checks logits,
  tracks per-layer residual stream evolution, reports NaN/Inf/spread
- INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims,
  first 448 always un-RoPE'd, relative encoding may be intentional
2026-05-31 09:56:18 +00:00
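
A sketch of the partial RoPE behaviour noted in the last bullet: only the last 64 of 512 dimensions are rotated, and a rotation by the negated angle undoes it. The interleaved (even, odd) pairing follows a later commit in this log; the frequency base of 10000 and the function name are assumptions, and this does not reproduce where the model applies the inverse.

```python
import math

HEAD_DIM, ROPE_DIM = 512, 64  # from the commit: last 64 of 512 dims
ROPE_BASE = 10000.0           # assumed value, not stated in the log


def rotate_tail(x, pos, sign=1.0):
    """Rotate interleaved (even, odd) pairs in the last ROPE_DIM dims.

    sign=1.0 applies RoPE at position `pos`; sign=-1.0 applies the inverse.
    """
    out = list(x)
    start = HEAD_DIM - ROPE_DIM
    for i in range(ROPE_DIM // 2):
        theta = sign * pos * ROPE_BASE ** (-2.0 * i / ROPE_DIM)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[start + 2 * i], x[start + 2 * i + 1]
        out[start + 2 * i] = a * c - b * s
        out[start + 2 * i + 1] = a * s + b * c
    return out


x = [math.sin(0.1 * k) for k in range(HEAD_DIM)]
y = rotate_tail(x, pos=7)
x_back = rotate_tail(y, pos=7, sign=-1.0)
round_trip_error = max(abs(p - q) for p, q in zip(x, x_back))
```

The first 448 entries of `y` are identical to `x`, so any effect of skipping the inverse is confined to the rotated tail.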
429fc3db40 Fix expert weight indexing for 1D tensor 2026-05-31 09:23:10 +00:00
33004dcbf4 Fix expert weight broadcasting (wt.item() for scalar multiply) 2026-05-31 09:22:27 +00:00
1434b35971 Add residual diagnostic test — per-layer magnitude tracking 2026-05-31 09:21:41 +00:00
1c18c16c68 Fix production rope.py: FP32 arithmetic for forward_rope_partial + inverse_rope_bf16 2026-05-31 09:17:36 +00:00
970869d017 Fix mHCBlock import + relax RoPE round-trip threshold (BF16 noise expected) 2026-05-31 09:17:07 +00:00
a2ee78b564 Fix RoPE shape bug (interleave needs separate even/odd assembly) 2026-05-31 09:15:59 +00:00
9d96c2fbbf CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip
BF16 cos/sin cache destroys cos²+sin²=1 identity (can be 0.996 in BF16).
This causes ~3% error per RoPE→inverse RoPE round-trip, accumulating
across 61 layers into garbage output. FP32 cache + FP32 arithmetic
gives exact round-trip (diff < 1e-7).

Also fixes: MoE expert loop indentation (was only running last expert).
2026-05-31 09:14:59 +00:00
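
The precision issue can be reproduced without a GPU by emulating BF16 rounding on Python floats. This is a standalone sketch; `to_bf16` rounds an FP32 bit pattern to its top 16 bits (round to nearest even), which is assumed to match what a BF16 cos/sin cache would store.

```python
import math
import struct


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a Python float to BF16 by keeping the top 16 bits of its FP32 form."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 bits that BF16 drops.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def identity_error(theta, cast):
    """|cos^2 + sin^2 - 1| after storing cos and sin through `cast`."""
    c, s = cast(math.cos(theta)), cast(math.sin(theta))
    return abs(c * c + s * s - 1.0)


err_bf16 = identity_error(1.0, to_bf16)
err_fp32 = identity_error(1.0, to_fp32)
```

At `theta = 1.0` the BF16 error is several orders of magnitude larger than the FP32 error, which is why a rotation followed by its inverse no longer returns the input when the cache is stored in BF16.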
db74a887ab Add minimal e2e test + fix MoE expert loop bug (indentation) 2026-05-31 09:14:03 +00:00
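
One way an indentation slip can make a mixture-of-experts loop use only the last expert is sketched below. The actual code in the repository is not shown in this log, so both functions and the toy experts are hypothetical.

```python
def moe_buggy(x, experts, weights):
    """Accumulation is dedented, so it runs once after the loop ends."""
    out = 0.0
    for expert, w in zip(experts, weights):
        y = expert(x)
    out += w * y  # only the last expert and weight are still bound here
    return out


def moe_fixed(x, experts, weights):
    """Accumulation sits inside the loop, so every expert contributes."""
    out = 0.0
    for expert, w in zip(experts, weights):
        y = expert(x)
        out += w * y
    return out


# Toy experts and routing weights, for illustration only.
experts = [lambda v: v + 1.0, lambda v: 2.0 * v, lambda v: v * v]
weights = [0.5, 0.3, 0.2]

buggy_out = moe_buggy(3.0, experts, weights)
fixed_out = moe_fixed(3.0, experts, weights)
```

Both versions run without error, which is what makes this kind of bug easy to miss without an end-to-end check on the output.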
e195d9d3a7 add SKIP_ROUTED_MOE debug flag, re-enable sinks 2026-05-31 07:02:38 +00:00
4f28673bec debug: disable sinks in SDPA to check |X| impact 2026-05-31 06:51:58 +00:00
e3db90b56c switch back to original prompt 2026-05-31 06:40:01 +00:00
d2cf5ccc32 CRITICAL FIX: use SDPA for short sequences (FMHA padding bug)
FMHA pads N to next multiple of 128. For N<<128 (like 5 tokens),
the 123 padded zero-K entries contribute exp(0)=1 to the softmax
denominator, diluting real attention weights by ~128/5 = 25.6x.

This caused the model to produce incoherent output for short prompts.

Fix: use SDPA for seq_len < 120 (no padding), FMHA for longer
sequences where the padding effect is negligible.

Also: SDPA path includes attention sinks (paper D5c), FMHA path
uses analytic sink correction via LSE.
2026-05-31 06:39:23 +00:00
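
The dilution figure follows from a softmax over zero-padded scores. The sketch below uses equal real scores so the ratio is exact; the threshold of 120 comes from the commit, while `pick_attention_backend` and `real_mass` are illustrative names.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def real_mass(real_scores, padded_len):
    """Total softmax weight on real tokens when padding scores are 0."""
    padded = real_scores + [0.0] * (padded_len - len(real_scores))
    return sum(softmax(padded)[: len(real_scores)])


def pick_attention_backend(seq_len, threshold=120):
    """Hypothetical dispatcher: SDPA for short sequences, FMHA otherwise."""
    return "sdpa" if seq_len < threshold else "fmha"


real_scores = [0.0] * 5  # 5 real tokens with equal scores
mass_unpadded = real_mass(real_scores, 5)
mass_padded = real_mass(real_scores, 128)  # 123 padded entries, exp(0) = 1 each
dilution = mass_unpadded / mass_padded
```

With unequal real scores the ratio differs from 128/5, but the padded entries still take weight away from the real tokens unless they are masked out.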
5f98855141 test with simpler prompt 2026-05-31 06:28:45 +00:00
152af7295a debug: compare FMHA vs SDPA output at layer 0 2026-05-31 06:16:58 +00:00
59c75ca4e9 fix: cast attn_out back to BF16 after sink correction 2026-05-31 06:07:06 +00:00
e5245ea34e fix: V tensor must be (B, n_h, hd, N) for FMHA — was transposed wrong 2026-05-31 06:03:13 +00:00
91abf0f921 FMHA + analytic sink bias correction using LSE
Instead of SDPA with virtual sink position, use the production FMHA
kernel and apply the sink bias as a post-hoc correction on the output.

The correction is: O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink))

This simulates the attention sink (paper D5c) without modifying the
FMHA kernel. The sink absorbs some attention mass, reducing the
normalization constant and scaling down the output.
2026-05-31 05:58:01 +00:00
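
The correction formula can be checked against an explicit sink on a single query with scalar values. This sketch assumes the sink behaves as one extra logit whose value is zero; the function names are illustrative.

```python
import math


def attention_with_lse(scores, values):
    """Softmax-weighted sum plus the log-sum-exp of the scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    out = sum(e * v for e, v in zip(exps, values)) / total
    return out, m + math.log(total)


def apply_sink_correction(o_raw, lse, sink):
    """O_raw * exp(lse) / (exp(lse) + exp(sink)), written in sigmoid form."""
    return o_raw / (1.0 + math.exp(sink - lse))


def attention_with_explicit_sink(scores, values, sink):
    """Reference: append the sink as an extra logit with a zero value."""
    out, _ = attention_with_lse(scores + [sink], values + [0.0])
    return out


scores = [0.2, -1.0, 0.7]
values = [1.0, 2.0, -0.5]
sink = 0.3

o_raw, lse = attention_with_lse(scores, values)
o_corrected = apply_sink_correction(o_raw, lse, sink)
o_reference = attention_with_explicit_sink(scores, values, sink)
```

The sigmoid form avoids evaluating `exp(lse)` directly, which could overflow for large scores.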
fac269c938 fix verify_attention: proper multi-head SDPA + GQA 2026-05-31 05:55:10 +00:00
2333fc8b4b fix verify_attention.py: proper nvfp4_linear calls 2026-05-31 05:53:49 +00:00
c09f68c867 add verify_attention.py: single-layer attention component test 2026-05-31 05:51:36 +00:00
04dd7545b3 switch to production FMHA for full run 2026-05-31 04:51:16 +00:00
738088cf49 revert: K=V with RoPE + inverse RoPE is the correct DSV4 approach 2026-05-31 04:51:10 +00:00
781ee43521 try separate K (RoPE'd) and V (raw) — no inverse RoPE needed 2026-05-31 04:46:14 +00:00
889521009b re-enable inverse RoPE (confirmed necessary — without it output is garbage) 2026-05-31 04:45:58 +00:00
92e465ca04 debug: disable inverse RoPE to check impact on output 2026-05-31 04:40:34 +00:00
c69dc51b3b switch to SDPA with sinks (better residual control) 2026-05-31 04:38:41 +00:00
3ed8f3cc44 switch back to production FMHA kernel (with FP4 LUT fix) 2026-05-31 04:32:01 +00:00
ae79bd8fce debug: add top-5 logit predictions 2026-05-31 04:25:01 +00:00
aafe2eee12 CRITICAL FIX: FP4 LUT was 4x too large!
E2M1 magnitudes are [0, 0.5, 1, 1.5, 2, 3, 4, 6] NOT [0, 2, 3, 4, 6, 8, 12, 24].
The old LUT was 4x the correct values, causing every NVFP4 dequantized
weight to be 4x too large. This compounded across 61 layers, causing
the residual stream to explode and producing gibberish output.

This is the root cause of the residual growth and incoherent generation.
2026-05-31 04:16:13 +00:00
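
The correct magnitudes can be derived from the E2M1 bit layout (2 exponent bits with bias 1, 1 mantissa bit) rather than typed in by hand. This sketch decodes a single 4-bit code and leaves out the block scales that NVFP4 applies on top; the function names are illustrative.

```python
def e2m1_magnitude(code: int) -> float:
    """Decode the 3 magnitude bits (2 exponent, 1 mantissa) of an E2M1 value."""
    exp = (code >> 1) & 0b11
    man = code & 0b1
    if exp == 0:
        return 0.5 * man  # subnormal: 0 or 0.5
    return (1.0 + 0.5 * man) * 2.0 ** (exp - 1)  # normal, exponent bias 1


def decode_fp4(nibble: int) -> float:
    """Decode a full 4-bit code: the top bit is the sign."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * e2m1_magnitude(nibble & 0b0111)


E2M1_LUT = [e2m1_magnitude(c) for c in range(8)]
```

Generating the table this way gives a unit test something independent to compare a hard-coded LUT against.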
b8c8da91fe fix: restore RoPE functions that were lost during mHC refactor 2026-05-31 04:10:51 +00:00
3f04a72af4 refactor: use production mHCLayer from dsv4.layers.mhc
Replace custom mHCBlock with wrapper around the tested production
mHCLayer class. This eliminates any bugs in my custom implementation
and uses the same code path that the model was designed for.

Weight mapping: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
base[0:4]=S_pre, base[4:8]=S_post, base[8:24]=S_res
scale[0]=alpha_pre, scale[1]=alpha_post, scale[2]=alpha_res
2026-05-31 04:06:58 +00:00
b519108cab fix: restore kv_cache.append that was accidentally removed 2026-05-31 03:56:58 +00:00
22a89b5a45 add attention sinks to SDPA path (paper D5c) 2026-05-31 03:52:59 +00:00
1905f19b8d fix: define q_input before USE_SDPA branch 2026-05-31 03:45:09 +00:00
cd073ad867 use PyTorch SDPA for correctness (no sink bias in FMHA kernel yet) 2026-05-31 03:42:03 +00:00
171a9e0d10 disable diagnostics for clean production run 2026-05-31 03:32:17 +00:00
3f9b441428 diag: fix n_layers reference in forward_layer, add late-layer diags 2026-05-31 03:28:53 +00:00
5b834a0599 diag: add late-layer diagnostics, fix ffn ctx variable 2026-05-31 03:25:55 +00:00
690c0a1121 CRITICAL FIX: mHC base/scale ordering was wrong
Checkpoint order is [pre, post, res] not [pre, res, post]:
- base[0:4] = S_pre, base[4:8] = S_post, base[8:24] = S_res
- scale[0] = alpha_pre, scale[1] = alpha_post, scale[2] = alpha_res
- W_stacked rows: [W_pre(4), W_post(4), W_res(16)]
- Projection split: A_raw=proj[:,0:4], C_raw=proj[:,4:8], B_raw=proj[:,8:24]

This was causing B_l to be near-identity and C_l to be near-2.0,
leading to exponential residual stream growth.
2026-05-31 03:16:07 +00:00
c3a2656c48 diag: add FFN and pre_block diagnostics 2026-05-31 03:12:52 +00:00
79ba7e6636 diag: add mHC diagnostics for first 3 layers 2026-05-31 03:10:05 +00:00
a262492e51 fix: FMHA K/V tensor shape (was permuting cache), add q_a_norm and kv_norm 2026-05-31 03:04:53 +00:00