Commit Graph

1893 Commits

Author SHA1 Message Date
59c75ca4e9 fix: cast attn_out back to BF16 after sink correction 2026-05-31 06:07:06 +00:00
e5245ea34e fix: V tensor must be (B, n_h, hd, N) for FMHA — was transposed wrong 2026-05-31 06:03:13 +00:00
91abf0f921 FMHA + analytic sink bias correction using LSE
Instead of SDPA with virtual sink position, use the production FMHA
kernel and apply the sink bias as a post-hoc correction on the output.

The correction is: O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink))

This simulates the attention sink (paper D5c) without modifying the
FMHA kernel. The sink absorbs some attention mass, reducing the
normalization constant and scaling down the output.
2026-05-31 05:58:01 +00:00
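Note on 91abf0f921: the correction formula can be checked against a reference softmax that includes one extra sink logit with a zero value vector. The sketch below does this for a single query and a single head in pure Python. The function names are hypothetical and not taken from the repository, and a real kernel would apply the same scale per head and per query position.

```python
import math

def attention_with_sink(scores, values, sink):
    """Reference: softmax over the scores plus one sink logit whose value is zero."""
    m = max(max(scores), sink)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights) + math.exp(sink - m)  # the sink only enlarges the denominator
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]

def attention_raw_with_lse(scores, values):
    """Plain softmax attention that also returns the log-sum-exp of the scores."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    lse = m + math.log(total)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, lse

def apply_sink_correction(out, lse, sink):
    """O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink)), written as sigmoid(lse - sink)."""
    scale = 1.0 / (1.0 + math.exp(sink - lse))
    return [o * scale for o in out]
```

Writing the scale as sigmoid(lse - sink) avoids evaluating exp(lse) directly, which can overflow for large scores.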
fac269c938 fix verify_attention: proper multi-head SDPA + GQA 2026-05-31 05:55:10 +00:00
2333fc8b4b fix verify_attention.py: proper nvfp4_linear calls 2026-05-31 05:53:49 +00:00
c09f68c867 add verify_attention.py: single-layer attention component test 2026-05-31 05:51:36 +00:00
04dd7545b3 switch to production FMHA for full run 2026-05-31 04:51:16 +00:00
738088cf49 revert: K=V with RoPE + inverse RoPE is the correct DSV4 approach 2026-05-31 04:51:10 +00:00
781ee43521 try separate K (RoPE'd) and V (raw) — no inverse RoPE needed 2026-05-31 04:46:14 +00:00
889521009b re-enable inverse RoPE (confirmed necessary — without it output is garbage) 2026-05-31 04:45:58 +00:00
92e465ca04 debug: disable inverse RoPE to check impact on output 2026-05-31 04:40:34 +00:00
c69dc51b3b switch to SDPA with sinks (better residual control) 2026-05-31 04:38:41 +00:00
3ed8f3cc44 switch back to production FMHA kernel (with FP4 LUT fix) 2026-05-31 04:32:01 +00:00
ae79bd8fce debug: add top-5 logit predictions 2026-05-31 04:25:01 +00:00
aafe2eee12 CRITICAL FIX: FP4 LUT was 4x too large!
E2M1 magnitudes are [0, 0.5, 1, 1.5, 2, 3, 4, 6] NOT [0, 2, 3, 4, 6, 8, 12, 24].
The old LUT was 4x the correct values, causing every NVFP4 dequantized
weight to be 4x too large. This compounded across 61 layers, causing
the residual stream to explode and producing gibberish output.

This is the root cause of the residual growth and incoherent generation.
2026-05-31 04:16:13 +00:00
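Note on aafe2eee12: the corrected magnitudes follow from the E2M1 bit layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit). The sketch below decodes a single 4-bit code two ways, by table and by bit fields, so the table can be cross-checked. The function names are hypothetical, and NVFP4 block scales are deliberately left out.

```python
# Magnitudes for the low three bits of an E2M1 code, as listed in the commit message.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(code):
    """Table lookup: bit 3 is the sign, bits 0-2 select the magnitude."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]

def decode_e2m1_from_fields(code):
    """Same value derived from the exponent and mantissa fields."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0x8 else 1.0
    exponent = (code >> 1) & 0x3
    mantissa = code & 0x1
    if exponent == 0:
        magnitude = 0.5 * mantissa  # subnormal: 0 or 0.5
    else:
        magnitude = 2.0 ** (exponent - 1) * (1.0 + 0.5 * mantissa)
    return sign * magnitude
```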
b8c8da91fe fix: restore RoPE functions that were lost during mHC refactor 2026-05-31 04:10:51 +00:00
3f04a72af4 refactor: use production mHCLayer from dsv4.layers.mhc
Replace custom mHCBlock with wrapper around the tested production
mHCLayer class. This eliminates any bugs in my custom implementation
and uses the same code path that the model was designed for.

Weight mapping: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
base[0:4]=S_pre, base[4:8]=S_post, base[8:24]=S_res
scale[0]=alpha_pre, scale[1]=alpha_post, scale[2]=alpha_res
2026-05-31 04:06:58 +00:00
b519108cab fix: restore kv_cache.append that was accidentally removed 2026-05-31 03:56:58 +00:00
22a89b5a45 add attention sinks to SDPA path (paper D5c) 2026-05-31 03:52:59 +00:00
1905f19b8d fix: define q_input before USE_SDPA branch 2026-05-31 03:45:09 +00:00
cd073ad867 use PyTorch SDPA for correctness (no sink bias in FMHA kernel yet) 2026-05-31 03:42:03 +00:00
171a9e0d10 disable diagnostics for clean production run 2026-05-31 03:32:17 +00:00
3f9b441428 diag: fix n_layers reference in forward_layer, add late-layer diags 2026-05-31 03:28:53 +00:00
5b834a0599 diag: add late-layer diagnostics, fix ffn ctx variable 2026-05-31 03:25:55 +00:00
690c0a1121 CRITICAL FIX: mHC base/scale ordering was wrong
Checkpoint order is [pre, post, res] not [pre, res, post]:
- base[0:4] = S_pre, base[4:8] = S_post, base[8:24] = S_res
- scale[0] = alpha_pre, scale[1] = alpha_post, scale[2] = alpha_res
- W_stacked rows: [W_pre(4), W_post(4), W_res(16)]
- Projection split: A_raw=proj[:,0:4], C_raw=proj[:,4:8], B_raw=proj[:,8:24]

This was causing B_l to be near-identity and C_l to be near-2.0,
leading to exponential residual stream growth.
2026-05-31 03:16:07 +00:00
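Note on 690c0a1121: a minimal sketch of the [pre, post, res] split described in the commit message, with n_hc = 4. The function name, the dictionary keys, and the row-major reshape of the 16 res entries into a 4x4 matrix are assumptions for illustration, not the repository's code.

```python
def split_mhc_params(base, scale, n_hc=4):
    """Split the checkpoint vectors in [pre, post, res] order."""
    expected = 2 * n_hc + n_hc * n_hc  # 4 + 4 + 16 = 24 for n_hc = 4
    if len(base) != expected:
        raise ValueError(f"expected {expected} base entries, got {len(base)}")
    if len(scale) != 3:
        raise ValueError("expected 3 scale entries: pre, post, res")
    s_pre = base[:n_hc]
    s_post = base[n_hc:2 * n_hc]
    flat_res = base[2 * n_hc:]
    # Assumption: the res block is stored row-major as an n_hc x n_hc matrix.
    s_res = [flat_res[r * n_hc:(r + 1) * n_hc] for r in range(n_hc)]
    alpha_pre, alpha_post, alpha_res = scale
    return {
        "S_pre": s_pre, "S_post": s_post, "S_res": s_res,
        "alpha_pre": alpha_pre, "alpha_post": alpha_post, "alpha_res": alpha_res,
    }
```

Swapping the post and res slices silently feeds 4 of the 16 res entries into the post gate, which fits the symptoms the commit describes.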
c3a2656c48 diag: add FFN and pre_block diagnostics 2026-05-31 03:12:52 +00:00
79ba7e6636 diag: add mHC diagnostics for first 3 layers 2026-05-31 03:10:05 +00:00
a262492e51 fix: FMHA K/V tensor shape (was permuting cache), add q_a_norm and kv_norm 2026-05-31 03:04:53 +00:00
3f12bbc374 fix: move positions tensor to correct GPU for RoPE 2026-05-31 02:54:47 +00:00
0c3d168c60 single_shot: stream weights per-layer from CPU, fix KV/RoPE logic 2026-05-31 02:53:40 +00:00
61160ace13 fix: expert_weights/ids scoping in hash routing path 2026-05-31 02:50:32 +00:00
d772885d7e single_shot_inference: proper mHC+RMSNorm+inverse RoPE pipeline
Major rewrite of single_shot_inference.py:
- Replace broken mHC (gentle normalization hack) with proper Sinkhorn-Knopp
- Add RMSNorm before each sub-block (attention + FFN)
- Add inverse RoPE on attention output (paper §2.3.3)
- Fix KV cache: RoPE applied before caching, K=V in DSV4 MQA
- Fix MoE: proper dense routing with e_bias, SwiGLU clamping
- Proper weight mapping: fn→W_stacked, base→S_pre/S_res/S_post, scale→alphas
- Add identity mHC fallback when weights missing
- No emergency normalization, no bandaids
2026-05-31 02:45:52 +00:00
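Note on d772885d7e: inverse RoPE is the same rotation applied with the opposite angle, so it undoes the position-dependent rotation that K=V carries into the attention output. The sketch below uses the adjacent-pair convention and base 10000, both of which are assumptions. The checkpoint may pair dimensions differently or rotate only part of the head dimension.

```python
import math

def rope_rotate(x, position, base=10000.0, inverse=False):
    """Rotate pairs (x[2i], x[2i+1]) by angle position * base**(-2i/d)."""
    d = len(x)
    if d % 2:
        raise ValueError("head dimension must be even")
    out = [0.0] * d
    for i in range(d // 2):
        angle = position * base ** (-2.0 * i / d)
        if inverse:
            angle = -angle  # inverse RoPE: rotate back by the same angle
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```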
523b0e47b1 Add gentle RMSNorm: only clamps when values exceed unit norm 2026-05-31 00:31:34 +00:00
dcbb74841a Remove emergency RMSNorm from mHC post_block — MoE provides balance now 2026-05-31 00:27:48 +00:00
1de241ccfe Fix: add all_tokens tracking for decode loop 2026-05-31 00:22:08 +00:00
b1dd59293a Add prefill: process prompt tokens to fill KV cache before decoding 2026-05-31 00:18:55 +00:00
178fb5483a Fix KV cache: use index 0 (one-layer cache per layer instance) 2026-05-31 00:14:58 +00:00
afcc690ddc Add full MoE routing + KV cache to single_shot
MoE:
- Hash routing (first 3 layers): tid2eid lookup → 6 experts, uniform weights
- Dense routing (remaining): sqrt(softplus(gate)) → top-6 → renormalize
- 384 NVFP4 experts, each gate+up+down with SwiGLU clamping
- Weighted combine × routed_scaling_factor + shared expert

KV cache:
- SimpleKVCache: BF16 flat (1, max_seq, hd) per layer
- Appends new K,V each decode step
- FMHA now attends over full cached sequence (not just current token)
- RoPE applied per-position on K cache

This should produce meaningful output — the model now has all
architectural components except proper mHC normalization.
2026-05-31 00:11:15 +00:00
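Note on afcc690ddc: a minimal sketch of the dense routing rule and the combine step as the commit message states them, for one token. The function names are hypothetical. The e_bias term mentioned in a later commit is omitted because this log does not say how it enters the selection.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dense_route(gate_logits, top_k=6):
    """sqrt(softplus(gate)) -> top-k -> renormalize so the kept weights sum to 1."""
    scores = [math.sqrt(softplus(g)) for g in gate_logits]
    expert_ids = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in expert_ids)
    weights = [scores[i] / total for i in expert_ids]
    return expert_ids, weights

def combine(expert_outputs, weights, shared_output, routed_scaling_factor):
    """Weighted combine of routed experts, scaled, plus the shared expert output."""
    dim = len(shared_output)
    return [
        routed_scaling_factor * sum(w * out[d] for w, out in zip(weights, expert_outputs))
        + shared_output[d]
        for d in range(dim)
    ]
```

For the hash-routed layers the same combine step would apply with uniform weights of 1/6 over the six experts returned by the tid2eid lookup.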
3ecfbcba57 Fix T scope in post_block 2026-05-31 00:02:29 +00:00
a493f72681 Add per-residual RMSNorm in mHC post_block (routed MoE missing)
Without routed experts, F_out is always positive, causing unbounded
growth. Emergency RMSNorm on the residual keeps values bounded.
Remove once MoE is wired.
2026-05-30 23:59:19 +00:00
49282fe206 Fix mHC: match vLLM torch reference exactly
Key corrections:
- RMSNorm applied to projection output (mixes *= rsqrt(sqrsum/K + eps))
  not to the input before projection
- comb_mix uses softmax + Sinkhorn, NOT exp + Sinkhorn
- pre_mix = sigmoid(logits) + eps (not matmul with X_l)
- layer_input = sum(pre_mix * residual) — weighted sum, not bmm
- post_mix = sigmoid * hc_post_mult_value (2.0)
- bias split: [pre(4), post(4), comb(16)] not [pre(4), comb(16), post(4)]
2026-05-30 23:55:27 +00:00
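Note on 49282fe206: a minimal sketch of the "softmax + Sinkhorn" step for the 4x4 comb_mix matrix. A row-wise softmax makes every entry positive, then alternating column and row normalization pushes the matrix toward doubly stochastic. The iteration count, the normalization order, and the function names are assumptions, and the vLLM reference may differ in each.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sinkhorn(logits, n_iters=20):
    """Row softmax, then alternate column and row normalization on a square matrix."""
    mat = [softmax(row) for row in logits]
    n = len(mat)
    for _ in range(n_iters):
        # Columns first: rows already sum to 1 after the softmax.
        col_sums = [sum(mat[r][c] for r in range(n)) for c in range(n)]
        mat = [[mat[r][c] / col_sums[c] for c in range(n)] for r in range(n)]
        # Then restore unit row sums.
        mat = [[v / sum(row) for v in row] for row in mat]
    return mat
```

Using exp in place of softmax changes the starting scale of each row, so after a fixed number of iterations the two variants can give different matrices.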
66a66f8244 Add per-layer NaN tracking for mHC debug 2026-05-30 23:48:32 +00:00
d003c4b7cc Add mHC (Manifold-Constrained Hyper-Connections) to single_shot
- Full mHC pre_block/post_block with Sinkhorn-Knopp normalization
- Dynamic A_l (sigmoid), B_l (Birkhoff polytope), C_l (2*sigmoid)
- Checkpoint: attn_hc.fn (24,28672) + base (24,) + scale (3,)
- Two mHC blocks per layer: attn_hc + ffn_hc
- Removed emergency RMSNorm — mHC handles normalization properly
- X_l: (1, n_hc=4, H) residual state, init from embedding broadcast
2026-05-30 23:45:18 +00:00
f567c20539 Fix: set active CUDA device per layer for BMM/FMHA 2026-05-30 23:39:45 +00:00
7a95983e0f Rewrite single_shot: 8-GPU pipeline parallel
- Loads all 95 shards, assigns layers round-robin across 8 B200s
- ~8 layers per GPU, ~118GB weights per GPU (fits in 183GB)
- 3-phase pipeline: load weights → JIT compile → inference
- Activations move between GPUs at layer boundaries (NVLink)
- No streaming, no shard caching, no per-layer CPU loads
- Includes timing for each phase
2026-05-30 23:36:14 +00:00
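Note on 7a95983e0f: a minimal sketch of round-robin layer placement, where layer i goes to GPU i mod 8. The function name is hypothetical, and the layer count of 61 is taken from a later commit message in this log.

```python
def assign_layers_round_robin(n_layers, n_gpus):
    """Map each GPU index to the list of layer indices it holds."""
    if n_gpus <= 0:
        raise ValueError("n_gpus must be positive")
    placement = {gpu: [] for gpu in range(n_gpus)}
    for layer in range(n_layers):
        placement[layer % n_gpus].append(layer)
    return placement
```

Under this placement consecutive layers always sit on different GPUs, so activations cross a device boundary at every layer.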
aac0fa1f08 Update STATUS.md + MEMORY.md: single-shot inference verified 2026-05-30 22:59:27 +00:00
11c010e567 Update output section: kernel verified, architecture gaps noted 2026-05-30 22:58:49 +00:00
53178d2536 Add emergency RMSNorm after residuals (missing mHC fallback)
Without mHC, values explode to 761K after first layer.
Added per-residual RMSNorm + BF16 clamp to keep values bounded.
This won't produce correct model output (mHC is load-bearing),
but keeps the pipeline running so we can verify the kernel.
2026-05-30 22:56:16 +00:00
172ba75e0c Add per-layer NaN check to track where values diverge 2026-05-30 22:54:57 +00:00
ec7846e28c Add NaN tracking to single_shot_inference 2026-05-30 22:53:09 +00:00