Commit Graph

1879 Commits

Author SHA1 Message Date
aafe2eee12 CRITICAL FIX: FP4 LUT was 4x too large!
E2M1 magnitudes are [0, 0.5, 1, 1.5, 2, 3, 4, 6] NOT [0, 2, 3, 4, 6, 8, 12, 24].
The old LUT was 4x the correct values, causing every NVFP4 dequantized
weight to be 4x too large. This compounded across 61 layers, causing
the residual stream to explode and producing gibberish output.

This is the root cause of the residual growth and incoherent generation.
2026-05-31 04:16:13 +00:00
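
Note: a minimal sketch of E2M1 decoding, to make the corrected magnitude table above concrete. The function names are hypothetical and not taken from the repository. `dequantize` assumes the dequantized value is the decoded code times a per-block scale times a global scale, which is how an earlier entry describes the NVFP4 path. The second decoder rebuilds each value from its bit fields as a cross-check on the table.

```python
# E2M1 (FP4) layout: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# Magnitudes for the 3 low bits, as listed in the commit message.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) via the lookup table."""
    if not 0 <= code <= 0xF:
        raise ValueError(f"not a 4-bit code: {code}")
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def decode_e2m1_from_bits(code: int) -> float:
    """Decode the same code from its bit fields (cross-check on the table)."""
    sign = -1.0 if code & 0x8 else 1.0
    exponent = (code >> 1) & 0x3
    mantissa = code & 0x1
    if exponent == 0:
        magnitude = 0.5 * mantissa  # subnormal: 0 or 0.5
    else:
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


def dequantize(codes, block_scale: float, global_scale: float):
    """Hypothetical helper: decoded value * block scale * global scale."""
    return [decode_e2m1(c) * block_scale * global_scale for c in codes]
```

A table whose entries are uniformly k times too large scales every dequantized weight by k, which is why a LUT error shows up as growth in every layer rather than in one place.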
b8c8da91fe fix: restore RoPE functions that were lost during mHC refactor 2026-05-31 04:10:51 +00:00
3f04a72af4 refactor: use production mHCLayer from dsv4.layers.mhc
Replace custom mHCBlock with wrapper around the tested production
mHCLayer class. This eliminates any bugs in my custom implementation
and uses the same code path that the model was designed for.

Weight mapping: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
base[0:4]=S_pre, base[4:8]=S_post, base[8:24]=S_res
scale[0]=alpha_pre, scale[1]=alpha_post, scale[2]=alpha_res
2026-05-31 04:06:58 +00:00
b519108cab fix: restore kv_cache.append that was accidentally removed 2026-05-31 03:56:58 +00:00
22a89b5a45 add attention sinks to SDPA path (paper D5c) 2026-05-31 03:52:59 +00:00
1905f19b8d fix: define q_input before USE_SDPA branch 2026-05-31 03:45:09 +00:00
cd073ad867 use PyTorch SDPA for correctness (no sink bias in FMHA kernel yet) 2026-05-31 03:42:03 +00:00
171a9e0d10 disable diagnostics for clean production run 2026-05-31 03:32:17 +00:00
3f9b441428 diag: fix n_layers reference in forward_layer, add late-layer diags 2026-05-31 03:28:53 +00:00
5b834a0599 diag: add late-layer diagnostics, fix ffn ctx variable 2026-05-31 03:25:55 +00:00
690c0a1121 CRITICAL FIX: mHC base/scale ordering was wrong
Checkpoint order is [pre, post, res] not [pre, res, post]:
- base[0:4] = S_pre, base[4:8] = S_post, base[8:24] = S_res
- scale[0] = alpha_pre, scale[1] = alpha_post, scale[2] = alpha_res
- W_stacked rows: [W_pre(4), W_post(4), W_res(16)]
- Projection split: A_raw=proj[:,0:4], C_raw=proj[:,4:8], B_raw=proj[:,8:24]

This was causing B_l to be near-identity and C_l to be near-2.0,
leading to exponential residual stream growth.
2026-05-31 03:16:07 +00:00
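
Note: the [pre, post, res] ordering above can be sketched as plain slicing. This is an illustration only: the helper names are hypothetical, inputs are plain Python lists rather than tensors, and reshaping the 16 `res` entries into a 4x4 matrix in row-major order is an assumption that the commit message does not state.

```python
N_HC = 4  # number of residual streams (n_hc=4 elsewhere in this log)


def split_mhc_base(base):
    """Split the 24-entry base vector as [pre(4), post(4), res(16)]."""
    if len(base) != 2 * N_HC + N_HC * N_HC:
        raise ValueError(f"expected 24 entries, got {len(base)}")
    s_pre = list(base[0:4])
    s_post = list(base[4:8])
    flat_res = list(base[8:24])
    # Assumption: row-major 4x4 layout for the res block.
    s_res = [flat_res[r * N_HC:(r + 1) * N_HC] for r in range(N_HC)]
    return s_pre, s_post, s_res


def split_mhc_scale(scale):
    """Split the 3-entry scale vector as (alpha_pre, alpha_post, alpha_res)."""
    alpha_pre, alpha_post, alpha_res = scale
    return alpha_pre, alpha_post, alpha_res


def split_projection(row):
    """Split one 24-wide projection row into (A_raw, C_raw, B_raw)."""
    if len(row) != 24:
        raise ValueError(f"expected 24 entries, got {len(row)}")
    a_raw = list(row[0:4])    # pre
    c_raw = list(row[4:8])    # post
    b_raw = list(row[8:24])   # res
    return a_raw, c_raw, b_raw
```

Reading the same vector as [pre, res, post] would hand 16 of the wrong values to the res block and 4 to post, so the error is silent at the shape level and only visible in the resulting activations.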
c3a2656c48 diag: add FFN and pre_block diagnostics 2026-05-31 03:12:52 +00:00
79ba7e6636 diag: add mHC diagnostics for first 3 layers 2026-05-31 03:10:05 +00:00
a262492e51 fix: FMHA K/V tensor shape (was permuting cache), add q_a_norm and kv_norm 2026-05-31 03:04:53 +00:00
3f12bbc374 fix: move positions tensor to correct GPU for RoPE 2026-05-31 02:54:47 +00:00
0c3d168c60 single_shot: stream weights per-layer from CPU, fix KV/RoPE logic 2026-05-31 02:53:40 +00:00
61160ace13 fix: expert_weights/ids scoping in hash routing path 2026-05-31 02:50:32 +00:00
d772885d7e single_shot_inference: proper mHC+RMSNorm+inverse RoPE pipeline
Major rewrite of single_shot_inference.py:
- Replace broken mHC (gentle normalization hack) with proper Sinkhorn-Knopp
- Add RMSNorm before each sub-block (attention + FFN)
- Add inverse RoPE on attention output (paper §2.3.3)
- Fix KV cache: RoPE applied before caching, K=V in DSV4 MQA
- Fix MoE: proper dense routing with e_bias, SwiGLU clamping
- Proper weight mapping: fn→W_stacked, base→S_pre/S_res/S_post, scale→alphas
- Add identity mHC fallback when weights missing
- No emergency normalization, no bandaids
2026-05-31 02:45:52 +00:00
523b0e47b1 Add gentle RMSNorm: only clamps when values exceed unit norm 2026-05-31 00:31:34 +00:00
dcbb74841a Remove emergency RMSNorm from mHC post_block — MoE provides balance now 2026-05-31 00:27:48 +00:00
1de241ccfe Fix: add all_tokens tracking for decode loop 2026-05-31 00:22:08 +00:00
b1dd59293a Add prefill: process prompt tokens to fill KV cache before decoding 2026-05-31 00:18:55 +00:00
178fb5483a Fix KV cache: use index 0 (one-layer cache per layer instance) 2026-05-31 00:14:58 +00:00
afcc690ddc Add full MoE routing + KV cache to single_shot
MoE:
- Hash routing (first 3 layers): tid2eid lookup → 6 experts, uniform weights
- Dense routing (remaining): sqrt(softplus(gate)) → top-6 → renormalize
- 384 NVFP4 experts, each gate+up+down with SwiGLU clamping
- Weighted combine × routed_scaling_factor + shared expert

KV cache:
- SimpleKVCache: BF16 flat (1, max_seq, hd) per layer
- Appends new K,V each decode step
- FMHA now attends over full cached sequence (not just current token)
- RoPE applied per-position on K cache

This should produce meaningful output — the model now has all
architectural components except proper mHC normalization.
2026-05-31 00:11:15 +00:00
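
Note: a minimal sketch of the dense routing step described above (sqrt(softplus(gate)) → top-6 → renormalize), in plain Python. The function names are hypothetical. The sketch leaves out the hash-routed layers, any bias term, `routed_scaling_factor`, and the shared expert, so it shows only how expert ids and weights could be selected.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route_dense(gate_logits, top_k: int = 6):
    """Return (expert_ids, weights) for one token.

    Scores are sqrt(softplus(logit)); the top_k scores are kept and
    renormalized so the returned weights sum to 1.
    """
    if not 0 < top_k <= len(gate_logits):
        raise ValueError("top_k must be in 1..len(gate_logits)")
    scores = [math.sqrt(softplus(g)) for g in gate_logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    expert_ids = order[:top_k]
    total = sum(scores[i] for i in expert_ids)
    weights = [scores[i] / total for i in expert_ids]
    return expert_ids, weights
```

Because softplus is strictly positive, every score is positive and the renormalization never divides by zero.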
3ecfbcba57 Fix T scope in post_block 2026-05-31 00:02:29 +00:00
a493f72681 Add per-residual RMSNorm in mHC post_block (routed MoE missing)
Without routed experts, F_out is always positive, causing unbounded
growth. Emergency RMSNorm on the residual keeps values bounded.
Remove once MoE is wired.
2026-05-30 23:59:19 +00:00
49282fe206 Fix mHC: match vLLM torch reference exactly
Key corrections:
- RMSNorm applied to projection output (mixes *= rsqrt(sqrsum/K + eps))
  not to the input before projection
- comb_mix uses softmax + Sinkhorn, NOT exp + Sinkhorn
- pre_mix = sigmoid(logits) + eps (not matmul with X_l)
- layer_input = sum(pre_mix * residual) — weighted sum, not bmm
- post_mix = sigmoid * hc_post_mult_value (2.0)
- bias split: [pre(4), post(4), comb(16)] not [pre(4), comb(16), post(4)]
2026-05-30 23:55:27 +00:00
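
Note: the corrections listed above can be sketched in plain Python for a single token. This is not the vLLM reference itself. The function names, the `eps` value, and the number of Sinkhorn iterations are assumptions, and the RMSNorm on the projection output and the bias/scale terms are left out.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sinkhorn(matrix, iters: int = 20):
    """Alternate column and row normalization on a positive square matrix."""
    m = [list(row) for row in matrix]
    n = len(m)
    for _ in range(iters):
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        m = [[v / sum(row) for v in row] for row in m]
    return m


def comb_mix(logits, iters: int = 20):
    """softmax per row, then Sinkhorn (not exp + Sinkhorn)."""
    return sinkhorn([softmax(row) for row in logits], iters)


def pre_mix(logits, eps: float = 1e-6):
    """pre_mix = sigmoid(logits) + eps."""
    return [sigmoid(v) + eps for v in logits]


def post_mix(logits, hc_post_mult_value: float = 2.0):
    """post_mix = sigmoid(logits) * hc_post_mult_value."""
    return [sigmoid(v) * hc_post_mult_value for v in logits]


def layer_input(pre, residual):
    """Weighted sum over the residual streams: sum_k pre[k] * residual[k]."""
    hidden = len(residual[0])
    return [sum(pre[k] * residual[k][h] for k in range(len(residual)))
            for h in range(hidden)]
```

After enough iterations `comb_mix` is close to doubly stochastic (rows and columns each sum to about 1), so mixing the residual streams through it redistributes them without changing their total.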
66a66f8244 Add per-layer NaN tracking for mHC debug 2026-05-30 23:48:32 +00:00
d003c4b7cc Add mHC (Manifold-Constrained Hyper-Connections) to single_shot
- Full mHC pre_block/post_block with Sinkhorn-Knopp normalization
- Dynamic A_l (sigmoid), B_l (Birkhoff polytope), C_l (2*sigmoid)
- Checkpoint: attn_hc.fn (24,28672) + base (24,) + scale (3,)
- Two mHC blocks per layer: attn_hc + ffn_hc
- Removed emergency RMSNorm — mHC handles normalization properly
- X_l: (1, n_hc=4, H) residual state, init from embedding broadcast
2026-05-30 23:45:18 +00:00
f567c20539 Fix: set active CUDA device per layer for BMM/FMHA 2026-05-30 23:39:45 +00:00
7a95983e0f Rewrite single_shot: 8-GPU pipeline parallel
- Loads all 95 shards, assigns layers round-robin across 8 B200s
- ~8 layers per GPU, ~118GB weights per GPU (fits in 183GB)
- 3-phase pipeline: load weights → JIT compile → inference
- Activations move between GPUs at layer boundaries (NVLink)
- No streaming, no shard caching, no per-layer CPU loads
- Includes timing for each phase
2026-05-30 23:36:14 +00:00
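
Note: a small sketch of round-robin layer placement as described above. The function names are hypothetical, and the layer count of 61 is taken from another entry in this log rather than from this commit.

```python
def assign_layers_round_robin(n_layers: int, n_gpus: int):
    """Return a list where entry i is the GPU index that holds layer i."""
    if n_gpus <= 0:
        raise ValueError("n_gpus must be positive")
    return [layer % n_gpus for layer in range(n_layers)]


def layers_per_gpu(assignment, n_gpus: int):
    """Count how many layers land on each GPU."""
    counts = [0] * n_gpus
    for gpu in assignment:
        counts[gpu] += 1
    return counts


if __name__ == "__main__":
    placement = assign_layers_round_robin(61, 8)
    print(layers_per_gpu(placement, 8))
```

With this placement consecutive layers always sit on different GPUs, so activations cross a device boundary at every layer.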
aac0fa1f08 Update STATUS.md + MEMORY.md: single-shot inference verified 2026-05-30 22:59:27 +00:00
11c010e567 Update output section: kernel verified, architecture gaps noted 2026-05-30 22:58:49 +00:00
53178d2536 Add emergency RMSNorm after residuals (missing mHC fallback)
Without mHC, values explode to 761K after first layer.
Added per-residual RMSNorm + BF16 clamp to keep values bounded.
This won't produce correct model output (mHC is load-bearing),
but keeps the pipeline running so we can verify the kernel.
2026-05-30 22:56:16 +00:00
172ba75e0c Add per-layer NaN check to track where values diverge 2026-05-30 22:54:57 +00:00
ec7846e28c Add NaN tracking to single_shot_inference 2026-05-30 22:53:09 +00:00
5fa6c88b17 Fix: replace FP4 Inf with 24 (avoid NaN in dequant) 2026-05-30 22:51:10 +00:00
904753f62a Fix: BMM batch dim alignment for wo_a 2026-05-30 22:49:21 +00:00
52df3bc26c Fix: wo_a as batched matmul (grouped linear for output projection) 2026-05-30 22:48:31 +00:00
19240608d7 Fix: handle o_a_proj grouped linear shape mismatch 2026-05-30 22:46:12 +00:00
1d02758416 Fix: kv_proj outputs hd=512 (1 KV head MQA), Z from compressor.gate_proj 2026-05-30 22:45:14 +00:00
5dcfb333ea Fix: move weight tensors to CUDA before dequant 2026-05-30 22:43:47 +00:00
47c7b3c50b Fix: ensure FP4 LUT on CUDA before index op 2026-05-30 22:43:01 +00:00
13bae9dd55 Fix single_shot: mHC replaces layernorm, no hidden-level norm in DSV4 2026-05-30 22:42:17 +00:00
e8334fc4af Rewrite single_shot_inference.py — complete forward pass
- NVFP4 dequant with proper E2M1 LUT + E4M3 scale + global scale
- RoPE (GPT-J partial, last 64 dims)
- Q low-rank projection (q_a → q_b)
- KV projection (layer-type-aware: HCA/CSA/SWA)
- Production FMHA kernel (tcgen05 MMA)
- Output projection: o_a (BF16 grouped) → o_b (NVFP4)
- Shared expert FFN (gate/up/down, SiLU)
- RMSNorm for both attention and FFN
- Streaming weight loading (one layer at a time)
2026-05-30 22:40:56 +00:00
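
Note: a minimal sketch of partial RoPE on the last `rope_dim` dimensions of one vector, for the RoPE item above. It assumes that "GPT-J" means the interleaved convention (adjacent pairs are rotated together) and uses a base of 10000, which the log does not confirm. The function name and the `inverse` flag are hypothetical; the flag simply negates the angle to undo the rotation.

```python
import math


def rope_partial(x, position: int, rope_dim: int = 64,
                 theta: float = 10000.0, inverse: bool = False):
    """Rotate the last rope_dim entries of x in adjacent (even, odd) pairs.

    Entries before the last rope_dim are returned unchanged.
    """
    if rope_dim % 2 != 0 or rope_dim > len(x):
        raise ValueError("rope_dim must be even and no larger than len(x)")
    out = list(x)
    start = len(x) - rope_dim
    sign = -1.0 if inverse else 1.0
    for i in range(rope_dim // 2):
        # Pair i uses frequency theta^(-2i / rope_dim).
        angle = sign * position * theta ** (-2.0 * i / rope_dim)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[start + 2 * i], x[start + 2 * i + 1]
        out[start + 2 * i] = a * c - b * s
        out[start + 2 * i + 1] = a * s + b * c
    return out
```

Applying the function with `inverse=True` at the same position recovers the original vector up to floating-point error, since each pair is rotated back by the same angle.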
9b0858aa35 Add single_shot_inference.py — baseline kernel verification
Streams weights one layer at a time from 95 safetensors shards.
NVFP4 dequant → BF16 matmul for baseline (production uses tcgen05 MMA).
Runs token-by-token decode loop with production FMHA kernel.

Known gaps for first run:
- FFN (MoE) skipped — not the kernel under test
- mHC simplified — not the kernel under test
- RoPE skipped in baseline
- compressor/indexer bypassed (raw KV for now)

FMHA kernel is the component under test (cos ≥ 0.999993).
2026-05-30 22:39:01 +00:00
4472928506 E3: model construction test 2026-05-30 21:22:34 +00:00
afc07a5d1a Update STATUS.md: E5 done 2026-05-30 21:21:47 +00:00
df6220abaf E5: Fold batch loop into native kernel grid (blockIdx.z)
The 6-warp multi-tile kernel already supports batch natively via
dim3 grid(1, n_h, batch). Removed Python for-loop for 4D input.
Single kernel launch per layer for batched decode instead of
batch_size launches.

T>1 prefill still uses per-batch dispatch (E8 future work).
2026-05-30 21:21:02 +00:00
e162a2d112 Update STATUS.md: E1-E4 done 2026-05-30 21:20:10 +00:00