113 Commits

Author SHA1 Message Date
a6a8755439 single_shot: switch to head-packed FMHA dispatch (1 kernel launch vs 128) 2026-05-31 23:33:32 +00:00
80002f2efc single_shot: production NVFP4 GEMM for ALL attention projections
- Nvfp4Linear (CuTeDSL) for q_a, q_b, kv, o_b — NO more dequant+matmul
- Production FMHA (6-warp TMA multi-tile) with per-head sink bias
- Production MoE + Router + SharedExpert + mHC (unchanged)
- wo_a still uses BF16 grouped BMM (checkpoint is BF16)
- Compressor/Indexer still PyTorch ref (not yet on tensor cores)
- Proper weight dimensions: q_a(7168->1536), q_b(1536->65536), kv(7168->512), o_b(16384->7168)
2026-05-31 23:28:16 +00:00
32efd5139d Fix gate weight transpose: checkpoint is (E, H), Router expects (H, E) 2026-05-31 23:21:09 +00:00
e45c0ff51b single_shot: use reference dequant for attn projections, focus on MoE+FMHA
Nvfp4Linear is causing CUDA context corruption (likely the CuTeDSL JIT
triggered by _ensure_initialized). Disable it for now to validate
the critical paths first:
- Production FMHA with sink bias
- Production MoE (Nvfp4MoE + Nvfp4SharedExpert)
- Production Router (dense/hash)
- Production mHC

Attention projections use reference dequant+matmul for now.
Will re-enable Nvfp4Linear after validating MoE path.
2026-05-31 23:20:04 +00:00
dfbffa1df1 single_shot: CUDA_LAUNCH_BLOCKING for debugging 2026-05-31 23:18:35 +00:00
a66fdf6049 single_shot: add sync to catch CUDA errors early 2026-05-31 23:17:46 +00:00
0b35c36d23 single_shot: memory-efficient MoE loading, lazy Nvfp4Linear init
- MoE expert weights loaded per-expert to GPU (no huge CPU tensors)
- Nvfp4Linear finalize_weights deferred (lazy on first forward)
- Shared expert weights loaded directly to GPU
- Added GPU cache cleanup at start
- Fixed shared expert finalize_weights (now lazy)
2026-05-31 23:16:45 +00:00
050b5ee449 Fix n_h reference before assignment in single_shot 2026-05-31 23:14:24 +00:00
13be3ad443 FMHA sink bias in kernel + single_shot production rewrite
FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh):
- Added sink_bias field to FmhaTmaMultiRowMultiTileParams
- After KV tile loop, sink logit is included in online softmax rescale:
  new_max = max(running_max, sink_bias * scale)
  rescale existing O_unnorm and running_sum
  running_sum += exp(sink_bias * scale - new_max)
  No PV contribution from sink (D5c: single softmax)
- C API: fmha_multitile_decode_launch now takes sink_bias_ptr
- Python: fmha_multitile_decode_raw accepts attn_sink tensor

single_shot_inference.py:
- Full rewrite to use production kernel stack
- mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp)
- Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b
- FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback)
- MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback)
- Router: production dense/hash dispatch
- Compressor/Indexer: reference dequant (not yet on tensor cores)
- NO try/except fallbacks on production paths
2026-05-31 23:10:13 +00:00
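
Note on 13be3ad443: the sink update listed in that commit message can be sketched in plain Python. This is a scalar model for illustration, not the CUDA kernel. The function name `attend_with_sink`, the tile size, and the use of one scalar V per position are assumptions, and `sink_logit` stands for the already-scaled `sink_bias * scale`.

```python
import math

def attend_with_sink(scores, values, sink_logit, tile=2):
    """Online softmax over KV tiles, then fold in a sink logit with no PV term."""
    running_max = float("-inf")
    running_sum = 0.0
    o_unnorm = 0.0  # unnormalized output (scalar V per position, for brevity)

    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        rescale = math.exp(running_max - new_max)  # exp(-inf) == 0.0 on the first tile
        running_sum *= rescale
        o_unnorm *= rescale
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            o_unnorm += p * v
        running_max = new_max

    # Sink step, after the KV tile loop: it only enlarges the denominator.
    new_max = max(running_max, sink_logit)
    rescale = math.exp(running_max - new_max)
    o_unnorm *= rescale
    running_sum = running_sum * rescale + math.exp(sink_logit - new_max)
    return o_unnorm / running_sum
```

Because the sink adds to `running_sum` but never to `o_unnorm`, the result matches a single softmax over the KV logits plus the sink, with the sink column dropped before the V product.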
23e88638aa single_shot: memory-efficient MoE loading (CPU stacking, one-shot GPU transfer)
Build stacked (E, N, K) tensors incrementally on CPU, then move to GPU
in one shot. Avoids holding 384 individual expert weight+scale tensors
on GPU simultaneously (~3x memory savings per layer).
2026-05-31 22:55:11 +00:00
92200367f3 FMHA kernel fix: N_orig vs N_padded — correct softmax masking for seq_len < 128
ROOT CAUSE: fmha_multitile_op.py padded N to 128 for TMA alignment
but then passed the PADDED N to the kernel as s_k (logical KV length).
This told the kernel all 128 entries were valid, so softmax ran over
zeros, diluting the result (e.g. 1 valid entry → softmax weight 1/128).

FIX: Pass N_orig (true sequence length) as s_k for softmax masking,
and N_padded (physical size) only for TMA descriptor creation.
The kernel's existing col < kv_len guard correctly excludes padded
entries from row_max and exp_sum calculations.

Files changed:
- fmha_multitile_capi.cu: accept N_orig + N_padded, use N_orig for
  params.s_k and N_padded for TMA descriptors
- fmha_multitile_op.py: pass N_orig and N_padded separately
- single_shot_inference.py: removed SDPA fallback (kernel now correct)
2026-05-31 22:52:39 +00:00
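
Note on 92200367f3: the dilution described in that commit message is easy to reproduce in plain Python. The helper name `softmax_weights` is hypothetical, and the list stands in for one padded row of attention scores. The only point is the difference between passing the padded length and the true length as `kv_len`.

```python
import math

def softmax_weights(scores, kv_len):
    """Softmax over the first kv_len entries; padded columns get weight 0."""
    valid = scores[:kv_len]
    m = max(valid)
    exps = [math.exp(s - m) for s in valid]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(scores) - kv_len)

N_ORIG, N_PADDED = 1, 128
scores = [0.0] * N_PADDED  # one real score followed by zero padding

diluted = softmax_weights(scores, N_PADDED)[0]  # padded length used as kv_len
correct = softmax_weights(scores, N_ORIG)[0]    # true length used as kv_len
print(diluted, correct)  # 0.0078125 1.0
```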
d40821c843 single_shot: fix memory (no double-loading MoE weights), FMHA short-seq fallback
- Don't cache MoE/SE expert weights in layer_w (handled by runners)
  This saves ~10.6GB/layer × 61 = ~647GB of double-loaded GPU memory
- Add FMHA fallback for seq_len < 128 (known kernel limitation:
  zero-padding dilutes softmax). TODO: fix kernel to mask padded entries.
- Free all_w and empty GPU caches after building runners
2026-05-31 22:49:15 +00:00
91568e12d4 single_shot_inference.py: production kernel stack version
- FMHA: 6-warp TMA multi-tile kernel via dsv4_attention
- MoE: Nvfp4MoE (CuTeDSL NVFP4 grouped GEMM, fused SwiGLU)
- Shared expert: Nvfp4SharedExpert (CuTeDSL NVFP4 single-group GEMM)
- Router: production dense/hash router kernels
- Compressor: CSA/HCA token-level softmax
- Indexer: score+topk
- mHC: Sinkhorn-Knopp, B_l transposed, [pre,post,comb]
- No PyTorch SDPA, no F.linear for kernel paths
- Falls back to dequant BF16 only if production kernels fail
- FP32 RoPE cache (BF16 destroys cos²+sin²=1)
2026-05-31 22:45:44 +00:00
fb96c34b89 rename: single_shot_inference.py → single_shot_PYTORCH_REFERENCE.py 2026-05-31 22:42:06 +00:00
acc20dffd7 CRITICAL FIX: don't fold input_scale into NVFP4 weight dequant
input_scale is the activation quantization scale (for FP8 inputs).
Since we use BF16 activations, the weight dequant is simply:
  lut[weight] * weight_scale * weight_scale_2

Folding input_scale in produced weights ~4000x too small,
causing all attention and FFN outputs to be effectively zero.
2026-05-31 22:03:55 +00:00
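
Note on acc20dffd7: a minimal sketch of the dequant formula from that commit message, `lut[weight] * weight_scale * weight_scale_2`. The lookup table holds the standard FP4 E2M1 values. The block size of 16, the unpacked one-code-per-element layout, and the function name are assumptions for illustration and are not taken from the repository.

```python
# FP4 E2M1 code -> value; codes 8..15 are the negated versions of codes 0..7.
E2M1_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
BLOCK = 16  # assumed number of elements sharing one block scale

def dequant_nvfp4(codes, weight_scale, weight_scale_2):
    """Dequantize unpacked 4-bit codes; input_scale is deliberately not used."""
    return [E2M1_LUT[c] * weight_scale[i // BLOCK] * weight_scale_2
            for i, c in enumerate(codes)]

codes = [2] * 16 + [7] * 16  # two blocks of 16 codes
weights = dequant_nvfp4(codes, weight_scale=[0.5, 0.25], weight_scale_2=2.0)
print(weights[0], weights[16])  # 1.0 3.0
```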
4e64acbb64 fix MoE gate BF16/NVFP4 handling, add attention diagnostics 2026-05-31 21:57:47 +00:00
0d2b5ceb93 fix positions device mismatch: move to rope cache device in forward_attention 2026-05-31 21:54:56 +00:00
2676476013 fix mHC pre_block bmm dtype mismatch: A is FP32, X is BF16 2026-05-31 21:51:59 +00:00
eb08cd06d1 Rewrite single_shot_inference.py: correct weight keys, NVFP4 two-level scale, compressor+indexer connected
- Fixed weight key format: model.layers.{li}.self_attn.* (was layers.{li}.attn.*)
- Added NVFP4 two-level scale: weight_scale * weight_scale_2 * input_scale
- Proper CSA compressor: overlapping Ca/Cb streams, token-level softmax
- Proper HCA compressor: non-overlapping, single stream
- Indexer: NVFP4 q_b_proj + weights_proj + own compressor at index_head_dim
- Compressed KV (dim=hd) concatenated with SWA KV for attention
- Correct MoE key format: gate_proj/up_proj/down_proj
- Correct mHC key format: attn_hc.{fn,base,scale} and ffn_hc.{fn,base,scale}
- No more disconnected compressor — full E2E pipeline
2026-05-31 21:48:59 +00:00
52b4971711 Full E2E single-shot: compressor, indexer, correct checkpoint keys (layers.{li}.attn/ffn)
- Fixed checkpoint key prefix: layers.{li}.attn.* and layers.{li}.ffn.*
  (was incorrectly model.layers.{li}.self_attn.* and .mlp.*)
- Added Compressor (CSA ratio=4 overlapping, HCA ratio=128)
- Added Indexer (CSA top-k selection)
- Compressor wkv/wgate are BF16 (NOT NVFP4 — no .scale)
- MoE gate is BF16 (not NVFP4)
- Added KV cache with SWA ring buffer + compressed entries
- Attention sinks as logit bias (paper D5c)
- YaRN RoPE with factor=16
- Proper mHC with Sinkhorn-Knopp
- HcHead for final mHC readout
- Still TODO: proper compressed KV attention (currently SWA-only)
2026-05-31 21:36:17 +00:00
23f1cf4065 Fix HcHead: use FP32 for RMSNorm + linear (matches HF reference) 2026-05-31 21:13:21 +00:00
274ea13251 Fix critical bug: add hc_head for final mHC readout (was using stream 0)
The model uses DeepseekV4HyperHead to project from the 4-stream mHC
residual to the final hidden state. Just taking stream 0 (X[:,0,:])
is WRONG — the hc_head learns how to combine the 4 streams.

Also:
- Remove --no-thinking mode (this is a reasoning model, it MUST think)
- Increase default max_tokens from 512 to 4096
- Load hc_head weights (fn, base, scale) from checkpoint
2026-05-31 21:13:02 +00:00
abe4210367 Add compact per-layer residual trace (GROWTH_DIAG), disable verbose ATTN_DIAG 2026-05-31 20:21:03 +00:00
a1b39adcaa Add attention entropy diag (ATTN_DIAG), KV cache diag, --no-thinking mode 2026-05-31 19:29:55 +00:00
2a886fe0f2 Add --no-thinking mode to skip thinking tokens and use second-best 2026-05-31 19:24:21 +00:00
41ef0ebd0f Add KV cache length diagnostic during decode 2026-05-31 19:17:24 +00:00
8baebf3c2e Restore --skip-mhc arg, empty system prompt for testing 2026-05-31 19:04:53 +00:00
ca661d32e8 Empty system prompt for testing (was causing model to regurgitate AI assistant tokens) 2026-05-31 19:03:55 +00:00
b09b2cf511 Fix MoE routing: hash layers 0-2 (tid2eid), e_score_correction_bias for layers 3+
- Layers 0-2 use hash routing (tid2eid lookup, uniform weights)
- Layers 3+ use noaux_tc (sqrt(softplus) + e_score_correction_bias for selection only)
- Fixed e_bias key name: e_score_correction_bias (not e_bias)
- Hash routing detection: check tid2eid present AND e_score_correction_bias absent
2026-05-31 18:52:38 +00:00
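
Note on b09b2cf511: the detection rule and the "bias for selection only" behavior can be sketched as follows. Function names are hypothetical, the checkpoint layer is modeled as a set of key names, and leaving the returned weights unnormalized is an assumption, since the commit message does not say how they are scaled.

```python
import math

def router_kind(layer_keys):
    """Hash routing iff tid2eid is present AND e_score_correction_bias is absent."""
    if "tid2eid" in layer_keys and "e_score_correction_bias" not in layer_keys:
        return "hash"
    return "noaux_tc"

def select_experts(logits, bias, k):
    """Pick top-k by score + bias, but return the unbiased scores as weights."""
    scores = [math.sqrt(math.log1p(math.exp(x))) for x in logits]  # sqrt(softplus)
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:k]
    return chosen, [scores[i] for i in chosen]
```

With `logits=[0.0, 1.0, 2.0]` and `bias=[5.0, 0.0, 0.0]`, expert 0 is selected because of its bias, yet its returned weight is the unbiased `sqrt(softplus(0.0))`.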
7b123d159f CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax
Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection):
1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb]
   - Was applying Sinkhorn to post values and 2*sigmoid to comb values
   - This caused residual to grow unbounded (no doubly-stochastic constraint)
2. comb (B_l) must be TRANSPOSED in post_block
   - HF: comb.transpose(-1,-2) @ hidden_streams
   - Was using B_l @ X_l without transpose
3. Sinkhorn must start from softmax(logits) + eps, not exp(logits)
   - HF: softmax → col norm → (iters-1) alternating
   - Was using exp → alternating (different convergence behavior)
4. Missing hc_eps on pre (A_l)
   - HF: sigmoid(...) + hc_eps
   - Was missing the eps guard
5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout
   - Matches checkpoint naming and HF model
6. Fixed fallback mHC initialization to use new API
2026-05-31 18:38:12 +00:00
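
Note on 7b123d159f: item 3 of that commit message (softmax, then column normalization, then alternating rounds) can be sketched in plain Python. This follows the wording of the commit message only. The softmax axis (rows), the `eps` value, the iteration count, and the function name are assumptions, and the HF reference may differ in detail.

```python
import math

def sinkhorn(logits, iters=20, eps=1e-6):
    """softmax(logits) + eps, one column normalization, then (iters - 1) row/column rounds."""
    n = len(logits)

    # Start from a row-wise softmax plus eps, not from exp(logits).
    m = []
    for row in logits:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        m.append([e / total + eps for e in exps])

    def norm_cols(a):
        sums = [sum(a[i][j] for i in range(n)) for j in range(n)]
        return [[a[i][j] / sums[j] for j in range(n)] for i in range(n)]

    def norm_rows(a):
        return [[x / sum(row) for x in row] for row in a]

    m = norm_cols(m)
    for _ in range(iters - 1):
        m = norm_cols(norm_rows(m))
    return m
```

For a 4x4 input the result is approximately doubly stochastic (all row sums and column sums close to 1), which is the constraint the commit message says was missing when the residual grew unbounded.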
581c4170f9 Fix sink logits shape: (n_h, T, 1) for concatenation with (n_h, T, seq_len) 2026-05-31 11:57:23 +00:00
0f951a0b1a Fix attention sinks: logit bias (HuggingFace reference), not dummy KV
The HuggingFace reference treats attention sinks as a logit bias:
1. Compute raw Q*K scores
2. Concatenate sinks as a logit column
3. Softmax the combined logits
4. DROP the sink column (the sink weight is never multiplied by V)
5. Multiply the remaining weights by V

Our old code added sinks as a dummy zero-KV entry, which diluted
attention weights by adding an extra V=0 position to the softmax.
2026-05-31 11:53:43 +00:00
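
Note on 0f951a0b1a: the five steps in that commit message, written out for a single query row. This is a plain-Python sketch with one scalar V per position. The function name is hypothetical and it is not the HuggingFace code.

```python
import math

def attention_with_sink_bias(scores, values, sink_logit):
    # 1. `scores` are the raw Q*K logits for one query row.
    # 2. Concatenate the sink as one extra logit column.
    logits = scores + [sink_logit]
    # 3. Softmax over the combined logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Drop the sink column; it has no V row.
    kept = probs[:-1]
    # 5. Multiply the remaining weights by V.
    out = sum(p * v for p, v in zip(kept, values))
    return out, kept
```

The kept weights sum to less than 1, since the sink absorbs part of the probability mass without contributing any value.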
daed594902 CRITICAL FIX: Add missing q_b_norm (unweighted RMSNorm after q_b_proj)
The HuggingFace reference (DeepseekV4ForCausalLM) applies an unweighted
RMSNorm after q_b_proj, normalizing Q before attention. Without it, Q
magnitudes are too large, causing attention scores to collapse to uniform
(entropy ~3.2 with 24 positions) and the model to produce garbage.

q_b_norm has no learnable parameters — just q / RMS(q).
This explains the nearly-uniform attention weights we've been seeing.
2026-05-31 11:47:16 +00:00
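
Note on daed594902: an unweighted RMSNorm as described in that commit message, `q / RMS(q)`, sketched in plain Python. The `eps` guard, the function name, and applying it to one vector at a time are assumptions. The commit message does not state the epsilon or the normalization axis.

```python
import math

def unweighted_rms_norm(q, eps=1e-6):
    """Divide q by its root-mean-square; there is no learnable weight."""
    rms = math.sqrt(sum(x * x for x in q) / len(q) + eps)
    return [x / rms for x in q]
```

The output is invariant to the input scale: `[3.0, 4.0]` and `[30.0, 40.0]` normalize to the same vector (up to `eps`), which is why large Q magnitudes no longer reach the attention scores.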
dd50c355a6 Fix MHC_DIAG null check when SKIP_MHC is enabled 2026-05-31 11:37:32 +00:00
631e6ea3e4 Add --skip-mhc flag for simple residual diagnostic
When enabled, bypasses mHC pre/post blocks and uses direct residual
connections with 0.1 scaling. This helps isolate whether the mHC
implementation is causing the garbage output.
2026-05-31 11:33:41 +00:00
d201a9334e CRITICAL FIX: Add YaRN RoPE scaling (factor=16)
The DSV4 Pro model uses rope_type='yarn' with factor=16. Our
build_rope_cache was using standard RoPE with theta=10000, completely
ignoring YaRN scaling. This produced wrong cos/sin values for all
positions, causing incorrect attention scores and garbage output.

YaRN modifies the RoPE frequencies:
- High-frequency components: unchanged
- Low-frequency components: scaled by 1/factor
- Medium: smooth interpolation

Config: factor=16, beta_fast=32, beta_slow=1, orig_max_pos=65536
2026-05-31 11:25:52 +00:00
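
Note on d201a9334e: a sketch of YaRN inverse frequencies using the config values quoted in that commit message. It follows the commonly used YaRN formulation (linear ramp between two correction dimensions). `dim=64` and `base=10000.0` are assumptions taken from other commit messages in this log, the function name is hypothetical, and the YaRN attention-scaling factor is left out.

```python
import math

def yarn_inv_freq(dim=64, base=10000.0, factor=16.0,
                  beta_fast=32.0, beta_slow=1.0, orig_max_pos=65536):
    def correction_dim(num_rotations):
        # Dimension index whose component completes `num_rotations`
        # full turns over orig_max_pos positions.
        return (dim * math.log(orig_max_pos / (num_rotations * 2 * math.pi))
                / (2 * math.log(base)))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # avoid division by zero in the ramp

    inv_freq = []
    for i in range(dim // 2):
        extrap = 1.0 / base ** (2 * i / dim)  # standard RoPE frequency
        interp = extrap / factor              # frequency scaled by 1/factor
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        # ramp == 0: high frequency, unchanged; ramp == 1: fully scaled.
        inv_freq.append(interp * ramp + extrap * (1.0 - ramp))
    return inv_freq
```

Under these assumed values the ramp starts at index 20, so the first 21 frequencies equal standard RoPE and the remaining ones are blended toward `1/factor` of their original value.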
88719f39b4 Add single-layer trace (Phase 2.6) for detailed debugging 2026-05-31 11:20:46 +00:00
8256e23aed Fix mHCContext attribute access (not tuple unpacking) and enable attention diag 2026-05-31 11:10:37 +00:00
72c139a59f Enable MHC_DIAG for diagnostic run 2026-05-31 11:07:23 +00:00
cd661c2e40 Add attention and Q/KV diagnostics (MHC_DIAG flag) 2026-05-31 11:07:17 +00:00
9584fcbc23 Fix top5_ids variable name in decode logging 2026-05-31 10:54:40 +00:00
a6d56d10ca Add top-20 logging and thinking token detection in decode loop 2026-05-31 10:49:28 +00:00
d891ae7e96 Fix prompt format: use DeepSeek V4 chat tokens
The model was trained with DeepSeek-specific chat tokens:
  <|User|> (128803), <|Assistant|> (128804), <|EOT|> (128805)
  Thinking: fi (128821), fl (128822)

Previous manual assembly just concatenated raw text without these tokens,
causing the model to not recognize user/assistant boundaries.

Format: <BOS><|User|>system prompt\n\nuser prompt<|Assistant|>
2026-05-31 10:33:41 +00:00
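
Note on d891ae7e96: the prompt layout from that commit message as a small string builder. The function name is hypothetical, and the `"<BOS>"` default is only a placeholder, because the commit message does not give the literal BOS string. A real run would also need the tokenizer to map `<|User|>` and `<|Assistant|>` to their single special-token ids.

```python
def build_prompt(system_prompt, user_prompt, bos="<BOS>"):
    """Assemble: <BOS><|User|>system prompt\\n\\nuser prompt<|Assistant|>"""
    return f"{bos}<|User|>{system_prompt}\n\n{user_prompt}<|Assistant|>"

print(build_prompt("Be brief.", "Hi"))
```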
f86742ef8e Cache layer weights on GPU — eliminates per-token CPU→GPU transfer
Previously, each prefill/decode token re-transferred ALL layer weights
from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made
prefill ~36s/token and caused the test to appear stuck.

Now: one-time cache_all_layer_weights() loads all 61 layers to their
target GPUs. Prefill should be ~1-2s/token instead of ~36s.

Also added flush=True to print statements so progress is visible.
2026-05-31 10:28:25 +00:00
ce3d6069cc CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post]
All three mHC parameter tensors (fn, base, scale) share the same ordering
as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)].

Previous code loaded base as [pre(4), post(4), res(16)] and scale as
[alpha_pre, alpha_post, alpha_res] — swapping S_res and S_post, and
alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to
be computed with wrong bias values, allowing the residual to explode.

Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums,
C_l values) to verify doubly-stochastic constraint is satisfied.
2026-05-31 10:07:14 +00:00
9a43e9aa77 CRITICAL FIX: mHC fn weight row ordering was wrong
fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw]
in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which
shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the
Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the
residual to explode (|X| 0.8 → 160K across 61 layers).

Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post
Wrong:   fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
2026-05-31 10:02:57 +00:00
0346e479d4 Add system prompt, CLI args, inverse RoPE flag, minimal e2e test
- System prompt added via chat template (reasoning model needs instructions)
- MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens)
- --no-inverse-rope flag to test without inverse RoPE on attn output
- --skip-moe flag to debug with shared expert only
- --max-tokens and --prompt CLI overrides
- minimal_e2e_test(): processes 'The' through full model, checks logits,
  tracks per-layer residual stream evolution, reports NaN/Inf/spread
- INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims,
  first 448 always un-RoPE'd, relative encoding may be intentional
2026-05-31 09:56:18 +00:00
33004dcbf4 Fix expert weight broadcasting (wt.item() for scalar multiply) 2026-05-31 09:22:27 +00:00
a2ee78b564 Fix RoPE shape bug (interleave needs separate even/odd assembly) 2026-05-31 09:15:59 +00:00
9d96c2fbbf CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip
BF16 cos/sin cache destroys cos²+sin²=1 identity (can be 0.996 in BF16).
This causes ~3% error per RoPE→inverse RoPE round-trip, accumulating
across 61 layers into garbage output. FP32 cache + FP32 arithmetic
gives a near-exact round-trip (diff < 1e-7).

Also fixes: MoE expert loop indentation (was only running last expert).
2026-05-31 09:14:59 +00:00
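
Note on 9d96c2fbbf: the loss of the cos²+sin²=1 identity can be reproduced without a GPU by emulating BF16 rounding with the standard library. `to_bf16` is a hypothetical helper that rounds the float32 bit pattern to its upper 16 bits (round to nearest even). The angle is chosen for illustration only.

```python
import math
import struct

def to_bf16(x):
    """Round a Python float to the nearest BF16 value, returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

angle = math.atan2(0.8, 0.6)  # cos(angle) = 0.6, sin(angle) = 0.8
c, s = math.cos(angle), math.sin(angle)

exact_norm = c * c + s * s
bf16_norm = to_bf16(c) ** 2 + to_bf16(s) ** 2
print(exact_norm, bf16_norm)
```

For this angle the BF16 values are 0.6015625 and 0.80078125, so the squared norm is off by about 0.3%, the same order of magnitude as the 0.996 quoted in the commit message.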