Commit Graph

1973 Commits

Author SHA1 Message Date
1dbc57e2cd fix: use mma_tiler_mn in _create_tiled_mma (attribute exists at init time) 2026-05-31 23:36:01 +00:00
d05dd50bf5 fix: OperandMajorMode.K not MAJOR_K (correct CuTeDSL API) 2026-05-31 23:34:54 +00:00
a6a8755439 single_shot: switch to head-packed FMHA dispatch (1 kernel launch vs 128) 2026-05-31 23:33:32 +00:00
80002f2efc single_shot: production NVFP4 GEMM for ALL attention projections
- Nvfp4Linear (CuTeDSL) for q_a, q_b, kv, o_b — NO more dequant+matmul
- Production FMHA (6-warp TMA multi-tile) with per-head sink bias
- Production MoE + Router + SharedExpert + mHC (unchanged)
- wo_a still uses BF16 grouped BMM (checkpoint is BF16)
- Compressor/Indexer still PyTorch ref (not yet on tensor cores)
- Proper weight dimensions: q_a(7168->1536), q_b(1536->65536), kv(7168->512), o_b(16384->7168)
2026-05-31 23:28:16 +00:00
32efd5139d Fix gate weight transpose: checkpoint is (E, H), Router expects (H, E) 2026-05-31 23:21:09 +00:00
e45c0ff51b single_shot: use reference dequant for attn projections, focus on MoE+FMHA
Nvfp4Linear causing CUDA context corruption (likely CuTeDSL JIT
triggered by _ensure_initialized). Disable for now to validate
the critical paths first:
- Production FMHA with sink bias
- Production MoE (Nvfp4MoE + Nvfp4SharedExpert)
- Production Router (dense/hash)
- Production mHC

Attention projections use reference dequant+matmul for now.
Will re-enable Nvfp4Linear after validating MoE path.
2026-05-31 23:20:04 +00:00
dfbffa1df1 single_shot: CUDA_LAUNCH_BLOCKING for debugging 2026-05-31 23:18:35 +00:00
a66fdf6049 single_shot: add sync to catch CUDA errors early 2026-05-31 23:17:46 +00:00
0b35c36d23 single_shot: memory-efficient MoE loading, lazy Nvfp4Linear init
- MoE expert weights loaded per-expert to GPU (no huge CPU tensors)
- Nvfp4Linear finalize_weights deferred (lazy on first forward)
- Shared expert weights loaded directly to GPU
- Added GPU cache cleanup at start
- Fixed shared expert finalize_weights (now lazy)
2026-05-31 23:16:45 +00:00
050b5ee449 Fix n_h reference before assignment in single_shot 2026-05-31 23:14:24 +00:00
c5adbbfde6 FMHA sink: don't double-scale sink bias
The sink bias from the checkpoint is already in the scaled domain
(added to QK*scale in the reference softmax). The kernel's
running_max is max(QK*scale), so the sink should be compared
directly without multiplying by scale again.
2026-05-31 23:12:20 +00:00
4adee1207f FMHA: zero-init my_p_vals to fix N<128 padding NaN
When N<128, padded KV positions have my_p_vals[col] uninitialized
for col >= kv_len. The PV GEMM then computes garbage_P × zero_V,
which can produce NaN on tensor cores (0 × NaN = NaN).
Fix: zero-initialize my_p_vals so padded positions contribute 0.
2026-05-31 23:11:12 +00:00
13be3ad443 FMHA sink bias in kernel + single_shot production rewrite
FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh):
- Added sink_bias field to FmhaTmaMultiRowMultiTileParams
- After KV tile loop, sink logit is included in online softmax rescale:
  new_max = max(running_max, sink_bias * scale)
  rescale existing O_unnorm and running_sum
  running_sum += exp(sink_bias * scale - new_max)
  No PV contribution from sink (D5c: single softmax)
- C API: fmha_multitile_decode_launch now takes sink_bias_ptr
- Python: fmha_multitile_decode_raw accepts attn_sink tensor

single_shot_inference.py:
- Full rewrite to use production kernel stack
- mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp)
- Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b
- FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback)
- MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback)
- Router: production dense/hash dispatch
- Compressor/Indexer: reference dequant (not yet on tensor cores)
- NO try/except fallbacks on production paths
2026-05-31 23:10:13 +00:00
23e88638aa single_shot: memory-efficient MoE loading (CPU stacking, one-shot GPU transfer)
Build stacked (E, N, K) tensors incrementally on CPU, then move to GPU
in one shot. Avoids holding 384 individual expert weight+scale tensors
on GPU simultaneously (~3x memory savings per layer).
2026-05-31 22:55:11 +00:00
92200367f3 FMHA kernel fix: N_orig vs N_padded — correct softmax masking for seq_len < 128
ROOT CAUSE: fmha_multitile_op.py padded N to 128 for TMA alignment
but then passed the PADDED N to the kernel as s_k (logical KV length).
This told the kernel all 128 entries were valid, so softmax ran over
zeros, diluting the result (e.g. 1 valid entry → softmax weight 1/128).

FIX: Pass N_orig (true sequence length) as s_k for softmax masking,
and N_padded (physical size) only for TMA descriptor creation.
The kernel's existing col < kv_len guard correctly excludes padded
entries from row_max and exp_sum calculations.

Files changed:
- fmha_multitile_capi.cu: accept N_orig + N_padded, use N_orig for
  params.s_k and N_padded for TMA descriptors
- fmha_multitile_op.py: pass N_orig and N_padded separately
- single_shot_inference.py: removed SDPA fallback (kernel now correct)
2026-05-31 22:52:39 +00:00
d40821c843 single_shot: fix memory (no double-loading MoE weights), FMHA short-seq fallback
- Don't cache MoE/SE expert weights in layer_w (handled by runners)
  This saves ~10.6GB/layer × 61 = ~647GB of double-loaded GPU memory
- Add FMHA fallback for seq_len < 128 (known kernel limitation:
  zero-padding dilutes softmax). TODO: fix kernel to mask padded entries.
- Free all_w and empty GPU caches after building runners
2026-05-31 22:49:15 +00:00
91568e12d4 single_shot_inference.py: production kernel stack version
- FMHA: 6-warp TMA multi-tile kernel via dsv4_attention
- MoE: Nvfp4MoE (CuTeDSL NVFP4 grouped GEMM, fused SwiGLU)
- Shared expert: Nvfp4SharedExpert (CuTeDSL NVFP4 single-group GEMM)
- Router: production dense/hash router kernels
- Compressor: CSA/HCA token-level softmax
- Indexer: score+topk
- mHC: Sinkhorn-Knopp, B_l transposed, [pre,post,comb]
- No PyTorch SDPA, no F.linear for kernel paths
- Falls back to dequant BF16 only if production kernels fail
- FP32 RoPE cache (BF16 destroys cos²+sin²=1)
2026-05-31 22:45:44 +00:00
fb96c34b89 rename: single_shot_inference.py → single_shot_PYTORCH_REFERENCE.py 2026-05-31 22:42:06 +00:00
79d1a83348 Add NEXT_STEPS.md: post v0.1 issues, kernel migration plan, lessons learned 2026-05-31 22:30:34 +00:00
acc20dffd7 CRITICAL FIX: don't fold input_scale into NVFP4 weight dequant
input_scale is the activation quantization scale (for FP8 inputs).
Since we use BF16 activations, the weight dequant is simply:
  lut[weight] * weight_scale * weight_scale_2

Folding input_scale in produced weights ~4000x too small,
causing all attention and FFN outputs to be effectively zero.
v0.1-e2e-working
2026-05-31 22:03:55 +00:00
4e64acbb64 fix MoE gate BF16/NVFP4 handling, add attention diagnostics 2026-05-31 21:57:47 +00:00
0d2b5ceb93 fix positions device mismatch: move to rope cache device in forward_attention 2026-05-31 21:54:56 +00:00
2676476013 fix mHC pre_block bmm dtype mismatch: A is FP32, X is BF16 2026-05-31 21:51:59 +00:00
eb08cd06d1 Rewrite single_shot_inference.py: correct weight keys, NVFP4 two-level scale, compressor+indexer connected
- Fixed weight key format: model.layers.{li}.self_attn.* (was layers.{li}.attn.*)
- Added NVFP4 two-level scale: weight_scale * weight_scale_2 * input_scale
- Proper CSA compressor: overlapping Ca/Cb streams, token-level softmax
- Proper HCA compressor: non-overlapping, single stream
- Indexer: NVFP4 q_b_proj + weights_proj + own compressor at index_head_dim
- Compressed KV (dim=hd) concatenated with SWA KV for attention
- Correct MoE key format: gate_proj/up_proj/down_proj
- Correct mHC key format: attn_hc.{fn,base,scale} and ffn_hc.{fn,base,scale}
- No more disconnected compressor — full E2E pipeline
2026-05-31 21:48:59 +00:00
4988e77179 probe key format 2026-05-31 21:42:52 +00:00
ba915dbd53 add probe_shapes script 2026-05-31 21:41:31 +00:00
c54dd15550 find hc keys 2026-05-31 21:38:43 +00:00
52b4971711 Full E2E single-shot: compressor, indexer, correct checkpoint keys (layers.{li}.attn/ffn)
- Fixed checkpoint key prefix: layers.{li}.attn.* and layers.{li}.ffn.*
  (was incorrectly model.layers.{li}.self_attn.* and .mlp.*)
- Added Compressor (CSA ratio=4 overlapping, HCA ratio=128)
- Added Indexer (CSA top-k selection)
- Compressor wkv/wgate are BF16 (NOT NVFP4 — no .scale)
- MoE gate is BF16 (not NVFP4)
- Added KV cache with SWA ring buffer + compressed entries
- Attention sinks as logit bias (paper D5c)
- YaRN RoPE with factor=16
- Proper mHC with Sinkhorn-Knopp
- HcHead for final mHC readout
- Still TODO: proper compressed KV attention (currently SWA-only)
2026-05-31 21:36:17 +00:00
cec17fee7d fixed prefix 2026-05-31 21:26:04 +00:00
696f3261ab focused key dump 2026-05-31 21:25:31 +00:00
b7c9bb1262 dump all keys 2026-05-31 21:24:58 +00:00
54e2a3684a filter expert keys 2026-05-31 21:24:35 +00:00
bafabda01f add checkpoint key dump script 2026-05-31 21:24:14 +00:00
23f1cf4065 Fix HcHead: use FP32 for RMSNorm + linear (matches HF reference) 2026-05-31 21:13:21 +00:00
274ea13251 Fix critical bug: add hc_head for final mHC readout (was using stream 0)
The model uses DeepseekV4HyperHead to project from the 4-stream mHC
residual to the final hidden state. Just taking stream 0 (X[:,0,:])
is WRONG — the hc_head learns how to combine the 4 streams.

Also:
- Remove --no-thinking mode (this is a reasoning model, it MUST think)
- Increase default max_tokens from 512 to 4096
- Load hc_head weights (fn, base, scale) from checkpoint
2026-05-31 21:13:02 +00:00
baee36e728 Fix dtype mismatch in validate_layer: cast flat to float before F.linear 2026-05-31 20:23:18 +00:00
46c4ef2cf5 Add per-layer validation test (tests/validate_layer.py)
Compares forward_layer output with step-by-step PyTorch reference
to identify where residual blowup originates. Uses our own NVFP4
dequant — no HF dependency.
2026-05-31 20:22:13 +00:00
abe4210367 Add compact per-layer residual trace (GROWTH_DIAG), disable verbose ATTN_DIAG 2026-05-31 20:21:03 +00:00
98fa410167 Add HF reference test script 2026-05-31 20:11:37 +00:00
a1b39adcaa Add attention entropy diag (ATTN_DIAG), KV cache diag, --no-thinking mode 2026-05-31 19:29:55 +00:00
2a886fe0f2 Add --no-thinking mode to skip thinking tokens and use second-best 2026-05-31 19:24:21 +00:00
41ef0ebd0f Add KV cache length diagnostic during decode 2026-05-31 19:17:24 +00:00
8baebf3c2e Restore --skip-mhc arg, empty system prompt for testing 2026-05-31 19:04:53 +00:00
ca661d32e8 Empty system prompt for testing (was causing model to regurgitate AI assistant tokens) 2026-05-31 19:03:55 +00:00
b09b2cf511 Fix MoE routing: hash layers 0-2 (tid2eid), e_score_correction_bias for layers 3+
- Layers 0-2 use hash routing (tid2eid lookup, uniform weights)
- Layers 3+ use noaux_tc (sqrt(softplus) + e_score_correction_bias for selection only)
- Fixed e_bias key name: e_score_correction_bias (not e_bias)
- Hash routing detection: check tid2eid present AND e_score_correction_bias absent
2026-05-31 18:52:38 +00:00
7d9e70c5d5 Fix remaining mHC API references: layer_compare.py, layer.py comment 2026-05-31 18:38:34 +00:00
7b123d159f CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax
Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection):
1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb]
   - Was applying Sinkhorn to post values and 2*sigmoid to comb values
   - This caused residual to grow unbounded (no doubly-stochastic constraint)
2. comb (B_l) must be TRANSPOSED in post_block
   - HF: comb.transpose(-1,-2) @ hidden_streams
   - Was using B_l @ X_l without transpose
3. Sinkhorn must start from softmax(logits) + eps, not exp(logits)
   - HF: softmax → col norm → (iters-1) alternating
   - Was using exp → alternating (different convergence behavior)
4. Missing hc_eps on pre (A_l)
   - HF: sigmoid(...) + hc_eps
   - Was missing the eps guard
5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout
   - Matches checkpoint naming and HF model
6. Fixed fallback mHC initialization to use new API
2026-05-31 18:38:12 +00:00
f6c02f808f Add layer-by-layer comparison test for debugging 2026-05-31 12:48:43 +00:00
6ad577bd18 Add HuggingFace reference comparison test 2026-05-31 12:05:19 +00:00
581c4170f9 Fix sink logits shape: (n_h, T, 1) for concatenation with (n_h, T, seq_len) 2026-05-31 11:57:23 +00:00