nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	acc20dffd7	CRITICAL FIX: don't fold input_scale into NVFP4 weight dequant input_scale is the activation quantization scale (for FP8 inputs). Since we use BF16 activations, the weight dequant is simply: lut[weight] * weight_scale * weight_scale_2 Folding input_scale in produced weights ~4000x too small, causing all attention and FFN outputs to be effectively zero. v0.1-e2e-working	2026-05-31 22:03:55 +00:00
biondizzle	4e64acbb64	fix MoE gate BF16/NVFP4 handling, add attention diagnostics	2026-05-31 21:57:47 +00:00
biondizzle	0d2b5ceb93	fix positions device mismatch: move to rope cache device in forward_attention	2026-05-31 21:54:56 +00:00
biondizzle	2676476013	fix mHC pre_block bmm dtype mismatch: A is FP32, X is BF16	2026-05-31 21:51:59 +00:00
biondizzle	eb08cd06d1	Rewrite single_shot_inference.py: correct weight keys, NVFP4 two-level scale, compressor+indexer connected - Fixed weight key format: model.layers.{li}.self_attn.* (was layers.{li}.attn.*) - Added NVFP4 two-level scale: weight_scale * weight_scale_2 * input_scale - Proper CSA compressor: overlapping Ca/Cb streams, token-level softmax - Proper HCA compressor: non-overlapping, single stream - Indexer: NVFP4 q_b_proj + weights_proj + own compressor at index_head_dim - Compressed KV (dim=hd) concatenated with SWA KV for attention - Correct MoE key format: gate_proj/up_proj/down_proj - Correct mHC key format: attn_hc.{fn,base,scale} and ffn_hc.{fn,base,scale} - No more disconnected compressor: full E2E pipeline	2026-05-31 21:48:59 +00:00
biondizzle	4988e77179	probe key format	2026-05-31 21:42:52 +00:00
biondizzle	ba915dbd53	add probe_shapes script	2026-05-31 21:41:31 +00:00
biondizzle	c54dd15550	find hc keys	2026-05-31 21:38:43 +00:00
biondizzle	52b4971711	Full E2E single-shot: compressor, indexer, correct checkpoint keys (layers.{li}.attn/ffn) - Fixed checkpoint key prefix: layers.{li}.attn.* and layers.{li}.ffn.* (was incorrectly model.layers.{li}.self_attn.* and .mlp.*) - Added Compressor (CSA ratio=4 overlapping, HCA ratio=128) - Added Indexer (CSA top-k selection) - Compressor wkv/wgate are BF16 (NOT NVFP4 — no .scale) - MoE gate is BF16 (not NVFP4) - Added KV cache with SWA ring buffer + compressed entries - Attention sinks as logit bias (paper D5c) - YaRN RoPE with factor=16 - Proper mHC with Sinkhorn-Knopp - HcHead for final mHC readout - Still TODO: proper compressed KV attention (currently SWA-only)	2026-05-31 21:36:17 +00:00
biondizzle	cec17fee7d	fixed prefix	2026-05-31 21:26:04 +00:00
biondizzle	696f3261ab	focused key dump	2026-05-31 21:25:31 +00:00
biondizzle	b7c9bb1262	dump all keys	2026-05-31 21:24:58 +00:00
biondizzle	54e2a3684a	filter expert keys	2026-05-31 21:24:35 +00:00
biondizzle	bafabda01f	add checkpoint key dump script	2026-05-31 21:24:14 +00:00
biondizzle	23f1cf4065	Fix HcHead: use FP32 for RMSNorm + linear (matches HF reference)	2026-05-31 21:13:21 +00:00
biondizzle	274ea13251	Fix critical bug: add hc_head for final mHC readout (was using stream 0) The model uses DeepseekV4HyperHead to project from the 4-stream mHC residual to the final hidden state. Just taking stream 0 (X[:,0,:]) is WRONG — the hc_head learns how to combine the 4 streams. Also: - Remove --no-thinking mode (this is a reasoning model, it MUST think) - Increase default max_tokens from 512 to 4096 - Load hc_head weights (fn, base, scale) from checkpoint	2026-05-31 21:13:02 +00:00
biondizzle	baee36e728	Fix dtype mismatch in validate_layer: cast flat to float before F.linear	2026-05-31 20:23:18 +00:00
biondizzle	46c4ef2cf5	Add per-layer validation test (tests/validate_layer.py) Compares forward_layer output with step-by-step PyTorch reference to identify where residual blowup originates. Uses our own NVFP4 dequant — no HF dependency.	2026-05-31 20:22:13 +00:00
biondizzle	abe4210367	Add compact per-layer residual trace (GROWTH_DIAG), disable verbose ATTN_DIAG	2026-05-31 20:21:03 +00:00
biondizzle	98fa410167	Add HF reference test script	2026-05-31 20:11:37 +00:00
biondizzle	a1b39adcaa	Add attention entropy diag (ATTN_DIAG), KV cache diag, --no-thinking mode	2026-05-31 19:29:55 +00:00
biondizzle	2a886fe0f2	Add --no-thinking mode to skip thinking tokens and use second-best	2026-05-31 19:24:21 +00:00
biondizzle	41ef0ebd0f	Add KV cache length diagnostic during decode	2026-05-31 19:17:24 +00:00
biondizzle	8baebf3c2e	Restore --skip-mhc arg, empty system prompt for testing	2026-05-31 19:04:53 +00:00
biondizzle	ca661d32e8	Empty system prompt for testing (was causing model to regurgitate AI assistant tokens)	2026-05-31 19:03:55 +00:00
biondizzle	b09b2cf511	Fix MoE routing: hash layers 0-2 (tid2eid), e_score_correction_bias for layers 3+ - Layers 0-2 use hash routing (tid2eid lookup, uniform weights) - Layers 3+ use noaux_tc (sqrt(softplus) + e_score_correction_bias for selection only) - Fixed e_bias key name: e_score_correction_bias (not e_bias) - Hash routing detection: check tid2eid present AND e_score_correction_bias absent	2026-05-31 18:52:38 +00:00
biondizzle	7d9e70c5d5	Fix remaining mHC API references: layer_compare.py, layer.py comment	2026-05-31 18:38:34 +00:00
biondizzle	7b123d159f	CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection): 1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb] - Was applying Sinkhorn to post values and 2*sigmoid to comb values - This caused residual to grow unbounded (no doubly-stochastic constraint) 2. comb (B_l) must be TRANSPOSED in post_block - HF: comb.transpose(-1,-2) @ hidden_streams - Was using B_l @ X_l without transpose 3. Sinkhorn must start from softmax(logits) + eps, not exp(logits) - HF: softmax → col norm → (iters-1) alternating - Was using exp → alternating (different convergence behavior) 4. Missing hc_eps on pre (A_l) - HF: sigmoid(...) + hc_eps - Was missing the eps guard 5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout - Matches checkpoint naming and HF model 6. Fixed fallback mHC initialization to use new API	2026-05-31 18:38:12 +00:00
biondizzle	f6c02f808f	Add layer-by-layer comparison test for debugging	2026-05-31 12:48:43 +00:00
biondizzle	6ad577bd18	Add HuggingFace reference comparison test	2026-05-31 12:05:19 +00:00
biondizzle	581c4170f9	Fix sink logits shape: (n_h, T, 1) for concatenation with (n_h, T, seq_len)	2026-05-31 11:57:23 +00:00
biondizzle	0f951a0b1a	Fix attention sinks: logit bias (HuggingFace reference), not dummy KV The HuggingFace reference treats attention sinks as a logit bias: 1. Compute raw Q*K scores 2. Concatenate sinks as a logit column 3. Softmax the combined logits 4. DROP the sink column (don't multiply by V) 5. Multiply by V Our old code added sinks as a dummy zero-KV entry, which diluted attention weights by adding an extra V=0 position to the softmax.	2026-05-31 11:53:43 +00:00
biondizzle	daed594902	CRITICAL FIX: Add missing q_b_norm (unweighted RMSNorm after q_b_proj) The HuggingFace reference (DeepseekV4ForCausalLM) applies an unweighted RMSNorm after q_b_proj, normalizing Q before attention. Without it, Q magnitudes are too large, causing attention scores to collapse to uniform (entropy ~3.2 with 24 positions) and the model to produce garbage. q_b_norm has no learnable parameters — just q / RMS(q). This explains the nearly-uniform attention weights we've been seeing.	2026-05-31 11:47:16 +00:00
biondizzle	dd50c355a6	Fix MHC_DIAG null check when SKIP_MHC is enabled	2026-05-31 11:37:32 +00:00
biondizzle	631e6ea3e4	Add --skip-mhc flag for simple residual diagnostic When enabled, bypasses mHC pre/post blocks and uses direct residual connections with 0.1 scaling. This helps isolate whether the mHC implementation is causing the garbage output.	2026-05-31 11:33:41 +00:00
biondizzle	d201a9334e	CRITICAL FIX: Add YaRN RoPE scaling (factor=16) The DSV4 Pro model uses rope_type='yarn' with factor=16. Our build_rope_cache was using standard RoPE with theta=10000, completely ignoring YaRN scaling. This produced wrong cos/sin values for all positions, causing incorrect attention scores and garbage output. YaRN modifies the RoPE frequencies: - High-frequency components: unchanged - Low-frequency components: scaled by 1/factor - Medium: smooth interpolation Config: factor=16, beta_fast=32, beta_slow=1, orig_max_pos=65536	2026-05-31 11:25:52 +00:00
biondizzle	88719f39b4	Add single-layer trace (Phase 2.6) for detailed debugging	2026-05-31 11:20:46 +00:00
biondizzle	8256e23aed	Fix mHCContext attribute access (not tuple unpacking) and enable attention diag	2026-05-31 11:10:37 +00:00
biondizzle	72c139a59f	Enable MHC_DIAG for diagnostic run	2026-05-31 11:07:23 +00:00
biondizzle	cd661c2e40	Add attention and Q/KV diagnostics (MHC_DIAG flag)	2026-05-31 11:07:17 +00:00
biondizzle	9584fcbc23	Fix top5_ids variable name in decode logging	2026-05-31 10:54:40 +00:00
biondizzle	a6d56d10ca	Add top-20 logging and thinking token detection in decode loop	2026-05-31 10:49:28 +00:00
biondizzle	d891ae7e96	Fix prompt format: use DeepSeek V4 chat tokens The model was trained with DeepSeek-specific chat tokens: <｜User｜> (128803), <｜Assistant｜> (128804), <|EOT|> (128805) Thinking: ﬁ (128821), ﬂ (128822) Previous manual assembly just concatenated raw text without these tokens, causing the model to not recognize user/assistant boundaries. Format: <BOS><｜User｜>system prompt\n\nuser prompt<｜Assistant｜>	2026-05-31 10:33:41 +00:00
biondizzle	f86742ef8e	Cache layer weights on GPU — eliminates per-token CPU→GPU transfer Previously, each prefill/decode token re-transferred ALL layer weights from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made prefill ~36s/token and caused the test to appear stuck. Now: one-time cache_all_layer_weights() loads all 61 layers to their target GPUs. Prefill should be ~1-2s/token instead of ~36s. Also added flush=True to print statements so progress is visible.	2026-05-31 10:28:25 +00:00
biondizzle	ce3d6069cc	CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post] All three mHC parameter tensors (fn, base, scale) share the same ordering as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)]. Previous code loaded base as [pre(4), post(4), res(16)] and scale as [alpha_pre, alpha_post, alpha_res] — swapping S_res and S_post, and alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to be computed with wrong bias values, allowing the residual to explode. Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums, C_l values) to verify doubly-stochastic constraint is satisfied.	2026-05-31 10:07:14 +00:00
biondizzle	9a43e9aa77	CRITICAL FIX: mHC fn weight row ordering was wrong fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw] in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the residual to explode (|X| 0.8 → 160K across 61 layers). Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post Wrong: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res	2026-05-31 10:02:57 +00:00
biondizzle	0346e479d4	Add system prompt, CLI args, inverse RoPE flag, minimal e2e test - System prompt added via chat template (reasoning model needs instructions) - MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens) - --no-inverse-rope flag to test without inverse RoPE on attn output - --skip-moe flag to debug with shared expert only - --max-tokens and --prompt CLI overrides - minimal_e2e_test(): processes 'The' through full model, checks logits, tracks per-layer residual stream evolution, reports NaN/Inf/spread - INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims, first 448 always un-RoPE'd, relative encoding may be intentional	2026-05-31 09:56:18 +00:00
biondizzle	429fc3db40	Fix expert weight indexing for 1D tensor	2026-05-31 09:23:10 +00:00
biondizzle	33004dcbf4	Fix expert weight broadcasting (wt.item() for scalar multiply)	2026-05-31 09:22:27 +00:00
biondizzle	1434b35971	Add residual diagnostic test — per-layer magnitude tracking	2026-05-31 09:21:41 +00:00
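
Note on commit acc20dffd7: the message gives the weight dequant as lut[weight] * weight_scale * weight_scale_2, with input_scale left out because activations are BF16. The following is a minimal pure-Python sketch of that formula, not the repository's code. The function names, the 16-element block size, and the low-nibble-first packing order are assumptions made for illustration; the lookup table is the standard FP4 E2M1 value set.

```python
# FP4 (E2M1) code -> value lookup table; bit 3 of the code is the sign bit.
FP4_LUT = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
           -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

BLOCK = 16  # assumption: one weight_scale entry per 16 consecutive weights


def unpack_nibbles(packed):
    """Split each byte into two 4-bit codes (assumed order: low nibble first)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequant_row(packed, weight_scale, weight_scale_2):
    """Dequantize one packed row: lut[code] * per-block scale * per-tensor scale.

    input_scale is deliberately not used here; per the commit message it is
    the activation quantization scale and does not belong in weight dequant.
    """
    codes = unpack_nibbles(packed)
    return [FP4_LUT[c] * weight_scale[i // BLOCK] * weight_scale_2
            for i, c in enumerate(codes)]


if __name__ == "__main__":
    # 8 bytes of 0x21 -> codes 1, 2, 1, 2, ... -> FP4 values 0.5, 1.0, ...
    row = dequant_row(bytes([0x21] * 8), weight_scale=[4.0], weight_scale_2=0.25)
    print(row[:2])  # [0.5, 1.0]
```

Multiplying the result by a small activation scale as well would shrink every weight by that factor, which is consistent with the near-zero outputs the commit describes.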

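Note on commit 0f951a0b1a: the five steps listed in that message (raw scores, concatenate the sink as a logit column, softmax, drop the sink column, multiply by V) can be sketched for a single head and a single query position as below. This is an illustrative pure-Python version, not the repository's implementation; the function name and the list-based layout are assumptions.

```python
import math


def attend_with_sink(scores, sink_logit, values):
    """Attention for one query with the sink treated as a logit bias.

    scores:     raw q.k logits, one per key position
    sink_logit: the head's sink logit (hypothetical scalar here)
    values:     one V vector per key position
    """
    logits = scores + [sink_logit]           # step 2: append the sink column
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)                        # step 3: softmax over keys + sink
    probs = [e / total for e in exps[:-1]]   # step 4: drop the sink column
    dim = len(values[0])
    # step 5: weighted sum of V using the remaining (sub-normalized) weights
    return [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]


if __name__ == "__main__":
    out = attend_with_sink([0.0, 0.0], 0.0, [[1.0, 0.0], [0.0, 1.0]])
    print(out)  # each component is about 1/3: the sink absorbed one third of the mass
```

Because the sink takes part in the softmax but has no value vector, the weights applied to V sum to less than one whenever the sink logit is finite.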

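Note on commit 7b123d159f: item 3 of that message describes the Sinkhorn ordering as softmax, then a column normalization, then (iters-1) alternating rounds. A minimal sketch of that ordering follows. The iteration count, the eps value, the choice to add eps to the denominators, and the row-then-column order inside each round are assumptions, so treat this as an illustration rather than a copy of the HuggingFace reference.

```python
import math


def _normalize_rows(mat, eps):
    return [[x / (sum(row) + eps) for x in row] for row in mat]


def _normalize_cols(mat, eps):
    n_cols = len(mat[0])
    col_sums = [sum(row[c] for row in mat) for c in range(n_cols)]
    return [[row[c] / (col_sums[c] + eps) for c in range(n_cols)] for row in mat]


def sinkhorn(logits, iters=20, eps=1e-6):
    """Project a square logit matrix toward a doubly-stochastic matrix."""
    # Start from a row-wise softmax plus eps, not from exp(logits).
    mat = []
    for row in logits:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        mat.append([e / total + eps for e in exps])
    mat = _normalize_cols(mat, eps)          # first column normalization
    for _ in range(iters - 1):               # then (iters - 1) alternating rounds
        mat = _normalize_rows(mat, eps)
        mat = _normalize_cols(mat, eps)
    return mat


if __name__ == "__main__":
    comb = sinkhorn([[0.0, 0.5, 1.0, 0.2],
                     [0.3, 0.0, 0.7, 1.0],
                     [1.0, 0.4, 0.0, 0.6],
                     [0.1, 0.9, 0.5, 0.0]])
    print([round(sum(row), 4) for row in comb])  # row sums, each close to 1
```

Checking row and column sums like this mirrors the MHC_DIAG check mentioned in commit ce3d6069cc, which verifies the doubly-stochastic constraint on B_l.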
1954 Commits
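
Note on commit d201a9334e: the message describes YaRN as leaving high-frequency RoPE components unchanged, scaling low-frequency ones by 1/factor, and interpolating smoothly in between, with factor=16, beta_fast=32, beta_slow=1, orig_max_pos=65536. The sketch below follows the commonly used YaRN inverse-frequency formulation. The rotary dimension of 64 is taken from the partial-RoPE remark in commit 0346e479d4, base=10000 is an assumption, and the function name is hypothetical. It omits any attention-scaling (mscale) term.

```python
import math


def yarn_inv_freq(dim=64, base=10000.0, factor=16.0,
                  beta_fast=32.0, beta_slow=1.0, orig_max_pos=65536):
    """Blend plain and interpolated RoPE inverse frequencies with a linear ramp."""

    def correction_dim(num_rotations):
        # Index of the rotary pair that completes num_rotations turns
        # over the original context length.
        return (dim * math.log(orig_max_pos / (num_rotations * 2 * math.pi))
                / (2 * math.log(base)))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # avoid division by zero in the ramp

    inv_freq = []
    for i in range(dim // 2):
        plain = base ** (-2.0 * i / dim)                     # standard RoPE
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)  # 0 = keep, 1 = scale
        inv_freq.append(plain * (1.0 - ramp) + (plain / factor) * ramp)
    return inv_freq


if __name__ == "__main__":
    freqs = yarn_inv_freq()
    print(len(freqs), freqs[0])  # 32 1.0
```

With these parameters the ramp starts partway through the rotary pairs, so the fastest-rotating pairs keep their standard frequencies while the slowest ones are pulled toward 1/16 of their original value.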