nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	cd661c2e40	Add attention and Q/KV diagnostics (MHC_DIAG flag)	2026-05-31 11:07:17 +00:00
biondizzle	9584fcbc23	Fix top5_ids variable name in decode logging	2026-05-31 10:54:40 +00:00
biondizzle	a6d56d10ca	Add top-20 logging and thinking token detection in decode loop	2026-05-31 10:49:28 +00:00
biondizzle	d891ae7e96	Fix prompt format: use DeepSeek V4 chat tokens. The model was trained with DeepSeek-specific chat tokens: <｜User｜> (128803), <｜Assistant｜> (128804), <|EOT|> (128805). Thinking: ﬁ (128821), ﬂ (128822). Previous manual assembly just concatenated raw text without these tokens, so the model did not recognize user/assistant boundaries. Format: <BOS><｜User｜>system prompt\n\nuser prompt<｜Assistant｜>	2026-05-31 10:33:41 +00:00
biondizzle	f86742ef8e	Cache layer weights on GPU: eliminates per-token CPU→GPU transfer. Previously, each prefill/decode token re-transferred ALL layer weights from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made prefill ~36s/token and caused the test to appear stuck. Now: one-time cache_all_layer_weights() loads all 61 layers to their target GPUs. Prefill should be ~1-2s/token instead of ~36s. Also added flush=True to print statements so progress is visible.	2026-05-31 10:28:25 +00:00
biondizzle	ce3d6069cc	CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post]. All three mHC parameter tensors (fn, base, scale) share the same ordering as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)]. Previous code loaded base as [pre(4), post(4), res(16)] and scale as [alpha_pre, alpha_post, alpha_res], swapping S_res and S_post, and alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to be computed with wrong bias values, allowing the residual to explode. Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums, C_l values) to verify doubly-stochastic constraint is satisfied.	2026-05-31 10:07:14 +00:00
biondizzle	9a43e9aa77	CRITICAL FIX: mHC fn weight row ordering was wrong. fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw] in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the residual to explode (|X| 0.8 → 160K across 61 layers). Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post. Wrong: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res.	2026-05-31 10:02:57 +00:00
biondizzle	0346e479d4	Add system prompt, CLI args, inverse RoPE flag, minimal e2e test. System prompt added via chat template (reasoning model needs instructions); MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens); --no-inverse-rope flag to test without inverse RoPE on attn output; --skip-moe flag to debug with shared expert only; --max-tokens and --prompt CLI overrides; minimal_e2e_test(): processes 'The' through full model, checks logits, tracks per-layer residual stream evolution, reports NaN/Inf/spread; INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims, first 448 always un-RoPE'd, relative encoding may be intentional.	2026-05-31 09:56:18 +00:00
biondizzle	429fc3db40	Fix expert weight indexing for 1D tensor	2026-05-31 09:23:10 +00:00
biondizzle	33004dcbf4	Fix expert weight broadcasting (wt.item() for scalar multiply)	2026-05-31 09:22:27 +00:00
biondizzle	1434b35971	Add residual diagnostic test — per-layer magnitude tracking	2026-05-31 09:21:41 +00:00
biondizzle	1c18c16c68	Fix production rope.py: FP32 arithmetic for forward_rope_partial + inverse_rope_bf16	2026-05-31 09:17:36 +00:00
biondizzle	970869d017	Fix mHCBlock import + relax RoPE round-trip threshold (BF16 noise expected)	2026-05-31 09:17:07 +00:00
biondizzle	a2ee78b564	Fix RoPE shape bug (interleave needs separate even/odd assembly)	2026-05-31 09:15:59 +00:00
biondizzle	9d96c2fbbf	CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip. BF16 cos/sin cache destroys cos²+sin²=1 identity (can be 0.996 in BF16). This causes ~3% error per RoPE→inverse RoPE round-trip, accumulating across 61 layers into garbage output. FP32 cache + FP32 arithmetic gives exact round-trip (diff < 1e-7). Also fixes: MoE expert loop indentation (was only running last expert).	2026-05-31 09:14:59 +00:00
biondizzle	db74a887ab	Add minimal e2e test + fix MoE expert loop bug (indentation)	2026-05-31 09:14:03 +00:00
biondizzle	e195d9d3a7	add SKIP_ROUTED_MOE debug flag, re-enable sinks	2026-05-31 07:02:38 +00:00
biondizzle	4f28673bec	debug: disable sinks in SDPA to check |X| impact	2026-05-31 06:51:58 +00:00
biondizzle	e3db90b56c	switch back to original prompt	2026-05-31 06:40:01 +00:00
biondizzle	d2cf5ccc32	CRITICAL FIX: use SDPA for short sequences (FMHA padding bug). FMHA pads N to next multiple of 128. For N<<128 (like 5 tokens), the 123 padded zero-K entries contribute exp(0)=1 to the softmax denominator, diluting real attention weights by ~128/5 = 25.6x. This caused the model to produce incoherent output for short prompts. Fix: use SDPA for seq_len < 120 (no padding), FMHA for longer sequences where the padding effect is negligible. Also: SDPA path includes attention sinks (paper D5c), FMHA path uses analytic sink correction via LSE.	2026-05-31 06:39:23 +00:00
biondizzle	5f98855141	test with simpler prompt	2026-05-31 06:28:45 +00:00
biondizzle	152af7295a	debug: compare FMHA vs SDPA output at layer 0	2026-05-31 06:16:58 +00:00
biondizzle	59c75ca4e9	fix: cast attn_out back to BF16 after sink correction	2026-05-31 06:07:06 +00:00
biondizzle	e5245ea34e	fix: V tensor must be (B, n_h, hd, N) for FMHA — was transposed wrong	2026-05-31 06:03:13 +00:00
biondizzle	91abf0f921	FMHA + analytic sink bias correction using LSE. Instead of SDPA with virtual sink position, use the production FMHA kernel and apply the sink bias as a post-hoc correction on the output. The correction is: O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink)). This simulates the attention sink (paper D5c) without modifying the FMHA kernel. The sink absorbs some attention mass, increasing the softmax denominator and scaling down the output.	2026-05-31 05:58:01 +00:00
biondizzle	fac269c938	fix verify_attention: proper multi-head SDPA + GQA	2026-05-31 05:55:10 +00:00
biondizzle	2333fc8b4b	fix verify_attention.py: proper nvfp4_linear calls	2026-05-31 05:53:49 +00:00
biondizzle	c09f68c867	add verify_attention.py: single-layer attention component test	2026-05-31 05:51:36 +00:00
biondizzle	04dd7545b3	switch to production FMHA for full run	2026-05-31 04:51:16 +00:00
biondizzle	738088cf49	revert: K=V with RoPE + inverse RoPE is the correct DSV4 approach	2026-05-31 04:51:10 +00:00
biondizzle	781ee43521	try separate K (RoPE'd) and V (raw) — no inverse RoPE needed	2026-05-31 04:46:14 +00:00
biondizzle	889521009b	re-enable inverse RoPE (confirmed necessary — without it output is garbage)	2026-05-31 04:45:58 +00:00
biondizzle	92e465ca04	debug: disable inverse RoPE to check impact on output	2026-05-31 04:40:34 +00:00
biondizzle	c69dc51b3b	switch to SDPA with sinks (better residual control)	2026-05-31 04:38:41 +00:00
biondizzle	3ed8f3cc44	switch back to production FMHA kernel (with FP4 LUT fix)	2026-05-31 04:32:01 +00:00
biondizzle	ae79bd8fce	debug: add top-5 logit predictions	2026-05-31 04:25:01 +00:00
biondizzle	aafe2eee12	CRITICAL FIX: FP4 LUT was 4x too large! E2M1 magnitudes are [0, 0.5, 1, 1.5, 2, 3, 4, 6] NOT [0, 2, 3, 4, 6, 8, 12, 24]. The old LUT was 4x the correct values, causing every NVFP4 dequantized weight to be 4x too large. This compounded across 61 layers, causing the residual stream to explode and producing gibberish output. This is the root cause of the residual growth and incoherent generation.	2026-05-31 04:16:13 +00:00
biondizzle	b8c8da91fe	fix: restore RoPE functions that were lost during mHC refactor	2026-05-31 04:10:51 +00:00
biondizzle	3f04a72af4	refactor: use production mHCLayer from dsv4.layers.mhc. Replace custom mHCBlock with wrapper around the tested production mHCLayer class. This eliminates any bugs in my custom implementation and uses the same code path that the model was designed for. Weight mapping: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res; base[0:4]=S_pre, base[4:8]=S_post, base[8:24]=S_res; scale[0]=alpha_pre, scale[1]=alpha_post, scale[2]=alpha_res.	2026-05-31 04:06:58 +00:00
biondizzle	b519108cab	fix: restore kv_cache.append that was accidentally removed	2026-05-31 03:56:58 +00:00
biondizzle	22a89b5a45	add attention sinks to SDPA path (paper D5c)	2026-05-31 03:52:59 +00:00
biondizzle	1905f19b8d	fix: define q_input before USE_SDPA branch	2026-05-31 03:45:09 +00:00
biondizzle	cd073ad867	use PyTorch SDPA for correctness (no sink bias in FMHA kernel yet)	2026-05-31 03:42:03 +00:00
biondizzle	171a9e0d10	disable diagnostics for clean production run	2026-05-31 03:32:17 +00:00
biondizzle	3f9b441428	diag: fix n_layers reference in forward_layer, add late-layer diags	2026-05-31 03:28:53 +00:00
biondizzle	5b834a0599	diag: add late-layer diagnostics, fix ffn ctx variable	2026-05-31 03:25:55 +00:00
biondizzle	690c0a1121	CRITICAL FIX: mHC base/scale ordering was wrong. Checkpoint order is [pre, post, res] not [pre, res, post]: base[0:4] = S_pre, base[4:8] = S_post, base[8:24] = S_res; scale[0] = alpha_pre, scale[1] = alpha_post, scale[2] = alpha_res; W_stacked rows: [W_pre(4), W_post(4), W_res(16)]; Projection split: A_raw=proj[:,0:4], C_raw=proj[:,4:8], B_raw=proj[:,8:24]. This was causing B_l to be near-identity and C_l to be near-2.0, leading to exponential residual stream growth.	2026-05-31 03:16:07 +00:00
biondizzle	c3a2656c48	diag: add FFN and pre_block diagnostics	2026-05-31 03:12:52 +00:00
biondizzle	79ba7e6636	diag: add mHC diagnostics for first 3 layers	2026-05-31 03:10:05 +00:00
biondizzle	a262492e51	fix: FMHA K/V tensor shape (was permuting cache), add q_a_norm and kv_norm	2026-05-31 03:04:53 +00:00
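
Commit aafe2eee12 lists the E2M1 magnitudes as [0, 0.5, 1, 1.5, 2, 3, 4, 6]. As a rough cross-check, the sketch below derives that table from the bit layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit) rather than hard-coding it. The function names are hypothetical and are not taken from the repository.

```python
def e2m1_magnitude(code: int) -> float:
    """Decode the low 3 bits of an FP4 E2M1 code (2 exponent bits, 1 mantissa bit)."""
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the only values are 0 and 0.5.
        return 0.5 * mantissa
    # Normal: implicit leading 1 and an exponent bias of 1.
    return (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)


def e2m1_value(code: int) -> float:
    """Decode a full 4-bit code; bit 3 is the sign."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * e2m1_magnitude(code)


# Lookup table indexed by the 3 magnitude bits.
E2M1_MAGNITUDES = [e2m1_magnitude(c) for c in range(8)]
```

A table built this way can be compared against any hand-written LUT before it is used for dequantization.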

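
The padding effect described in commit d2cf5ccc32 can be reproduced with a plain softmax. This is a minimal sketch, not the FMHA kernel: it assumes the padded keys are all-zero so their scores are exactly 0, and it uses real scores of 0 as well, which is the case where the 128/5 figure holds exactly. The helper names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def real_attention_mass(real_scores, padded_len):
    """Total softmax weight left on the real keys after zero-padding to padded_len."""
    # A zero key row gives q.k = 0, so each padded slot contributes exp(0) = 1.
    padded = list(real_scores) + [0.0] * (padded_len - len(real_scores))
    weights = softmax(padded)
    return sum(weights[: len(real_scores)])


mass = real_attention_mass([0.0] * 5, 128)  # 5 real tokens padded to 128
dilution = 1.0 / mass                       # factor by which real weights shrink
```

With larger real scores the dilution is smaller than 128/5, so the figure in the commit message is best read as the worst case for near-zero logits.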
1915 Commits
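
The correction in commit 91abf0f921, O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink)), can be checked against a direct softmax that includes the sink as an extra logit with a zero value vector. The sketch below is an assumption-laden illustration: it uses scalar values instead of head vectors, the function names are hypothetical, and it writes the scale factor as a sigmoid of (lse - sink), which is algebraically the same expression but avoids overflowing exp(lse).

```python
import math


def attention_with_sink(scores, values, sink):
    """Reference: softmax over [scores..., sink], where the sink carries a zero value."""
    m = max(max(scores), sink)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return sum(e * v for e, v in zip(exps, values)) / denom


def attention_then_correct(scores, values, sink):
    """Sink-free attention followed by the post-hoc LSE correction."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    o_raw = sum(e * v for e, v in zip(exps, values)) / total
    lse = m + math.log(total)  # log-sum-exp of the real scores
    # exp(lse) / (exp(lse) + exp(sink)) == 1 / (1 + exp(sink - lse))
    scale = 1.0 / (1.0 + math.exp(sink - lse))
    return o_raw * scale
```

Both functions should agree to floating-point precision for any finite inputs, which gives a cheap unit test for the FMHA path without touching the kernel.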