nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	a6a8755439	single_shot: switch to head-packed FMHA dispatch (1 kernel launch vs 128)	2026-05-31 23:33:32 +00:00
biondizzle	80002f2efc	single_shot: production NVFP4 GEMM for ALL attention projections - Nvfp4Linear (CuTeDSL) for q_a, q_b, kv, o_b — NO more dequant+matmul - Production FMHA (6-warp TMA multi-tile) with per-head sink bias - Production MoE + Router + SharedExpert + mHC (unchanged) - wo_a still uses BF16 grouped BMM (checkpoint is BF16) - Compressor/Indexer still PyTorch ref (not yet on tensor cores) - Proper weight dimensions: q_a(7168->1536), q_b(1536->65536), kv(7168->512), o_b(16384->7168)	2026-05-31 23:28:16 +00:00
biondizzle	32efd5139d	Fix gate weight transpose: checkpoint is (E, H), Router expects (H, E)	2026-05-31 23:21:09 +00:00
biondizzle	e45c0ff51b	single_shot: use reference dequant for attn projections, focus on MoE+FMHA Nvfp4Linear causing CUDA context corruption (likely CuTeDSL JIT triggered by _ensure_initialized). Disable for now to validate the critical paths first: - Production FMHA with sink bias - Production MoE (Nvfp4MoE + Nvfp4SharedExpert) - Production Router (dense/hash) - Production mHC Attention projections use reference dequant+matmul for now. Will re-enable Nvfp4Linear after validating MoE path.	2026-05-31 23:20:04 +00:00
biondizzle	dfbffa1df1	single_shot: CUDA_LAUNCH_BLOCKING for debugging	2026-05-31 23:18:35 +00:00
biondizzle	a66fdf6049	single_shot: add sync to catch CUDA errors early	2026-05-31 23:17:46 +00:00
biondizzle	0b35c36d23	single_shot: memory-efficient MoE loading, lazy Nvfp4Linear init - MoE expert weights loaded per-expert to GPU (no huge CPU tensors) - Nvfp4Linear finalize_weights deferred (lazy on first forward) - Shared expert weights loaded directly to GPU - Added GPU cache cleanup at start - Fixed shared expert finalize_weights (now lazy)	2026-05-31 23:16:45 +00:00
biondizzle	050b5ee449	Fix n_h reference before assignment in single_shot	2026-05-31 23:14:24 +00:00
biondizzle	13be3ad443	FMHA sink bias in kernel + single_shot production rewrite FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh): - Added sink_bias field to FmhaTmaMultiRowMultiTileParams - After KV tile loop, sink logit is included in online softmax rescale: new_max = max(running_max, sink_bias * scale) rescale existing O_unnorm and running_sum running_sum += exp(sink_bias * scale - new_max) No PV contribution from sink (D5c: single softmax) - C API: fmha_multitile_decode_launch now takes sink_bias_ptr - Python: fmha_multitile_decode_raw accepts attn_sink tensor single_shot_inference.py: - Full rewrite to use production kernel stack - mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp) - Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b - FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback) - MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback) - Router: production dense/hash dispatch - Compressor/Indexer: reference dequant (not yet on tensor cores) - NO try/except fallbacks on production paths	2026-05-31 23:10:13 +00:00
biondizzle	23e88638aa	single_shot: memory-efficient MoE loading (CPU stacking, one-shot GPU transfer) Build stacked (E, N, K) tensors incrementally on CPU, then move to GPU in one shot. Avoids holding 384 individual expert weight+scale tensors on GPU simultaneously (~3x memory savings per layer).	2026-05-31 22:55:11 +00:00
biondizzle	92200367f3	FMHA kernel fix: N_orig vs N_padded — correct softmax masking for seq_len < 128 ROOT CAUSE: fmha_multitile_op.py padded N to 128 for TMA alignment but then passed the PADDED N to the kernel as s_k (logical KV length). This told the kernel all 128 entries were valid, so softmax ran over zeros, diluting the result (e.g. 1 valid entry → softmax weight 1/128). FIX: Pass N_orig (true sequence length) as s_k for softmax masking, and N_padded (physical size) only for TMA descriptor creation. The kernel's existing col < kv_len guard correctly excludes padded entries from row_max and exp_sum calculations. Files changed: - fmha_multitile_capi.cu: accept N_orig + N_padded, use N_orig for params.s_k and N_padded for TMA descriptors - fmha_multitile_op.py: pass N_orig and N_padded separately - single_shot_inference.py: removed SDPA fallback (kernel now correct)	2026-05-31 22:52:39 +00:00
biondizzle	d40821c843	single_shot: fix memory (no double-loading MoE weights), FMHA short-seq fallback - Don't cache MoE/SE expert weights in layer_w (handled by runners) This saves ~10.6GB/layer × 61 = ~647GB of double-loaded GPU memory - Add FMHA fallback for seq_len < 128 (known kernel limitation: zero-padding dilutes softmax). TODO: fix kernel to mask padded entries. - Free all_w and empty GPU caches after building runners	2026-05-31 22:49:15 +00:00
biondizzle	91568e12d4	single_shot_inference.py: production kernel stack version - FMHA: 6-warp TMA multi-tile kernel via dsv4_attention - MoE: Nvfp4MoE (CuTeDSL NVFP4 grouped GEMM, fused SwiGLU) - Shared expert: Nvfp4SharedExpert (CuTeDSL NVFP4 single-group GEMM) - Router: production dense/hash router kernels - Compressor: CSA/HCA token-level softmax - Indexer: score+topk - mHC: Sinkhorn-Knopp, B_l transposed, [pre,post,comb] - No PyTorch SDPA, no F.linear for kernel paths - Falls back to dequant BF16 only if production kernels fail - FP32 RoPE cache (BF16 destroys cos²+sin²=1)	2026-05-31 22:45:44 +00:00
biondizzle	fb96c34b89	rename: single_shot_inference.py → single_shot_PYTORCH_REFERENCE.py	2026-05-31 22:42:06 +00:00
biondizzle	acc20dffd7	CRITICAL FIX: don't fold input_scale into NVFP4 weight dequant input_scale is the activation quantization scale (for FP8 inputs). Since we use BF16 activations, the weight dequant is simply: lut[weight] * weight_scale * weight_scale_2 Folding input_scale in produced weights ~4000x too small, causing all attention and FFN outputs to be effectively zero.	2026-05-31 22:03:55 +00:00
biondizzle	4e64acbb64	fix MoE gate BF16/NVFP4 handling, add attention diagnostics	2026-05-31 21:57:47 +00:00
biondizzle	0d2b5ceb93	fix positions device mismatch: move to rope cache device in forward_attention	2026-05-31 21:54:56 +00:00
biondizzle	2676476013	fix mHC pre_block bmm dtype mismatch: A is FP32, X is BF16	2026-05-31 21:51:59 +00:00
biondizzle	eb08cd06d1	Rewrite single_shot_inference.py: correct weight keys, NVFP4 two-level scale, compressor+indexer connected - Fixed weight key format: model.layers.{li}.self_attn.* (was layers.{li}.attn.*) - Added NVFP4 two-level scale: weight_scale * weight_scale_2 * input_scale - Proper CSA compressor: overlapping Ca/Cb streams, token-level softmax - Proper HCA compressor: non-overlapping, single stream - Indexer: NVFP4 q_b_proj + weights_proj + own compressor at index_head_dim - Compressed KV (dim=hd) concatenated with SWA KV for attention - Correct MoE key format: gate_proj/up_proj/down_proj - Correct mHC key format: attn_hc.{fn,base,scale} and ffn_hc.{fn,base,scale} - No more disconnected compressor — full E2E pipeline	2026-05-31 21:48:59 +00:00
biondizzle	52b4971711	Full E2E single-shot: compressor, indexer, correct checkpoint keys (layers.{li}.attn/ffn) - Fixed checkpoint key prefix: layers.{li}.attn.* and layers.{li}.ffn.* (was incorrectly model.layers.{li}.self_attn.* and .mlp.*) - Added Compressor (CSA ratio=4 overlapping, HCA ratio=128) - Added Indexer (CSA top-k selection) - Compressor wkv/wgate are BF16 (NOT NVFP4 — no .scale) - MoE gate is BF16 (not NVFP4) - Added KV cache with SWA ring buffer + compressed entries - Attention sinks as logit bias (paper D5c) - YaRN RoPE with factor=16 - Proper mHC with Sinkhorn-Knopp - HcHead for final mHC readout - Still TODO: proper compressed KV attention (currently SWA-only)	2026-05-31 21:36:17 +00:00
biondizzle	23f1cf4065	Fix HcHead: use FP32 for RMSNorm + linear (matches HF reference)	2026-05-31 21:13:21 +00:00
biondizzle	274ea13251	Fix critical bug: add hc_head for final mHC readout (was using stream 0) The model uses DeepseekV4HyperHead to project from the 4-stream mHC residual to the final hidden state. Just taking stream 0 (X[:,0,:]) is WRONG — the hc_head learns how to combine the 4 streams. Also: - Remove --no-thinking mode (this is a reasoning model, it MUST think) - Increase default max_tokens from 512 to 4096 - Load hc_head weights (fn, base, scale) from checkpoint	2026-05-31 21:13:02 +00:00
biondizzle	abe4210367	Add compact per-layer residual trace (GROWTH_DIAG), disable verbose ATTN_DIAG	2026-05-31 20:21:03 +00:00
biondizzle	a1b39adcaa	Add attention entropy diag (ATTN_DIAG), KV cache diag, --no-thinking mode	2026-05-31 19:29:55 +00:00
biondizzle	2a886fe0f2	Add --no-thinking mode to skip thinking tokens and use second-best	2026-05-31 19:24:21 +00:00
biondizzle	41ef0ebd0f	Add KV cache length diagnostic during decode	2026-05-31 19:17:24 +00:00
biondizzle	8baebf3c2e	Restore --skip-mhc arg, empty system prompt for testing	2026-05-31 19:04:53 +00:00
biondizzle	ca661d32e8	Empty system prompt for testing (was causing model to regurgitate AI assistant tokens)	2026-05-31 19:03:55 +00:00
biondizzle	b09b2cf511	Fix MoE routing: hash layers 0-2 (tid2eid), e_score_correction_bias for layers 3+ - Layers 0-2 use hash routing (tid2eid lookup, uniform weights) - Layers 3+ use noaux_tc (sqrt(softplus) + e_score_correction_bias for selection only) - Fixed e_bias key name: e_score_correction_bias (not e_bias) - Hash routing detection: check tid2eid present AND e_score_correction_bias absent	2026-05-31 18:52:38 +00:00
biondizzle	7b123d159f	CRITICAL FIX: mHC fn/base/scale ordering [pre,post,comb] + comb transposed + Sinkhorn softmax Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection): 1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb] - Was applying Sinkhorn to post values and 2*sigmoid to comb values - This caused residual to grow unbounded (no doubly-stochastic constraint) 2. comb (B_l) must be TRANSPOSED in post_block - HF: comb.transpose(-1,-2) @ hidden_streams - Was using B_l @ X_l without transpose 3. Sinkhorn must start from softmax(logits) + eps, not exp(logits) - HF: softmax → col norm → (iters-1) alternating - Was using exp → alternating (different convergence behavior) 4. Missing hc_eps on pre (A_l) - HF: sigmoid(...) + hc_eps - Was missing the eps guard 5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout - Matches checkpoint naming and HF model 6. Fixed fallback mHC initialization to use new API	2026-05-31 18:38:12 +00:00
biondizzle	581c4170f9	Fix sink logits shape: (n_h, T, 1) for concatenation with (n_h, T, seq_len)	2026-05-31 11:57:23 +00:00
biondizzle	0f951a0b1a	Fix attention sinks: logit bias (HuggingFace reference), not dummy KV The HuggingFace reference treats attention sinks as a logit bias: 1. Compute raw Q*K scores 2. Concatenate sinks as a logit column 3. Softmax the combined logits 4. DROP the sink column (don't multiply by V) 5. Multiply by V Our old code added sinks as a dummy zero-KV entry, which diluted attention weights by adding an extra V=0 position to the softmax.	2026-05-31 11:53:43 +00:00
biondizzle	daed594902	CRITICAL FIX: Add missing q_b_norm (unweighted RMSNorm after q_b_proj) The HuggingFace reference (DeepseekV4ForCausalLM) applies an unweighted RMSNorm after q_b_proj, normalizing Q before attention. Without it, Q magnitudes are too large, causing attention scores to collapse to uniform (entropy ~3.2 with 24 positions) and the model to produce garbage. q_b_norm has no learnable parameters — just q / RMS(q). This explains the nearly-uniform attention weights we've been seeing.	2026-05-31 11:47:16 +00:00
biondizzle	dd50c355a6	Fix MHC_DIAG null check when SKIP_MHC is enabled	2026-05-31 11:37:32 +00:00
biondizzle	631e6ea3e4	Add --skip-mhc flag for simple residual diagnostic When enabled, bypasses mHC pre/post blocks and uses direct residual connections with 0.1 scaling. This helps isolate whether the mHC implementation is causing the garbage output.	2026-05-31 11:33:41 +00:00
biondizzle	d201a9334e	CRITICAL FIX: Add YaRN RoPE scaling (factor=16) The DSV4 Pro model uses rope_type='yarn' with factor=16. Our build_rope_cache was using standard RoPE with theta=10000, completely ignoring YaRN scaling. This produced wrong cos/sin values for all positions, causing incorrect attention scores and garbage output. YaRN modifies the RoPE frequencies: - High-frequency components: unchanged - Low-frequency components: scaled by 1/factor - Medium: smooth interpolation Config: factor=16, beta_fast=32, beta_slow=1, orig_max_pos=65536	2026-05-31 11:25:52 +00:00
biondizzle	88719f39b4	Add single-layer trace (Phase 2.6) for detailed debugging	2026-05-31 11:20:46 +00:00
biondizzle	8256e23aed	Fix mHCContext attribute access (not tuple unpacking) and enable attention diag	2026-05-31 11:10:37 +00:00
biondizzle	72c139a59f	Enable MHC_DIAG for diagnostic run	2026-05-31 11:07:23 +00:00
biondizzle	cd661c2e40	Add attention and Q/KV diagnostics (MHC_DIAG flag)	2026-05-31 11:07:17 +00:00
biondizzle	9584fcbc23	Fix top5_ids variable name in decode logging	2026-05-31 10:54:40 +00:00
biondizzle	a6d56d10ca	Add top-20 logging and thinking token detection in decode loop	2026-05-31 10:49:28 +00:00
biondizzle	d891ae7e96	Fix prompt format: use DeepSeek V4 chat tokens The model was trained with DeepSeek-specific chat tokens: <｜User｜> (128803), <｜Assistant｜> (128804), <|EOT|> (128805) Thinking: ﬁ (128821), ﬂ (128822) Previous manual assembly just concatenated raw text without these tokens, causing the model to not recognize user/assistant boundaries. Format: <BOS><｜User｜>system prompt\n\nuser prompt<｜Assistant｜>	2026-05-31 10:33:41 +00:00
biondizzle	f86742ef8e	Cache layer weights on GPU — eliminates per-token CPU→GPU transfer Previously, each prefill/decode token re-transferred ALL layer weights from CPU to GPU (66 tokens × 61 layers = 4026 transfers). This made prefill ~36s/token and caused the test to appear stuck. Now: one-time cache_all_layer_weights() loads all 61 layers to their target GPUs. Prefill should be ~1-2s/token instead of ~36s. Also added flush=True to print statements so progress is visible.	2026-05-31 10:28:25 +00:00
biondizzle	ce3d6069cc	CRITICAL FIX: mHC base/scale ordering matches fn ordering [pre, res, post] All three mHC parameter tensors (fn, base, scale) share the same ordering as _dynamic_params' A/B/C split: [pre(4), res(16), post(4)]. Previous code loaded base as [pre(4), post(4), res(16)] and scale as [alpha_pre, alpha_post, alpha_res] — swapping S_res and S_post, and alpha_res and alpha_post. This caused the Sinkhorn-Knopp B_l matrix to be computed with wrong bias values, allowing the residual to explode. Also: added MHC_DIAG flag for per-layer diagnostics (B_l row/col sums, C_l values) to verify doubly-stochastic constraint is satisfied.	2026-05-31 10:07:14 +00:00
biondizzle	9a43e9aa77	CRITICAL FIX: mHC fn weight row ordering was wrong fn rows are [W_pre(4), W_res(16), W_post(4)] matching [A_raw, B_raw, C_raw] in _dynamic_params. Was loading as [W_pre(4), W_post(4), W_res(16)] which shifted W_res rows by 4 and loaded wrong rows as W_post. This caused the Sinkhorn-Knopp B_l matrix to be computed from wrong weights, allowing the residual to explode (|X| 0.8 → 160K across 61 layers). Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post Wrong: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res	2026-05-31 10:02:57 +00:00
biondizzle	0346e479d4	Add system prompt, CLI args, inverse RoPE flag, minimal e2e test - System prompt added via chat template (reasoning model needs instructions) - MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens) - --no-inverse-rope flag to test without inverse RoPE on attn output - --skip-moe flag to debug with shared expert only - --max-tokens and --prompt CLI overrides - minimal_e2e_test(): processes 'The' through full model, checks logits, tracks per-layer residual stream evolution, reports NaN/Inf/spread - INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims, first 448 always un-RoPE'd, relative encoding may be intentional	2026-05-31 09:56:18 +00:00
biondizzle	33004dcbf4	Fix expert weight broadcasting (wt.item() for scalar multiply)	2026-05-31 09:22:27 +00:00
biondizzle	a2ee78b564	Fix RoPE shape bug (interleave needs separate even/odd assembly)	2026-05-31 09:15:59 +00:00
biondizzle	9d96c2fbbf	CRITICAL FIX: FP32 RoPE cache + FP32 arithmetic for inverse RoPE round-trip BF16 cos/sin cache destroys cos²+sin²=1 identity (can be 0.996 in BF16). This causes ~3% error per RoPE→inverse RoPE round-trip, accumulating across 61 layers into garbage output. FP32 cache + FP32 arithmetic gives exact round-trip (diff < 1e-7). Also fixes: MoE expert loop indentation (was only running last expert).	2026-05-31 09:14:59 +00:00
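
Note on the sink-bias handling described in commits 13be3ad443 and 0f951a0b1a: the messages say the sink is a logit that joins the softmax but contributes nothing to PV, and that the kernel folds it into the online-softmax rescale after the KV tile loop. The following is a minimal pure-Python sketch of that idea, not the kernel or the repository's code. The function names, the `tile` parameter, and the list-based layout are hypothetical, and `sink_logit` is assumed to be already scaled (the commit writes it as `sink_bias * scale`).

```python
import math

def attend_with_sink_reference(scores, values, sink_logit):
    """Softmax over [scores..., sink_logit], drop the sink column, then multiply by V."""
    logits = list(scores) + [sink_logit]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps[:-1]]  # sink column dropped: no V contribution
    dim = len(values[0])
    return [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]

def attend_with_sink_online(scores, values, sink_logit, tile=2):
    """Same result via online softmax: tile loop first, sink folded in afterwards."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    o_unnorm = [0.0] * dim
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        alpha = math.exp(running_max - new_max)  # exp(-inf) == 0.0 on the first tile
        running_sum *= alpha
        o_unnorm = [o * alpha for o in o_unnorm]
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            o_unnorm = [o + w * vd for o, vd in zip(o_unnorm, v)]
        running_max = new_max
    # Sink step: rescale existing state, add the sink to the denominator only.
    new_max = max(running_max, sink_logit)
    alpha = math.exp(running_max - new_max)
    running_sum = running_sum * alpha + math.exp(sink_logit - new_max)
    o_unnorm = [o * alpha for o in o_unnorm]
    return [o / running_sum for o in o_unnorm]
```

Both functions should agree to floating-point tolerance. Because the sink only enlarges the denominator, the attention weights over real keys sum to less than 1, which differs from appending a zero-valued KV entry only in that no extra V row is ever read.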


113 Commits