nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	2866eb92e7	Fix W_gate device: ensure .to(dev) after transpose	2026-06-03 12:56:52 +00:00
biondizzle	bd10bdbbd9	Fix router gate W_gate shape: must be (H, E) not (E, H) dense_router_dispatch expects W_gate as (hidden, experts) and does W_gate.T internally. dequant_nvfp4 returns (out, in) = (E, H), so we need to transpose.	2026-06-03 12:42:17 +00:00
biondizzle	dc5a24687e	Switch router gate from NVFP4 to BF16 (dequantize) Dequantize NVFP4 gate weight to BF16 for router. Wrong top-k expert selection from NVFP4 quantization noise is much worse than a small GEMM error — one wrong expert poisons the whole token. Also: lm_head already BF16 from previous commit.	2026-06-03 12:30:34 +00:00
biondizzle	cfea22cd6f	Update PyTorch reference with official DSV4 encoding + batched prefill Only template/tokenizer/parser changes — no SwiGLU or compressor fixes. Official encoding module, chat mode, batched prefill, stop set, official parser for structured output.	2026-06-03 12:19:38 +00:00
biondizzle	bdd9ab9669	Switch lm_head from NVFP4 to BF16 GEMM Checkpoint lm_head.weight is BF16 (129280x7168). NVFP4 quantization of lm_head can flatten/reorder vocabulary logits — wrong experts in the final projection directly corrupt token selection. BF16 preserves the full logit distribution for accurate token selection. Tagged pure-nvfp4 before this change for rollback.	2026-06-03 11:37:41 +00:00
biondizzle	3320abfe24	Fix two correctness bugs: compressor pos bias on KV + SwiGLU clamp ordering 1. Compressor positional bias was being added to BOTH gate (softmax logit) AND KV content. Per paper eq. 9-12, position bias is only for the softmax logits (Z+B), NOT the KV content (C). Adding pb to kv_val corrupts every compressed KV entry with learned positional-bias content. Fixed in both CSA and HCA paths in compressor_reduce.cu. 2. SwiGLU clamp ordering: code was clamping silu(gate) instead of clamping raw gate before SiLU. Per paper §4.2.3: gate = clamp(gate, max=limit), then silu(clamp(gate)) * clamp(up). Fixed in moe.py (both unfused paths) and fused_swiglu.py (CuTeDSL kernel). shared_expert.py was already correct. pure-nvfp4	2026-06-03 11:17:49 +00:00
biondizzle	7901470e63	doc clean up v-official-encoding-path	2026-06-03 10:53:41 +00:00
biondizzle	ca7c309463	Add reference/ dir: vLLM tokenizers, reasoning parsers, tool parsers, official inference - reference/vllm/tokenizers/ — official DSV4 tokenizer + encoding (read-only) - reference/vllm/reasoning/ — thinking mode parsers (DeepSeekR1 style ) - reference/vllm/tool_parsers/ — DSML tool call parsers (V3.2 base, V4 variant) - reference/official_inference/ — original weight's generate.py, model.py, kernel.py - reference/README.md documents the layout and which files matter for our pipeline - These are read-only references for cross-checking, not imported by production code	2026-06-03 10:25:23 +00:00
biondizzle	8cfc1cae58	Canonical encoding: derive special token IDs from official encoding module + tokenizer - Remove hardcoded THINK_START/THINK_END/USER_TOKEN/ASSISTANT_TOKEN IDs - Import token strings from encoding.deepseek_v4_encoding (official source) - Resolve IDs via tokenizer.convert_tokens_to_ids() at runtime - Use parse_message_from_completion_text() for structured output parsing - No more hand-rolled prompt construction or hardcoded token IDs - Clean up TEMP: replace old deepseek_v4_ref with dsv4thing.zip reference	2026-06-03 10:23:02 +00:00
biondizzle	a86d6d90a5	Replace hand-rolled prompt with official DSV4 encoder (canonical path) - Copied deepseek_v4_encoding.py from vLLM tree to encoding/ - Replaced hand-rolled prompt construction with encode_messages() - --chat-mode → --thinking-mode (thinking|chat) - The official encoder handles: BOS, User/Assistant tokens, thinking mode, tool calls, and all special token placement. It can't drift. - This is the same code path inference engines will use.	2026-06-03 09:59:05 +00:00
biondizzle	284fc9ca86	Fix: thread comp_rope_cos/comp_rope_sin through forward_attention Previous commit added params to forward_layer but forward_attention (where compressed RoPE is applied) didn't receive them, causing NameError. Also confirmed from B200 test output: compress_rope_theta=160000 vs rope_theta=10000 — a 16x difference. The separate cache is essential.	2026-06-03 09:30:57 +00:00
biondizzle	6a3374da18	Cross-check 2 complete: block-aligned comp_pos + compress_rope_theta wired through - Fixed comp_pos: (bi*r) block-aligned instead of ((bi+1)*r-1) last-position - compress_rope_theta: separate rope cache for compressed KV entries - comp_rope_cos/comp_rope_sin wired to all forward_layer call sites (prefill chunk loop, decode loop, CUDAGraphDecoder capture) - forward_layer uses comp_rope caches for compressed RoPE, falls back to normal - Only single_shot_inference.py modified, no kernel code touched	2026-06-03 09:19:11 +00:00
biondizzle	5003e756e2	WIP: cross-check 2 fix — block-aligned compressed RoPE positions + compress_rope_theta support - CRITICAL BUG FIX: comp_pos was using LAST position of each block ((bi+1)*r-1) instead of FIRST position (bi*r). Off by r-1: 3 for CSA, 127 for HCA. vLLM uses (position // ratio) * ratio = block-aligned first position. - Added compress_rope_theta config support (vLLM uses separate theta for compressed) - Added comp_rope_cos/comp_rope_sin param to forward_layer (not yet wired through) Only single_shot_inference.py changed — no kernel code touched. Base commit: `572bdd2`	2026-06-03 09:17:54 +00:00
biondizzle	572bdd2840	auto: pre-test commit	2026-06-03 09:01:02 +00:00
biondizzle	3c06fd5591	Test 2: fix topk tensor shape (flatten before iterating)	2026-06-03 08:47:32 +00:00
biondizzle	89f6e64057	README: document test harness gotchas (timeout arg, stale procs, screen names)	2026-06-03 08:36:02 +00:00
biondizzle	29d6986dd4	Test 2: fix quantize_to_nvfp4 import	2026-06-03 08:21:39 +00:00
biondizzle	60b9bbd470	Test 2: fix import - use mHCLayer from dsv4.layers.mhc, fixed prompt encoding	2026-06-03 08:20:21 +00:00
biondizzle	1e77dfcaa0	Fix prompt encoding: remove \n\n before content per official DSV4 spec; add --chat-mode	2026-06-03 08:19:33 +00:00
biondizzle	2a42686e8e	Test 1 v2: diff hand-rolled vs official DSV4 encoding	2026-06-03 08:18:56 +00:00
biondizzle	11c2d5fe53	Add degeneration test 2: falsify mHC residual growth root cause	2026-06-03 08:18:01 +00:00
biondizzle	c77b83fffc	Add degeneration test 1: chat-template token-ID diff	2026-06-03 08:17:09 +00:00
biondizzle	c5a131c358	more doc clean up again	2026-06-03 08:14:07 +00:00
biondizzle	019a3a34b7	Clean up L0 B1 verify noise (gate on VERBOSE), update FINAL_STRETCH.md Batched prefill + T>128 chunking now complete. All dangling items in FINAL_STRETCH.md are marked done.	2026-06-03 08:12:54 +00:00
biondizzle	5e09be08af	Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill) The intermediate tensor from fused SwiGLU deinterleave is a column slice (non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and the CUDA kernel crashes with 'input must be contiguous'. Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused and in SharedExpert._run_l2. This is the root cause, not a workaround — CUDA kernels legitimately require contiguous memory.	2026-06-03 07:56:19 +00:00
biondizzle	60309ef124	Batched prefill: replace T=1 token-by-token with chunked T≤128 batch processing - Process prefill tokens in chunks of up to 128 (FMHA T≤128 constraint) - Each chunk goes through ALL 61 layers before the next chunk - KV cache append_swa, compressor, indexer all already support T>1 - FMHA dispatches to dsv4_attention_mixed_fp8_prefill for T>1 - For T>128: splits into multiple launches automatically - mHC, Router, MoE, Nvfp4Linear all handle M>1 natively - Eliminates ~N_prefill * 61 per-token overhead from the old loop	2026-06-03 07:39:37 +00:00
biondizzle	0bf276f8c9	more doc cleanup	2026-06-03 07:37:13 +00:00
biondizzle	d463ac8512	doc cleanup	2026-06-03 07:34:12 +00:00
biondizzle	7450ebc67a	CORRECTNESS_BACKLOG.md: comprehensive production pipeline verification results — all tested and confirmed findings from PART A diagnostics	2026-06-03 07:31:01 +00:00
biondizzle	9dbfac9dfa	PART A: verify kv_norm_w loaded correctly	2026-06-03 07:03:39 +00:00
biondizzle	a682c6adf4	PART A: add raw compressor output diagnostic	2026-06-03 06:56:56 +00:00
biondizzle	f2c1b3afd5	PART A: fix KV diagnostics — compute q_a before indexer, add Q_heads magnitude check	2026-06-03 06:33:51 +00:00
biondizzle	86e59c16c5	PART A: add KV gather diagnostics at blowup layer	2026-06-03 06:25:35 +00:00
biondizzle	262f844e2e	PART A: add detailed blowup diagnostics — capture mHC intermediate values when |X| > 1e6	2026-06-03 06:10:33 +00:00
biondizzle	6459fbca9a	fix: import forward_attention	2026-06-03 05:41:33 +00:00
biondizzle	91dfac34d8	PART A: simplified to production-only diagnostics — track per-layer |X| during prefill and decode, detect blowup early	2026-06-03 05:33:22 +00:00
biondizzle	d99503732d	fix: add BF16 gate weight fallback for dense routers (missing from test)	2026-06-03 05:22:47 +00:00
biondizzle	801bfc9a83	add router mode debug print	2026-06-03 05:15:52 +00:00
biondizzle	b385ecc05e	PART A: decode diagnostics test — production vs reference per-layer X comparison at decode step	2026-06-03 05:06:40 +00:00
biondizzle	d518fcb82a	test: correct sink bias reference — denominator-only, no V contribution	2026-06-03 04:57:37 +00:00
biondizzle	9574a9dc2e	test: add sink bias to reference SDPA in decode FMHA comparison	2026-06-03 04:53:55 +00:00
biondizzle	9a9b347b2b	test: add per-head magnitude ratio diagnostics to decode FMHA test	2026-06-03 04:50:23 +00:00
biondizzle	f5fa20c581	fix: syntax error — missing closing paren in indexer.forward call	2026-06-03 04:46:41 +00:00
biondizzle	693975ec92	fix: device mismatches in decode FMHA test — dec_pos must be on per-layer GPU	2026-06-03 04:46:24 +00:00
biondizzle	e1d96c509d	test: decode FMHA layer comparison — checks FMHA accuracy during decode step	2026-06-03 04:39:12 +00:00
biondizzle	1ebe7f0dde	Add PART_A_NEXT_SESSION.md: clues for decode degeneration debugging	2026-06-03 04:34:28 +00:00
biondizzle	d8306be3f2	Fix PART A test: proper FP8 quantization and MQA reference	2026-06-03 04:20:36 +00:00
biondizzle	4126909dfb	Simplify PART A test: compressor + FMHA at production scale	2026-06-03 04:18:13 +00:00
biondizzle	8c54cfa748	Fix KVCache init in PART A test	2026-06-03 04:15:41 +00:00
biondizzle	04cf8ca848	Add PART A diagnostic tests: compressor + KV cache + FMHA at production scale	2026-06-03 04:13:53 +00:00
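
The comp_pos fix described in commits 5003e756e2 and 6a3374da18 can be illustrated with a small sketch. It is not the repository's code: the helper names are hypothetical, and the ratios 4 (CSA) and 128 (HCA) are inferred from the "off by r-1: 3 for CSA, 127 for HCA" note in the commit message. It only shows that the block-aligned first position and the last position of a block differ by r-1.

```python
def first_position_of_block(position: int, ratio: int) -> int:
    """Block-aligned position: first token position of the block containing `position`.

    Mirrors the (position // ratio) * ratio formula quoted in the commit message.
    """
    return (position // ratio) * ratio


def last_position_of_block(block_index: int, ratio: int) -> int:
    """Last token position of block `block_index` (the pre-fix behaviour)."""
    return (block_index + 1) * ratio - 1


# Ratios are assumptions inferred from the commit text (r-1 = 3 and r-1 = 127).
for name, ratio in (("CSA", 4), ("HCA", 128)):
    for bi in range(3):
        first = bi * ratio
        last = last_position_of_block(bi, ratio)
        # Every position inside the block maps back to the same aligned position.
        assert first_position_of_block(last, ratio) == first
        # The old and new choices differ by exactly ratio - 1.
        assert last - first == ratio - 1
```

Since RoPE rotates by an angle proportional to the position, an offset of r-1 positions changes the rotation applied to every compressed entry, which is presumably why the commit flags it as critical.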

2330 Commits
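
The SwiGLU clamp-ordering bug fixed in commit 3320abfe24 can be shown on scalars. The following is a minimal sketch, not the code in moe.py or fused_swiglu.py: the function names are hypothetical, the limit value 7.0 is an arbitrary illustration, and the symmetric clamp on `up` is an assumption (the commit message only writes `clamp(up)`).

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_clamp_then_silu(gate: float, up: float, limit: float) -> float:
    """Ordering described as correct in the commit: clamp the raw gate, then apply SiLU."""
    gate = min(gate, limit)              # clamp(gate, max=limit)
    up = max(-limit, min(up, limit))     # assumption: symmetric clamp on up
    return silu(gate) * up


def swiglu_silu_then_clamp(gate: float, up: float, limit: float) -> float:
    """Ordering described as buggy in the commit: SiLU first, then clamp the activation."""
    act = min(silu(gate), limit)
    up = max(-limit, min(up, limit))
    return act * up


limit = 7.0  # hypothetical value, for illustration only
fixed = swiglu_clamp_then_silu(10.0, 1.0, limit)
buggy = swiglu_silu_then_clamp(10.0, 1.0, limit)
print(fixed, buggy)  # the two orderings disagree once gate exceeds the limit
```

The two orderings agree while the gate stays below the limit and diverge above it, because silu(limit) is strictly less than limit for any positive limit.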