nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	1121cd7b47	Add CUDA_LAUNCH_BLOCKING=1 to catch async errors	2026-06-03 14:48:51 +00:00
biondizzle	f3bb0ca08c	Fix dequant gsa: use ws2 only, NOT input_scale * ws2 For weight dequantization, gsa should be weight_scale_2 only. input_scale is the activation global scale — it belongs on the GEMM's activation side, not the weight side. Using input_scale * ws2 gave gsa = 6e-8 (essentially zero), making dequantized weights ~0. The GEMM formula is y = (x * scale_a * gsa) @ (w * scale_b * gsb) where gsb = input_scale * ws2. But dequantize_nvfp4 is just the weight half: w_bf16 = lut[w] * block_scale * ws2.	2026-06-03 14:38:24 +00:00
biondizzle	470e65fb19	Fix dequant gsb: input_scale * ws2, not 1.0 * ws2 The NVFP4 dequantize formula is w = lut[w_packed] * scale * ws2, and in the GEMM the global_scale_b = input_scale * ws2. Was incorrectly using gsb = 1.0 * ws2 (missing input_scale). This would produce wrongly-scaled BF16 weights from dequantize_nvfp4.	2026-06-03 14:26:59 +00:00
biondizzle	2dd16d5789	Switch compressor + indexer weights_proj to BF16 F.linear Only the CSA indexer QK path (q_b_proj) is explicitly FP4-QATed. The rest of the compressor/indexer projections are NOT, so use BF16: - Compressor kv_proj, gate_proj: dequantize NVFP4 → BF16, F.linear - Indexer weights_proj: dequantize NVFP4 → BF16, F.linear - Indexer q_b_proj: KEEP as NVFP4 (this IS the FP4-QATed path) - Indexer compressor: inherits Compressor's BF16 path	2026-06-03 14:19:41 +00:00
biondizzle	95e45a87e3	Add explicit .to(dev) on W_gate after transpose — belt and suspenders	2026-06-03 14:17:02 +00:00
biondizzle	ef94c48957	Simplify router gate: dequant NVFP4 → BF16, F.linear (no FP8 middleman) Same as what worked before. The checkpoint stores NVFP4 weights, so we dequantize once at load time and use cuBLAS F.linear. No FP8 re-quantize step needed — that was just adding noise on top of the NVFP4 dequant.	2026-06-03 14:14:10 +00:00
biondizzle	715602c87c	Switch lm_head to BF16 + router gate to FP8_E4M3 lm_head: BF16 F.linear (checkpoint weight is BF16, no quantization) Router gate: FP8_E4M3 quantize→dequantize round-trip, then F.linear - Dequantize NVFP4 checkpoint weights to BF16 first - Quantize to FP8_E4M3 (scale = amax/448) - Dequantize back to BF16 for F.linear - Uses BF16 dispatch path in dense_router_dispatch - Simpler scale wiring than NVFP4 (single per-tensor scale)	2026-06-03 14:10:28 +00:00
biondizzle	7901470e63	doc clean up v-official-encoding-path	2026-06-03 10:53:41 +00:00
biondizzle	ca7c309463	Add reference/ dir: vLLM tokenizers, reasoning parsers, tool parsers, official inference - reference/vllm/tokenizers/ — official DSV4 tokenizer + encoding (read-only) - reference/vllm/reasoning/ — thinking mode parsers (DeepSeekR1 style ) - reference/vllm/tool_parsers/ — DSML tool call parsers (V3.2 base, V4 variant) - reference/official_inference/ — original weight's generate.py, model.py, kernel.py - reference/README.md documents the layout and which files matter for our pipeline - These are read-only references for cross-checking, not imported by production code	2026-06-03 10:25:23 +00:00
biondizzle	8cfc1cae58	Canonical encoding: derive special token IDs from official encoding module + tokenizer - Remove hardcoded THINK_START/THINK_END/USER_TOKEN/ASSISTANT_TOKEN IDs - Import token strings from encoding.deepseek_v4_encoding (official source) - Resolve IDs via tokenizer.convert_tokens_to_ids() at runtime - Use parse_message_from_completion_text() for structured output parsing - No more hand-rolled prompt construction or hardcoded token IDs - Clean up TEMP: replace old deepseek_v4_ref with dsv4thing.zip reference	2026-06-03 10:23:02 +00:00
biondizzle	a86d6d90a5	Replace hand-rolled prompt with official DSV4 encoder (canonical path) - Copied deepseek_v4_encoding.py from vLLM tree to encoding/ - Replaced hand-rolled prompt construction with encode_messages() - --chat-mode → --thinking-mode (thinking|chat) - The official encoder handles: BOS, User/Assistant tokens, thinking mode, tool calls, and all special token placement. It can't drift. - This is the same code path inference engines will use.	2026-06-03 09:59:05 +00:00
biondizzle	284fc9ca86	Fix: thread comp_rope_cos/comp_rope_sin through forward_attention Previous commit added params to forward_layer but forward_attention (where compressed RoPE is applied) didn't receive them, causing NameError. Also confirmed from B200 test output: compress_rope_theta=160000 vs rope_theta=10000 — a 16x difference. The separate cache is essential.	2026-06-03 09:30:57 +00:00
biondizzle	6a3374da18	Cross-check 2 complete: block-aligned comp_pos + compress_rope_theta wired through - Fixed comp_pos: (bi*r) block-aligned instead of ((bi+1)*r-1) last-position - compress_rope_theta: separate rope cache for compressed KV entries - comp_rope_cos/comp_rope_sin wired to all forward_layer call sites (prefill chunk loop, decode loop, CUDAGraphDecoder capture) - forward_layer uses comp_rope caches for compressed RoPE, falls back to normal - Only single_shot_inference.py modified, no kernel code touched	2026-06-03 09:19:11 +00:00
biondizzle	5003e756e2	WIP: cross-check 2 fix — block-aligned compressed RoPE positions + compress_rope_theta support - CRITICAL BUG FIX: comp_pos was using LAST position of each block ((bi+1)*r-1) instead of FIRST position (bi*r). Off by r-1: 3 for CSA, 127 for HCA. vLLM uses (position // ratio) * ratio = block-aligned first position. - Added compress_rope_theta config support (vLLM uses separate theta for compressed) - Added comp_rope_cos/comp_rope_sin param to forward_layer (not yet wired through) Only single_shot_inference.py changed — no kernel code touched. Base commit: `572bdd2`	2026-06-03 09:17:54 +00:00
biondizzle	572bdd2840	auto: pre-test commit	2026-06-03 09:01:02 +00:00
biondizzle	3c06fd5591	Test 2: fix topk tensor shape (flatten before iterating)	2026-06-03 08:47:32 +00:00
biondizzle	89f6e64057	README: document test harness gotchas (timeout arg, stale procs, screen names)	2026-06-03 08:36:02 +00:00
biondizzle	29d6986dd4	Test 2: fix quantize_to_nvfp4 import	2026-06-03 08:21:39 +00:00
biondizzle	60b9bbd470	Test 2: fix import - use mHCLayer from dsv4.layers.mhc, fixed prompt encoding	2026-06-03 08:20:21 +00:00
biondizzle	1e77dfcaa0	Fix prompt encoding: remove \n\n before content per official DSV4 spec; add --chat-mode	2026-06-03 08:19:33 +00:00
biondizzle	2a42686e8e	Test 1 v2: diff hand-rolled vs official DSV4 encoding	2026-06-03 08:18:56 +00:00
biondizzle	11c2d5fe53	Add degeneration test 2: falsify mHC residual growth root cause	2026-06-03 08:18:01 +00:00
biondizzle	c77b83fffc	Add degeneration test 1: chat-template token-ID diff	2026-06-03 08:17:09 +00:00
biondizzle	c5a131c358	more doc clean up again	2026-06-03 08:14:07 +00:00
biondizzle	019a3a34b7	Clean up L0 B1 verify noise (gate on VERBOSE), update FINAL_STRETCH.md Batched prefill + T>128 chunking now complete. All dangling items in FINAL_STRETCH.md are marked done.	2026-06-03 08:12:54 +00:00
biondizzle	5e09be08af	Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill) The intermediate tensor from fused SwiGLU deinterleave is a column slice (non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and the CUDA kernel crashes with 'input must be contiguous'. Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused and in SharedExpert._run_l2. This is the root cause, not a workaround — CUDA kernels legitimately require contiguous memory.	2026-06-03 07:56:19 +00:00
biondizzle	60309ef124	Batched prefill: replace T=1 token-by-token with chunked T≤128 batch processing - Process prefill tokens in chunks of up to 128 (FMHA T≤128 constraint) - Each chunk goes through ALL 61 layers before the next chunk - KV cache append_swa, compressor, indexer all already support T>1 - FMHA dispatches to dsv4_attention_mixed_fp8_prefill for T>1 - For T>128: splits into multiple launches automatically - mHC, Router, MoE, Nvfp4Linear all handle M>1 natively - Eliminates ~N_prefill * 61 per-token overhead from the old loop	2026-06-03 07:39:37 +00:00
biondizzle	0bf276f8c9	more doc cleanup	2026-06-03 07:37:13 +00:00
biondizzle	d463ac8512	doc cleanup	2026-06-03 07:34:12 +00:00
biondizzle	7450ebc67a	CORRECTNESS_BACKLOG.md: comprehensive production pipeline verification results — all tested and confirmed findings from PART A diagnostics	2026-06-03 07:31:01 +00:00
biondizzle	9dbfac9dfa	PART A: verify kv_norm_w loaded correctly	2026-06-03 07:03:39 +00:00
biondizzle	a682c6adf4	PART A: add raw compressor output diagnostic	2026-06-03 06:56:56 +00:00
biondizzle	f2c1b3afd5	PART A: fix KV diagnostics — compute q_a before indexer, add Q_heads magnitude check	2026-06-03 06:33:51 +00:00
biondizzle	86e59c16c5	PART A: add KV gather diagnostics at blowup layer	2026-06-03 06:25:35 +00:00
biondizzle	262f844e2e	PART A: add detailed blowup diagnostics — capture mHC intermediate values when |X| > 1e6	2026-06-03 06:10:33 +00:00
biondizzle	6459fbca9a	fix: import forward_attention	2026-06-03 05:41:33 +00:00
biondizzle	91dfac34d8	PART A: simplified to production-only diagnostics — track per-layer |X| during prefill and decode, detect blowup early	2026-06-03 05:33:22 +00:00
biondizzle	d99503732d	fix: add BF16 gate weight fallback for dense routers (missing from test)	2026-06-03 05:22:47 +00:00
biondizzle	801bfc9a83	add router mode debug print	2026-06-03 05:15:52 +00:00
biondizzle	b385ecc05e	PART A: decode diagnostics test — production vs reference per-layer X comparison at decode step	2026-06-03 05:06:40 +00:00
biondizzle	d518fcb82a	test: correct sink bias reference — denominator-only, no V contribution	2026-06-03 04:57:37 +00:00
biondizzle	9574a9dc2e	test: add sink bias to reference SDPA in decode FMHA comparison	2026-06-03 04:53:55 +00:00
biondizzle	9a9b347b2b	test: add per-head magnitude ratio diagnostics to decode FMHA test	2026-06-03 04:50:23 +00:00
biondizzle	f5fa20c581	fix: syntax error — missing closing paren in indexer.forward call	2026-06-03 04:46:41 +00:00
biondizzle	693975ec92	fix: device mismatches in decode FMHA test — dec_pos must be on per-layer GPU	2026-06-03 04:46:24 +00:00
biondizzle	e1d96c509d	test: decode FMHA layer comparison — checks FMHA accuracy during decode step	2026-06-03 04:39:12 +00:00
biondizzle	1ebe7f0dde	Add PART_A_NEXT_SESSION.md: clues for decode degeneration debugging	2026-06-03 04:34:28 +00:00
biondizzle	d8306be3f2	Fix PART A test: proper FP8 quantization and MQA reference	2026-06-03 04:20:36 +00:00
biondizzle	4126909dfb	Simplify PART A test: compressor + FMHA at production scale	2026-06-03 04:18:13 +00:00
biondizzle	8c54cfa748	Fix KVCache init in PART A test	2026-06-03 04:15:41 +00:00
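
The comp_pos fix in commits 5003e756e2 and 6a3374da18 comes down to which position inside a compressed block is used for RoPE. The sketch below is illustrative only and is not code from single_shot_inference.py: the function names are hypothetical, and the ratios r=4 (CSA) and r=128 (HCA) are inferred from the "off by r-1: 3 for CSA, 127 for HCA" remark in the commit message.

```python
# Hypothetical helpers illustrating the comp_pos fix; names are not from the repo.

def block_first_pos(bi: int, r: int) -> int:
    """Block-aligned (first) position of compressed block bi with ratio r."""
    return bi * r


def block_last_pos(bi: int, r: int) -> int:
    """Last position of compressed block bi (what the buggy code used)."""
    return (bi + 1) * r - 1


def block_aligned(position: int, ratio: int) -> int:
    """The form quoted from vLLM in the commit: (position // ratio) * ratio."""
    return (position // ratio) * ratio


# Ratios inferred from the commit message (assumption): CSA r=4, HCA r=128.
for name, r in (("CSA", 4), ("HCA", 128)):
    for bi in range(3):
        first, last = block_first_pos(bi, r), block_last_pos(bi, r)
        # Every position inside the block maps to the same block-aligned value.
        assert all(block_aligned(p, r) == first for p in range(first, last + 1))
        print(f"{name} block {bi}: first={first} last={last} error={last - first}")
```

Under these assumptions the error is constant per block (r-1), which matches the 3 and 127 figures in the commit message.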

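Commit f3bb0ca08c states the weight-side dequantization as w_bf16 = lut[w] * block_scale * ws2, with input_scale kept out of it. A minimal pure-Python sketch of that formula follows. It is not the repository's dequantize_nvfp4: the function name, the low-nibble-first packing order, and plain-float block scales are assumptions made for illustration. The lookup table is the standard FP4 E2M1 value set, and 16 is the NVFP4 block size.

```python
# Sketch of the weight-side NVFP4 dequant formula from the commit message.
# Assumptions (not from the repo): low nibble holds the first element of each
# byte, and block scales are passed as plain floats rather than FP8 values.

E2M1_LUT = [
    0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,          # sign bit 0
    -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0,  # sign bit 1
]
BLOCK = 16  # elements sharing one block scale


def dequantize_nvfp4_sketch(packed: bytes, block_scales: list, ws2: float) -> list:
    """Return lut[w] * block_scale * ws2 for every packed 4-bit weight."""
    out = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):  # assumed packing order
            scale = block_scales[len(out) // BLOCK]
            out.append(E2M1_LUT[nibble] * scale * ws2)
    return out


# One block of 16 weights: nibbles alternate 1 (0.5) and 2 (1.0).
weights = dequantize_nvfp4_sketch(bytes([0x21] * 8), block_scales=[2.0], ws2=0.5)
print(weights[:4])
```

Note that input_scale does not appear anywhere in the function, which is the point of the fix: per the commit message, it belongs to the GEMM's activation side.
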

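The batched prefill described in commit 60309ef124 splits the prompt into chunks of at most 128 tokens and runs each chunk through all 61 layers before starting the next. The following sketch shows only that control flow. The helper names are hypothetical and the layers are stand-in callables, not the repository's forward_layer.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

MAX_CHUNK = 128  # FMHA T<=128 constraint quoted in the commit message
N_LAYERS = 61    # layer count quoted in the commit message


def prefill_chunks(n_tokens: int, max_chunk: int = MAX_CHUNK) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) ranges covering n_tokens in order."""
    for start in range(0, n_tokens, max_chunk):
        yield start, min(start + max_chunk, n_tokens)


def run_prefill(tokens: Sequence[int], layers: List[Callable]) -> int:
    """Run every chunk through all layers; return the number of layer calls."""
    calls = 0
    for start, end in prefill_chunks(len(tokens)):
        x = list(tokens[start:end])
        for layer in layers:  # the whole stack runs before the next chunk
            x = layer(x)
            calls += 1
    return calls


identity_layers = [lambda x: x] * N_LAYERS  # stand-ins for real layers
chunks = list(prefill_chunks(300))
calls = run_prefill(list(range(300)), identity_layers)
print(chunks, calls)
```

For a 300-token prompt this gives three chunks and 3 * 61 layer calls, compared with 300 * 61 for a token-by-token loop.
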
2331 Commits