Commit Graph

2330 Commits

Author SHA1 Message Date
2866eb92e7 Fix W_gate device: ensure .to(dev) after transpose 2026-06-03 12:56:52 +00:00
bd10bdbbd9 Fix router gate W_gate shape: must be (H, E) not (E, H)
dense_router_dispatch expects W_gate as (hidden, experts) and does
W_gate.T internally. dequant_nvfp4 returns (out, in) = (E, H), so
we need to transpose.
2026-06-03 12:42:17 +00:00
dc5a24687e Switch router gate from NVFP4 to BF16 (dequantize)
Dequantize NVFP4 gate weight to BF16 for router. Wrong top-k expert
selection from NVFP4 quantization noise is much worse than a small GEMM
error — one wrong expert poisons the whole token.

Also: lm_head already BF16 from previous commit.
2026-06-03 12:30:34 +00:00
cfea22cd6f Update PyTorch reference with official DSV4 encoding + batched prefill
Only template/tokenizer/parser changes — no SwiGLU or compressor fixes.
Official encoding module, chat mode, batched prefill, stop set,
official parser for structured output.
2026-06-03 12:19:38 +00:00
bdd9ab9669 Switch lm_head from NVFP4 to BF16 GEMM
Checkpoint lm_head.weight is BF16 (129280x7168). NVFP4 quantization
of lm_head can flatten/reorder vocabulary logits — wrong experts in
the final projection directly corrupt token selection. BF16 preserves
the full logit distribution for accurate token selection.

Tagged pure-nvfp4 before this change for rollback.
2026-06-03 11:37:41 +00:00
3320abfe24 Fix two correctness bugs: compressor pos bias on KV + SwiGLU clamp ordering
1. Compressor positional bias was being added to BOTH gate (softmax logit)
   AND KV content. Per paper eq. 9-12, position bias is only for the
   softmax logits (Z+B), NOT the KV content (C). Adding pb to kv_val
   corrupts every compressed KV entry with learned positional-bias content.
   Fixed in both CSA and HCA paths in compressor_reduce.cu.

2. SwiGLU clamp ordering: code was clamping silu(gate) instead of clamping
   raw gate before SiLU. Per paper §4.2.3: gate = clamp(gate, max=limit),
   then silu(clamp(gate)) * clamp(up). Fixed in moe.py (both unfused
   paths) and fused_swiglu.py (CuTeDSL kernel). shared_expert.py was
   already correct.
pure-nvfp4
2026-06-03 11:17:49 +00:00
7901470e63 doc clean up v-official-encoding-path 2026-06-03 10:53:41 +00:00
ca7c309463 Add reference/ dir: vLLM tokenizers, reasoning parsers, tool parsers, official inference
- reference/vllm/tokenizers/ — official DSV4 tokenizer + encoding (read-only)
- reference/vllm/reasoning/ — thinking mode parsers (DeepSeekR1 style )
- reference/vllm/tool_parsers/ — DSML tool call parsers (V3.2 base, V4 variant)
- reference/official_inference/ — original weight's generate.py, model.py, kernel.py
- reference/README.md documents the layout and which files matter for our pipeline
- These are read-only references for cross-checking, not imported by production code
2026-06-03 10:25:23 +00:00
8cfc1cae58 Canonical encoding: derive special token IDs from official encoding module + tokenizer
- Remove hardcoded THINK_START/THINK_END/USER_TOKEN/ASSISTANT_TOKEN IDs
- Import token strings from encoding.deepseek_v4_encoding (official source)
- Resolve IDs via tokenizer.convert_tokens_to_ids() at runtime
- Use parse_message_from_completion_text() for structured output parsing
- No more hand-rolled prompt construction or hardcoded token IDs
- Clean up TEMP: replace old deepseek_v4_ref with dsv4thing.zip reference
2026-06-03 10:23:02 +00:00
a86d6d90a5 Replace hand-rolled prompt with official DSV4 encoder (canonical path)
- Copied deepseek_v4_encoding.py from vLLM tree to encoding/
- Replaced hand-rolled prompt construction with encode_messages()
- --chat-mode → --thinking-mode (thinking|chat)
- The official encoder handles: BOS, User/Assistant tokens, thinking mode,
  tool calls, and all special token placement. It can't drift.
- This is the same code path inference engines will use.
2026-06-03 09:59:05 +00:00
284fc9ca86 Fix: thread comp_rope_cos/comp_rope_sin through forward_attention
Previous commit added params to forward_layer but forward_attention
(where compressed RoPE is applied) didn't receive them, causing NameError.

Also confirmed from B200 test output: compress_rope_theta=160000 vs
rope_theta=10000 — a 16x difference. The separate cache is essential.
2026-06-03 09:30:57 +00:00
6a3374da18 Cross-check 2 complete: block-aligned comp_pos + compress_rope_theta wired through
- Fixed comp_pos: (bi*r) block-aligned instead of ((bi+1)*r-1) last-position
- compress_rope_theta: separate rope cache for compressed KV entries
- comp_rope_cos/comp_rope_sin wired to all forward_layer call sites
  (prefill chunk loop, decode loop, CUDAGraphDecoder capture)
- forward_layer uses comp_rope caches for compressed RoPE, falls back to normal
- Only single_shot_inference.py modified, no kernel code touched
2026-06-03 09:19:11 +00:00
5003e756e2 WIP: cross-check 2 fix — block-aligned compressed RoPE positions + compress_rope_theta support
- CRITICAL BUG FIX: comp_pos was using LAST position of each block (((bi+1)*r-1))
  instead of FIRST position (bi*r). Off by r-1: 3 for CSA, 127 for HCA.
  vLLM uses (position // ratio) * ratio = block-aligned first position.
- Added compress_rope_theta config support (vLLM uses separate theta for compressed)
- Added comp_rope_cos/comp_rope_sin param to forward_layer (not yet wired through)

Only single_shot_inference.py changed — no kernel code touched.
Base commit: 572bdd2
2026-06-03 09:17:54 +00:00
572bdd2840 auto: pre-test commit 2026-06-03 09:01:02 +00:00
3c06fd5591 Test 2: fix topk tensor shape (flatten before iterating) 2026-06-03 08:47:32 +00:00
89f6e64057 README: document test harness gotchas (timeout arg, stale procs, screen names) 2026-06-03 08:36:02 +00:00
29d6986dd4 Test 2: fix quantize_to_nvfp4 import 2026-06-03 08:21:39 +00:00
60b9bbd470 Test 2: fix import - use mHCLayer from dsv4.layers.mhc, fixed prompt encoding 2026-06-03 08:20:21 +00:00
1e77dfcaa0 Fix prompt encoding: remove \n\n before content per official DSV4 spec; add --chat-mode 2026-06-03 08:19:33 +00:00
2a42686e8e Test 1 v2: diff hand-rolled vs official DSV4 encoding 2026-06-03 08:18:56 +00:00
11c2d5fe53 Add degeneration test 2: falsify mHC residual growth root cause 2026-06-03 08:18:01 +00:00
c77b83fffc Add degeneration test 1: chat-template token-ID diff 2026-06-03 08:17:09 +00:00
c5a131c358 more doc clean up again 2026-06-03 08:14:07 +00:00
019a3a34b7 Clean up L0 B1 verify noise (gate on VERBOSE), update FINAL_STRETCH.md
Batched prefill + T>128 chunking now complete. All dangling items in
FINAL_STRETCH.md are marked done.
2026-06-03 08:12:54 +00:00
5e09be08af Fix non-contiguous tensor in quantize_nvfp4_gpu_fused (T>1 prefill)
The intermediate tensor from fused SwiGLU deinterleave is a column slice
(non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and
the CUDA kernel crashes with 'input must be contiguous'.

Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused
and in SharedExpert._run_l2. This is the root cause, not a workaround —
CUDA kernels legitimately require contiguous memory.
2026-06-03 07:56:19 +00:00
60309ef124 Batched prefill: replace T=1 token-by-token with chunked T≤128 batch processing
- Process prefill tokens in chunks of up to 128 (FMHA T≤128 constraint)
- Each chunk goes through ALL 61 layers before the next chunk
- KV cache append_swa, compressor, indexer all already support T>1
- FMHA dispatches to dsv4_attention_mixed_fp8_prefill for T>1
- For T>128: splits into multiple launches automatically
- mHC, Router, MoE, Nvfp4Linear all handle M>1 natively
- Eliminates ~N_prefill * 61 per-token overhead from the old loop
2026-06-03 07:39:37 +00:00
0bf276f8c9 more doc cleanup 2026-06-03 07:37:13 +00:00
d463ac8512 doc cleanup 2026-06-03 07:34:12 +00:00
7450ebc67a CORRECTNESS_BACKLOG.md: comprehensive production pipeline verification results — all tested and confirmed findings from PART A diagnostics 2026-06-03 07:31:01 +00:00
9dbfac9dfa PART A: verify kv_norm_w loaded correctly 2026-06-03 07:03:39 +00:00
a682c6adf4 PART A: add raw compressor output diagnostic 2026-06-03 06:56:56 +00:00
f2c1b3afd5 PART A: fix KV diagnostics — compute q_a before indexer, add Q_heads magnitude check 2026-06-03 06:33:51 +00:00
86e59c16c5 PART A: add KV gather diagnostics at blowup layer 2026-06-03 06:25:35 +00:00
262f844e2e PART A: add detailed blowup diagnostics — capture mHC intermediate values when |X| > 1e6 2026-06-03 06:10:33 +00:00
6459fbca9a fix: import forward_attention 2026-06-03 05:41:33 +00:00
91dfac34d8 PART A: simplified to production-only diagnostics — track per-layer |X| during prefill and decode, detect blowup early 2026-06-03 05:33:22 +00:00
d99503732d fix: add BF16 gate weight fallback for dense routers (missing from test) 2026-06-03 05:22:47 +00:00
801bfc9a83 add router mode debug print 2026-06-03 05:15:52 +00:00
b385ecc05e PART A: decode diagnostics test — production vs reference per-layer X comparison at decode step 2026-06-03 05:06:40 +00:00
d518fcb82a test: correct sink bias reference — denominator-only, no V contribution 2026-06-03 04:57:37 +00:00
9574a9dc2e test: add sink bias to reference SDPA in decode FMHA comparison 2026-06-03 04:53:55 +00:00
9a9b347b2b test: add per-head magnitude ratio diagnostics to decode FMHA test 2026-06-03 04:50:23 +00:00
f5fa20c581 fix: syntax error — missing closing paren in indexer.forward call 2026-06-03 04:46:41 +00:00
693975ec92 fix: device mismatches in decode FMHA test — dec_pos must be on per-layer GPU 2026-06-03 04:46:24 +00:00
e1d96c509d test: decode FMHA layer comparison — checks FMHA accuracy during decode step 2026-06-03 04:39:12 +00:00
1ebe7f0dde Add PART_A_NEXT_SESSION.md: clues for decode degeneration debugging 2026-06-03 04:34:28 +00:00
d8306be3f2 Fix PART A test: proper FP8 quantization and MQA reference 2026-06-03 04:20:36 +00:00
4126909dfb Simplify PART A test: compressor + FMHA at production scale 2026-06-03 04:18:13 +00:00
8c54cfa748 Fix KVCache init in PART A test 2026-06-03 04:15:41 +00:00
04cf8ca848 Add PART A diagnostic tests: compressor + KV cache + FMHA at production scale 2026-06-03 04:13:53 +00:00