dense_router_dispatch expects W_gate as (hidden, experts) and does
W_gate.T internally. dequant_nvfp4 returns (out, in) = (E, H), so
the dequantized weight must be transposed to (H, E) before dispatch.
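A minimal sketch of the layout mismatch, using nested lists in place of tensors. `router_logits` is a hypothetical stand-in for the shape contract of dense_router_dispatch, not its actual implementation, and the sizes (E=3, H=2) are illustrative.

```python
def transpose(m):
    """Swap the two axes of a row-major nested list."""
    return [list(col) for col in zip(*m)]


def router_logits(x, w_gate):
    """Hypothetical stand-in: x is (T, H), w_gate must be (H, E); returns (T, E)."""
    hidden = len(x[0])
    if len(w_gate) != hidden:
        raise ValueError(
            f"W_gate must be (hidden, experts): got {len(w_gate)} rows, expected {hidden}"
        )
    experts = len(w_gate[0])
    return [
        [sum(row[h] * w_gate[h][e] for h in range(hidden)) for e in range(experts)]
        for row in x
    ]


# The dequantized weight arrives as (out, in) = (E, H); here E=3, H=2.
w_eh = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [[2.0, 5.0]]  # one token, H=2

# Transposing to (H, E) satisfies the contract; passing w_eh directly raises.
logits = router_logits(x, transpose(w_eh))
```

Passing the (E, H) weight without the transpose fails the shape check whenever E differs from H, which is the case the commit guards against.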
Dequantize NVFP4 gate weight to BF16 for router. Wrong top-k expert
selection from NVFP4 quantization noise is much worse than a small GEMM
error — one wrong expert poisons the whole token.
Also: lm_head already BF16 from previous commit.
Only template/tokenizer/parser changes — no SwiGLU or compressor fixes.
Official encoding module, chat mode, batched prefill, stop set,
official parser for structured output.
Checkpoint lm_head.weight is BF16 (129280x7168). NVFP4 quantization
of lm_head can flatten/reorder vocabulary logits, and errors in
the final projection directly corrupt token selection. BF16 preserves
the full logit distribution.
Tagged pure-nvfp4 before this change for rollback.
1. Compressor positional bias was being added to BOTH gate (softmax logit)
AND KV content. Per paper eq. 9-12, position bias is only for the
softmax logits (Z+B), NOT the KV content (C). Adding pb to kv_val
corrupts every compressed KV entry with learned positional-bias content.
Fixed in both CSA and HCA paths in compressor_reduce.cu.
2. SwiGLU clamp ordering: code was clamping silu(gate) instead of clamping
raw gate before SiLU. Per paper §4.2.3: gate = clamp(gate, max=limit),
then silu(clamp(gate)) * clamp(up). Fixed in moe.py (both unfused
paths) and fused_swiglu.py (CuTeDSL kernel). shared_expert.py was
already correct.
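The two fixes above can be sketched in scalar Python. This is an illustration under stated assumptions, not the kernel code: the limit value 7.0 is made up, the symmetric [-limit, limit] bounds on `up` are an assumption (the message only says clamp(up)), and `compress_block` uses one scalar gate logit per position for brevity. All function names are hypothetical.

```python
import math


def silu(x):
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def clamp(x, lo, hi):
    return max(lo, min(x, hi))


def swiglu_fixed(gate, up, limit):
    """Fixed ordering: clamp the raw gate first, then apply SiLU."""
    g = min(gate, limit)          # clamp(gate, max=limit)
    u = clamp(up, -limit, limit)  # assumed symmetric bounds for up
    return silu(g) * u


def swiglu_old(gate, up, limit):
    """Pre-fix ordering: SiLU first, clamp afterwards."""
    return min(silu(gate), limit) * clamp(up, -limit, limit)


def compress_block(z, b, c):
    """Softmax over (Z + B), then a weighted sum of the unbiased content C."""
    logits = [zi + bi for zi, bi in zip(z, b)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(c[0])
    # The positional bias b only shapes the weights; it is never added to c.
    return [sum(w * row[d] for w, row in zip(weights, c)) for d in range(dim)]
```

For a large gate the two SwiGLU orderings disagree: the fixed version yields silu(limit) * up, while the old one saturates at limit * up. In `compress_block` the output always stays inside the range spanned by the content rows, whatever the bias is.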
- Remove hardcoded THINK_START/THINK_END/USER_TOKEN/ASSISTANT_TOKEN IDs
- Import token strings from encoding.deepseek_v4_encoding (official source)
- Resolve IDs via tokenizer.convert_tokens_to_ids() at runtime
- Use parse_message_from_completion_text() for structured output parsing
- No more hand-rolled prompt construction or hardcoded token IDs
- Clean up TEMP: replace old deepseek_v4_ref with dsv4thing.zip reference
- Copied deepseek_v4_encoding.py from vLLM tree to encoding/
- Replaced hand-rolled prompt construction with encode_messages()
- --chat-mode → --thinking-mode (thinking|chat)
- The official encoder handles BOS, User/Assistant tokens, thinking mode,
tool calls, and all special-token placement, so the prompt format cannot
drift from the reference.
- This is the same code path inference engines will use.
Previous commit added params to forward_layer but forward_attention
(where compressed RoPE is applied) didn't receive them, causing NameError.
Also confirmed from B200 test output: compress_rope_theta=160000 vs
rope_theta=10000 — a 16x difference. The separate cache is essential.
- Fixed comp_pos: (bi*r) block-aligned instead of ((bi+1)*r-1) last-position
- compress_rope_theta: separate rope cache for compressed KV entries
- comp_rope_cos/comp_rope_sin wired to all forward_layer call sites
(prefill chunk loop, decode loop, CUDAGraphDecoder capture)
- forward_layer uses comp_rope caches for compressed RoPE, falls back to the normal rope cache
- Only single_shot_inference.py modified, no kernel code touched
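A rough sketch of the separate cache and the fallback, assuming the standard RoPE inverse-frequency formula theta^(-2i/d). `rope_cache` and `pick_rope_cache` are hypothetical helpers, head_dim=8 and the positions are illustrative, and the real comp_rope_cos/comp_rope_sin tensors may be laid out differently.

```python
import math


def rope_cache(positions, head_dim, theta):
    """Return (cos, sin) tables: one row per position, head_dim // 2 columns."""
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in positions]
    sin = [[math.sin(p * f) for f in inv_freq] for p in positions]
    return cos, sin


def pick_rope_cache(normal, comp=None):
    """Use the compressed-KV cache when provided, otherwise the normal one."""
    return comp if comp is not None else normal


comp_pos = [0, 4, 8]  # illustrative block-aligned positions

normal = rope_cache(comp_pos, head_dim=8, theta=10000.0)
comp = rope_cache(comp_pos, head_dim=8, theta=160000.0)

cos_table, sin_table = pick_rope_cache(normal, comp)
```

With different theta values the two tables agree only on the first frequency (which is always 1.0), so rotating compressed entries with the normal cache would apply the wrong angles to every other channel pair.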
- CRITICAL BUG FIX: comp_pos was using the LAST position of each block, (bi+1)*r-1,
instead of the FIRST position, bi*r. Off by r-1: 3 for CSA (r=4), 127 for HCA (r=128).
vLLM uses (position // ratio) * ratio = block-aligned first position.
- Added compress_rope_theta config support (vLLM uses a separate theta for compressed KV entries)
- Added comp_rope_cos/comp_rope_sin param to forward_layer (not yet wired through)
Only single_shot_inference.py changed — no kernel code touched.
Base commit: 572bdd2
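The comp_pos change can be checked with a few lines of plain Python. The function names are hypothetical; the ratios 4 and 128 follow from the stated r-1 offsets of 3 and 127.

```python
def comp_pos_old(num_blocks, r):
    """Pre-fix: last position of each block."""
    return [(bi + 1) * r - 1 for bi in range(num_blocks)]


def comp_pos_new(num_blocks, r):
    """Post-fix: first (block-aligned) position of each block."""
    return [bi * r for bi in range(num_blocks)]


def block_aligned(position, ratio):
    """The vLLM-style formula quoted above: (position // ratio) * ratio."""
    return (position // ratio) * ratio


for r in (4, 128):
    old = comp_pos_old(3, r)
    new = comp_pos_new(3, r)
    # Every entry was off by exactly r - 1.
    assert all(o - n == r - 1 for o, n in zip(old, new))
    # Block-aligning any position inside a block lands on the new value.
    assert [block_aligned(p, r) for p in old] == new
```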
The intermediate tensor from fused SwiGLU deinterleave is a column slice
(non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and
the CUDA kernel crashes with 'input must be contiguous'.
Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused
and in SharedExpert._run_l2. This is the root cause, not a workaround —
CUDA kernels legitimately require contiguous memory.
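The same situation can be reproduced with NumPy as an analogue (an assumption made so the example runs without a GPU; in PyTorch the corresponding calls are Tensor.is_contiguous() and Tensor.contiguous()). `ensure_contiguous` is a hypothetical helper and the shapes are illustrative.

```python
import numpy as np


def ensure_contiguous(a):
    """Return `a` unchanged if it is C-contiguous, else a contiguous copy."""
    return a if a.flags["C_CONTIGUOUS"] else np.ascontiguousarray(a)


T, width = 4, 6
fused = np.arange(T * width, dtype=np.float32).reshape(T, width)

# A column slice is a strided view into the wider buffer, not a packed array.
half = fused[:, : width // 2]

fixed = ensure_contiguous(half)
```

The check keeps the already-contiguous case free of copies, so only the sliced intermediate pays for the extra allocation.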
- Process prefill tokens in chunks of up to 128 (FMHA T≤128 constraint)
- Each chunk goes through ALL 61 layers before the next chunk
- KV cache append_swa, compressor, indexer all already support T>1
- FMHA dispatches to dsv4_attention_mixed_fp8_prefill for T>1
- For T>128: splits into multiple launches automatically
- mHC, Router, MoE, Nvfp4Linear all handle M>1 natively
- Replaces the ~N_prefill * 61 per-token layer calls of the old loop with
ceil(N_prefill / 128) * 61 chunked calls
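The chunk loop described above can be sketched as follows. `prefill_chunks` and `run_prefill` are hypothetical names, and the recorded tuples stand in for the real forward_layer calls.

```python
def prefill_chunks(n_tokens, chunk_size=128):
    """Split [0, n_tokens) into half-open ranges of at most chunk_size tokens."""
    return [
        (start, min(start + chunk_size, n_tokens))
        for start in range(0, n_tokens, chunk_size)
    ]


def run_prefill(n_tokens, n_layers=61, chunk_size=128):
    """Record the (start, end, layer) order in which layers would be invoked."""
    calls = []
    for start, end in prefill_chunks(n_tokens, chunk_size):
        # Each chunk passes through every layer before the next chunk starts.
        for layer in range(n_layers):
            calls.append((start, end, layer))
    return calls


calls = run_prefill(300)
```

For a 300-token prompt this gives 3 chunks and 3 * 61 = 183 layer calls, compared with 300 * 61 = 18300 for a per-token loop.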