- Remove hardcoded THINK_START/THINK_END/USER_TOKEN/ASSISTANT_TOKEN IDs
- Import token strings from encoding.deepseek_v4_encoding (official source)
- Resolve IDs via tokenizer.convert_tokens_to_ids() at runtime
- Use parse_message_from_completion_text() for structured output parsing
- No more hand-rolled prompt construction or hardcoded token IDs
- Clean up TEMP: replace old deepseek_v4_ref with dsv4thing.zip reference
- Copied deepseek_v4_encoding.py from vLLM tree to encoding/
- Replaced hand-rolled prompt construction with encode_messages()
- --chat-mode → --thinking-mode (thinking|chat)
- The official encoder handles: BOS, User/Assistant tokens, thinking mode,
tool calls, and all special token placement. It can't drift.
- This is the same code path inference engines will use.
Previous commit added params to forward_layer but forward_attention
(where compressed RoPE is applied) didn't receive them, causing NameError.
Also confirmed from B200 test output: compress_rope_theta=160000 vs
rope_theta=10000 — a 16x difference. The separate cache is essential.
- Fixed comp_pos: (bi*r) block-aligned instead of ((bi+1)*r-1) last-position
- compress_rope_theta: separate rope cache for compressed KV entries
- comp_rope_cos/comp_rope_sin wired to all forward_layer call sites
(prefill chunk loop, decode loop, CUDAGraphDecoder capture)
- forward_layer uses comp_rope caches for compressed RoPE, falls back to normal
- Only single_shot_inference.py modified, no kernel code touched
- CRITICAL BUG FIX: comp_pos was using LAST position of each block (((bi+1)*r-1))
instead of FIRST position (bi*r). Off by r-1: 3 for CSA, 127 for HCA.
vLLM uses (position // ratio) * ratio = block-aligned first position.
- Added compress_rope_theta config support (vLLM uses separate theta for compressed)
- Added comp_rope_cos/comp_rope_sin param to forward_layer (not yet wired through)
Only single_shot_inference.py changed — no kernel code touched.
Base commit: 572bdd2
The intermediate tensor from fused SwiGLU deinterleave is a column slice
(non-contiguous). When T>1, quantize_nvfp4_gpu_fused receives this and
the CUDA kernel crashes with 'input must be contiguous'.
Fix: add is_contiguous() check + .contiguous() in quantize_nvfp4_gpu_fused
and in SharedExpert._run_l2. This is the root cause, not a workaround —
CUDA kernels legitimately require contiguous memory.
- Process prefill tokens in chunks of up to 128 (FMHA T≤128 constraint)
- Each chunk goes through ALL 61 layers before the next chunk
- KV cache append_swa, compressor, indexer all already support T>1
- FMHA dispatches to dsv4_attention_mixed_fp8_prefill for T>1
- For T>128: splits into multiple launches automatically
- mHC, Router, MoE, Nvfp4Linear all handle M>1 natively
- Eliminates ~N_prefill * 61 per-token overhead from the old loop
The kernel was using head strides for the T (query row) dimension,
which happened to work for T=1 (qr=0 always) but was wrong for T>1.
For (B,H,T,NOPE) layout:
- Head stride = T*NOPE, but T stride = NOPE
- Scale head stride = T, but T stride = 1
- RoPE head stride = T*ROPE, but T stride = ROPE
Added q_nope_t_stride, q_scale_t_stride, q_rope_t_stride to params
struct, C API, and Python wrapper.
Replace complex n_sub-iterating read with the same HD/8 iteration
pattern as the proven decode kernel. Extract from lane qr%32 instead
of always lane 0. For qr>=32, use warp 1; for qr>=64, add TMEM offset.
This should fix the row 1 accuracy issue (was cos=0.94 vs decode).