- Remove hardcoded THINK_START/THINK_END/USER_TOKEN/ASSISTANT_TOKEN IDs
- Import token strings from encoding.deepseek_v4_encoding (official source)
- Resolve IDs via tokenizer.convert_tokens_to_ids() at runtime
- Use parse_message_from_completion_text() for structured output parsing
- No more hand-rolled prompt construction or hardcoded token IDs
- Clean up TEMP: replace old deepseek_v4_ref with dsv4thing.zip reference
- Copied deepseek_v4_encoding.py from vLLM tree to encoding/
- Replaced hand-rolled prompt construction with encode_messages()
- --chat-mode → --thinking-mode (thinking|chat)
- The official encoder handles: BOS, User/Assistant tokens, thinking mode,
tool calls, and all special token placement. It can't drift.
- This is the same code path inference engines will use.
Previous commit added params to forward_layer but forward_attention
(where compressed RoPE is applied) didn't receive them, causing NameError.
Also confirmed from B200 test output: compress_rope_theta=160000 vs
rope_theta=10000 — a 16x difference. The separate cache is essential.
- Fixed comp_pos: (bi*r) block-aligned instead of ((bi+1)*r-1) last-position
- compress_rope_theta: separate rope cache for compressed KV entries
- comp_rope_cos/comp_rope_sin wired to all forward_layer call sites
(prefill chunk loop, decode loop, CUDAGraphDecoder capture)
- forward_layer uses comp_rope caches for compressed RoPE, falls back to normal
- Only single_shot_inference.py modified, no kernel code touched
- CRITICAL BUG FIX: comp_pos was using LAST position of each block (((bi+1)*r-1))
instead of FIRST position (bi*r). Off by r-1: 3 for CSA, 127 for HCA.
vLLM uses (position // ratio) * ratio = block-aligned first position.
- Added compress_rope_theta config support (vLLM uses separate theta for compressed)
- Added comp_rope_cos/comp_rope_sin param to forward_layer (not yet wired through)
Only single_shot_inference.py changed — no kernel code touched.
Base commit: 572bdd2
- Process prefill tokens in chunks of up to 128 (FMHA T≤128 constraint)
- Each chunk goes through ALL 61 layers before the next chunk
- KV cache append_swa, compressor, indexer all already support T>1
- FMHA dispatches to dsv4_attention_mixed_fp8_prefill for T>1
- For T>128: splits into multiple launches automatically
- mHC, Router, MoE, Nvfp4Linear all handle M>1 natively
- Eliminates ~N_prefill * 61 per-token overhead from the old loop
- New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu
- Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4
- Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3
- TMEM read using 16x256b.x1 (4-warps parallel, proven from B1 FMHA)
- On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k
- No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores
- Per-thread register heap top-k (same algorithm as indexer_score_topk.cu)
- Modified: single_shot_inference.py
- Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16)
- Consumes FP8 indexer keys from cache without BF16 dequantization
- Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode)
- FP32 einsum fallback retained only for T>1 (prefill)
- Removed 'Intentional first-pass limits' section from B1 doc
(those limits ARE the correct production design, not shortcuts)
DSV4 is a reasoning model. The standard prompt format is:
BOS <|User|> prompt <|Assistant|> ◇
Without the ◇ priming, the model is out-of-distribution — it expects to
be inside a thinking block but never received the sentinel. This causes
degenerate output from step 0 (France instead of Paris, looping on
newlines/repeated tokens).
With ◇, the model will:
1. Generate thinking content (reasoning)
2. Emit ◇ (think_end=128822) to close the thinking block
3. Produce the actual answer
4. Emit EOS (token 1)
This matches the pattern described in the Kimi K2 accuracy blog:
https://vllm.ai/blog/2025-10-28-kimi-k2-accuracy — malformed
prompt formatting is the #1 cause of degenerate output in chat-tuned
reasoning models.
Previously only stopped on tokenizer.eos_token_id. DSV4 uses special
turn-end tokens (<|end_of_sentence|>, USER_TOKEN=128803) that indicate
the assistant turn is complete. Missing these caused decode to continue
past the model's natural stopping point, producing degenerate output.
Also increased diagnostic logging (every step for first 20 steps) to
catch turn-end token emissions.
- Added --ab-compare flag to run both fused and unfused paths for first 3 layers
- Compares x_normed, gsa values, FP4 data, and GEMM outputs (q_a, kv)
- Added --no-fused-rmsnorm to disable P4 and use unfused path
- This will help diagnose the correctness regression introduced by P4
We tried NVFP4 (Blackwell native FP4→MMA). Three approaches.
cos=0.995 round-trip seems fine in isolation but 4.5 effective bits
compounds fatally across 61 layers of mHC. FP8_E4M3's 5.3 effective
bits gives cos=0.9997 — that 0.4% difference is the margin between
working and broken. Kernels exist, path is proven, precision isn't.
_apply_rope now uses dsv4.ops.rope_cuda (1 CUDA kernel per call)
instead of PyTorch ops (5-6 kernels per call).
Total: 183 RoPE calls × (5-1) = 732 launches saved per token.
With fallback to PyTorch if CUDA kernel fails.
P0/P1: The fused SwiGLU kernel's warmup_fused_swiglu_compilation() triggers
'TypeError: too many positional arguments' during cute.compile(). The kernel
signature doesn't match the positional args being passed. This is a kernel-side
fix, not a single_shot fix. Disabled until the fused kernel is debugged.
P2: Landed — Nvfp4Linear skips redundant _gsa_buf.fill_() after warmup.
SE fused SwiGLU infrastructure (set_fused_swiglu, _run_l1_fused, interleaved
weight path) is wired but disabled. Will activate once kernel fix lands.
C1: --max-context CLI flag (default 8192). KVCache.max_comp computed from
(max_context + compress_ratio - 1) // ratio per layer type.
CSA at 8192 context → 2048 entries. HCA at 8192 → 64 entries.
No more hardcoded 65536 that wastes memory on HCA layers.
C2: Pre-allocated gather_buf (indexer_top_k + window_size, hd) in KVCache.
Gather writes compressed+SWA into this buffer via slice assignment.
Zero torch.cat allocations on the hot decode path.
C3: get_swa returns views (no .clone()). Ring-buffer wrap returns indexed
views. Caller copies into gather_buf so no aliasing risk.
The indexer silently returning None caused CSA layers to attend over only the
SWA window (128 tokens), not the compressed sparse KV. This went undetected
because the model still produced plausible output at short context. The assert
makes any future indexer regression immediately visible.
1. Indexer.load: weights at *.indexer.kv_proj not *.indexer.compressor.kv_proj
2. KVCache.comp_idx_buf: width=ihd (128) not head_dim (512); parametric via indexer_key_dim
3. Indexer.forward: stored keys are (n_comp, ihd) not (n_comp, n_ih, ihd);
einsum changed from 'tnd,cnd->tnc' to 'tnd,cd->tnc' — key shared across indexer heads
(paper's c_I = ihd = 128, one vector per compressed block)
Also removed probe diagnostics (COMPRESSOR BUFFERING, COMPRESSOR OUT, INDEXER SKIP,
RESHAPE FAILURE, indexer load state) — served their purpose.
- grouped_linear.py: Replace .item() gsa + Python quantize with
quantize_nvfp4_gpu_fused (zero CPU syncs). Flatten all groups
into (G*T, D), single fused kernel launch, GPU-only gsa copy.
- single_shot_inference.py: Reduce torch.cuda.synchronize() to
every 20 steps instead of every step. Gate per-layer diagnostics
to li<3 or li>=58 (avoid 61 .item() calls per decode step).
NVFP4 quantize_from_buffer produces CUDA error on large-magnitude
inputs (|X|>500 at L60 output). BF16 lm_head is correct and only
runs once per decode step — not a bottleneck.
TODO: debug the NVFP4 path for large activations and re-enable.