- q_a is needed by the indexer in CSA layers
- When q_heads/kv_3d are provided (graph replay), the projection code is
skipped so q_a is never computed
- Fix: add q_a_bufs to CUDAGraphDecoder, write q_a during Graph A capture,
pass q_a as kwarg to forward_attention during graph replay
- Also: forward_attention now accepts q_a kwarg (default None)
The forward_attention() signature has no token_id parameter, but the
graph replay path was passing dec_tid32_per_gpu[gpu] between positions
and compressor — causing the int tensor to be interpreted as compressor
and triggering AttributeError: 'int' object has no attribute 'ratio'
CUDAGraphDecoder now splits each layer into two graph-captured regions
with eager attention in between:
Graph A (pre-attention): mHC pre_block + fused RMSNorm + quantize
+ q_a/q_b/kv projections
→ writes intermediates to pre-allocated buffers
Eager (attention): Compressor → Indexer → FMHA → o_proj
→ dynamic shapes, data-dependent control flow
Graph B (post-attention): mHC post_block + FFN + Router + MoE + SE
→ writes X_next to pre-allocated output buffer
The attention path has dynamic shapes (FMHA seq_len grows, compressor
returns None) and cannot be captured. The compute path has fixed shapes
for T=1 decode and CAN be captured.
Changes:
- CUDAGraphDecoder: 2 graphs per layer (A/B) + lm_head graph
- Pre-allocated intermediate buffers for graph A → eager → graph B boundary
- forward_attention: accepts optional q_heads/kv_3d to skip projections
- Replay loop: graph A → eager attention → graph B per layer
This replaces the single-graph-per-layer approach which failed at L1+
because the attention path contains data-dependent control flow and
dynamic shapes that cannot be captured.
Cross-GPU .to() calls inside graph capture cause 'dependency on uncaptured
work in another stream'. Fix: pass dec_pos_per_gpu/dec_tid32_per_gpu to
capture() so each layer's graph uses buffers on its own GPU.
The A/B split approach was too complex: it required splitting forward_layer,
handling the eager FMHA section, and fixing per-GPU buffer issues. The
simpler approach captures the entire forward_layer as one graph per layer,
just like the detector test did for L0.
This works because:
- FMHA pads KV to 128 → fixed shape for graph capture
- Compressor returns None on non-boundary steps → graph captures the path
taken during warmup (typically the None path for HCA r=128)
- All sync violations were already fixed in previous commits
The capture still uses dec_pos_buf/dec_tid32_buf on cuda:0 (forward_layer
handles device transfer internally).
Fixes from running Section A detector on B200:
1. single_shot_inference.py: Use pinned CPU buffers for token/position transfer
- dec_tid_buf[0] = python_int causes CPU→GPU sync
- Fixed: write to pinned CPU buffer, then copy_ (async, graph-capturable)
2. grouped_linear.py: Fix expert_offsets Python loop
- expert_offsets[g] = python_int * padded_rows → CPU→GPU sync per iteration
- Fixed: element-wise multiply with pre-allocated range tensor (GPU-only)
3. grouped_linear.py: Vectorized output extraction for T=1 decode
- Python loop z[:, g, :] = out[...] → CPU sync for each slice
- Fixed: GPU gather with pre-computed indices for T=1
4. grouped_linear.py: Pre-allocate output buffer
- torch.empty() per call → allocation inside graph
- Fixed: use self._output_buf (pre-allocated at max size)
5. grouped_linear.py: Pre-allocate expert_offsets_range_buf
- torch.arange() per call → allocation inside graph
- Fixed: compute once at init, reuse via element-wise multiply
The CUDA dequantize_nvfp4 (dsv4/ops/quantize.py) was designed for
activations/KV and assumes row-major (M, N/16) scale layout. Using it
for weight dequantization caused async illegal memory access because
weight scales don't match the kernel's expected layout. The kernel only
validates row count, not width or contiguity.
All 4 call sites now use the PyTorch dequant_nvfp4 (defined in
single_shot_inference.py) which handles weight_scale_2 and input_scale
correctly and cannot cause OOB access:
- Compressor.load: kv_proj, gate_proj
- Indexer.load: weights_proj
- Router gate dequantization in main()
For weight dequantization, gsa should be weight_scale_2 only.
input_scale is the activation global scale — it belongs on the GEMM's
activation side, not the weight side. Using input_scale * ws2 gave
gsa = 6e-8 (essentially zero), making dequantized weights ~0.
The GEMM formula is y = (x * scale_a * gsa) @ (w * scale_b * gsb)
where gsb = input_scale * ws2. But dequantize_nvfp4 is just the
weight half: w_bf16 = lut[w] * block_scale * ws2.
The NVFP4 dequantize formula is w = lut[w_packed] * scale * ws2,
and in the GEMM the global_scale_b = input_scale * ws2. Was incorrectly
using gsb = 1.0 * ws2 (missing input_scale). This would produce
wrongly-scaled BF16 weights from dequantize_nvfp4.
Only the CSA indexer QK path (q_b_proj) is explicitly FP4-QATed.
The rest of the compressor/indexer projections are NOT, so use BF16:
- Compressor kv_proj, gate_proj: dequantize NVFP4 → BF16, F.linear
- Indexer weights_proj: dequantize NVFP4 → BF16, F.linear
- Indexer q_b_proj: KEEP as NVFP4 (this IS the FP4-QATed path)
- Indexer compressor: inherits Compressor's BF16 path
Same as what worked before. The checkpoint stores NVFP4 weights, so we
dequantize once at load time and use cuBLAS F.linear. No FP8 re-quantize
step needed — that was just adding noise on top of the NVFP4 dequant.
lm_head: BF16 F.linear (checkpoint weight is BF16, no quantization)
Router gate: FP8_E4M3 quantize→dequantize round-trip, then F.linear
- Dequantize NVFP4 checkpoint weights to BF16 first
- Quantize to FP8_E4M3 (scale = amax/448)
- Dequantize back to BF16 for F.linear
- Uses BF16 dispatch path in dense_router_dispatch
- Simpler scale wiring than NVFP4 (single per-tensor scale)
- Remove hardcoded THINK_START/THINK_END/USER_TOKEN/ASSISTANT_TOKEN IDs
- Import token strings from encoding.deepseek_v4_encoding (official source)
- Resolve IDs via tokenizer.convert_tokens_to_ids() at runtime
- Use parse_message_from_completion_text() for structured output parsing
- No more hand-rolled prompt construction or hardcoded token IDs
- Clean up TEMP: replace old deepseek_v4_ref with dsv4thing.zip reference
- Copied deepseek_v4_encoding.py from vLLM tree to encoding/
- Replaced hand-rolled prompt construction with encode_messages()
- --chat-mode → --thinking-mode (thinking|chat)
- The official encoder handles: BOS, User/Assistant tokens, thinking mode,
tool calls, and all special token placement. It can't drift.
- This is the same code path inference engines will use.
Previous commit added params to forward_layer but forward_attention
(where compressed RoPE is applied) didn't receive them, causing NameError.
Also confirmed from B200 test output: compress_rope_theta=160000 vs
rope_theta=10000 — a 16x difference. The separate cache is essential.
- Fixed comp_pos: (bi*r) block-aligned instead of ((bi+1)*r-1) last-position
- compress_rope_theta: separate rope cache for compressed KV entries
- comp_rope_cos/comp_rope_sin wired to all forward_layer call sites
(prefill chunk loop, decode loop, CUDAGraphDecoder capture)
- forward_layer uses comp_rope caches for compressed RoPE, falls back to normal
- Only single_shot_inference.py modified, no kernel code touched
- CRITICAL BUG FIX: comp_pos was using LAST position of each block (((bi+1)*r-1))
instead of FIRST position (bi*r). Off by r-1: 3 for CSA, 127 for HCA.
vLLM uses (position // ratio) * ratio = block-aligned first position.
- Added compress_rope_theta config support (vLLM uses separate theta for compressed)
- Added comp_rope_cos/comp_rope_sin param to forward_layer (not yet wired through)
Only single_shot_inference.py changed — no kernel code touched.
Base commit: 572bdd2
- Process prefill tokens in chunks of up to 128 (FMHA T≤128 constraint)
- Each chunk goes through ALL 61 layers before the next chunk
- KV cache append_swa, compressor, indexer all already support T>1
- FMHA dispatches to dsv4_attention_mixed_fp8_prefill for T>1
- For T>128: splits into multiple launches automatically
- mHC, Router, MoE, Nvfp4Linear all handle M>1 natively
- Eliminates ~N_prefill * 61 per-token overhead from the old loop
- New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu
- Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4
- Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3
- TMEM read using 16x256b.x1 (4-warps parallel, proven from B1 FMHA)
- On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k
- No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores
- Per-thread register heap top-k (same algorithm as indexer_score_topk.cu)
- Modified: single_shot_inference.py
- Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16)
- Consumes FP8 indexer keys from cache without BF16 dequantization
- Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode)
- FP32 einsum fallback retained only for T>1 (prefill)
- Removed 'Intentional first-pass limits' section from B1 doc
(those limits ARE the correct production design, not shortcuts)
DSV4 is a reasoning model. The standard prompt format is:
BOS <|User|> prompt <|Assistant|> ◇
Without the ◇ priming, the model is out-of-distribution — it expects to
be inside a thinking block but never received the sentinel. This causes
degenerate output from step 0 (France instead of Paris, looping on
newlines/repeated tokens).
With ◇, the model will:
1. Generate thinking content (reasoning)
2. Emit ◇ (think_end=128822) to close the thinking block
3. Produce the actual answer
4. Emit EOS (token 1)
This matches the pattern described in the Kimi K2 accuracy blog:
https://vllm.ai/blog/2025-10-28-kimi-k2-accuracy — malformed
prompt formatting is the #1 cause of degenerate output in chat-tuned
reasoning models.
Previously only stopped on tokenizer.eos_token_id. DSV4 uses special
turn-end tokens (<|end_of_sentence|>, USER_TOKEN=128803) that indicate
the assistant turn is complete. Missing these caused decode to continue
past the model's natural stopping point, producing degenerate output.
Also increased diagnostic logging (every step for first 20 steps) to
catch turn-end token emissions.
- Added --ab-compare flag to run both fused and unfused paths for first 3 layers
- Compares x_normed, gsa values, FP4 data, and GEMM outputs (q_a, kv)
- Added --no-fused-rmsnorm to disable P4 and use unfused path
- This will help diagnose the correctness regression introduced by P4
We tried NVFP4 (Blackwell native FP4→MMA). Three approaches.
cos=0.995 round-trip seems fine in isolation but 4.5 effective bits
compounds fatally across 61 layers of mHC. FP8_E4M3's 5.3 effective
bits gives cos=0.9997 — that 0.4% difference is the margin between
working and broken. Kernels exist, path is proven, precision isn't.