Commit Graph

220 Commits

Author SHA1 Message Date
8cfc1cae58 Canonical encoding: derive special token IDs from official encoding module + tokenizer
- Remove hardcoded THINK_START/THINK_END/USER_TOKEN/ASSISTANT_TOKEN IDs
- Import token strings from encoding.deepseek_v4_encoding (official source)
- Resolve IDs via tokenizer.convert_tokens_to_ids() at runtime
- Use parse_message_from_completion_text() for structured output parsing
- No more hand-rolled prompt construction or hardcoded token IDs
- Clean up TEMP: replace old deepseek_v4_ref with dsv4thing.zip reference
2026-06-03 10:23:02 +00:00
a86d6d90a5 Replace hand-rolled prompt with official DSV4 encoder (canonical path)
- Copied deepseek_v4_encoding.py from vLLM tree to encoding/
- Replaced hand-rolled prompt construction with encode_messages()
- --chat-mode → --thinking-mode (thinking|chat)
- The official encoder handles: BOS, User/Assistant tokens, thinking mode,
  tool calls, and all special token placement. It can't drift.
- This is the same code path inference engines will use.
2026-06-03 09:59:05 +00:00
284fc9ca86 Fix: thread comp_rope_cos/comp_rope_sin through forward_attention
Previous commit added params to forward_layer but forward_attention
(where compressed RoPE is applied) didn't receive them, causing NameError.

Also confirmed from B200 test output: compress_rope_theta=160000 vs
rope_theta=10000 — a 16x difference. The separate cache is essential.
2026-06-03 09:30:57 +00:00
6a3374da18 Cross-check 2 complete: block-aligned comp_pos + compress_rope_theta wired through
- Fixed comp_pos: (bi*r) block-aligned instead of ((bi+1)*r-1) last-position
- compress_rope_theta: separate rope cache for compressed KV entries
- comp_rope_cos/comp_rope_sin wired to all forward_layer call sites
  (prefill chunk loop, decode loop, CUDAGraphDecoder capture)
- forward_layer uses comp_rope caches for compressed RoPE, falls back to normal
- Only single_shot_inference.py modified, no kernel code touched
2026-06-03 09:19:11 +00:00
5003e756e2 WIP: cross-check 2 fix — block-aligned compressed RoPE positions + compress_rope_theta support
- CRITICAL BUG FIX: comp_pos was using LAST position of each block (((bi+1)*r-1))
  instead of FIRST position (bi*r). Off by r-1: 3 for CSA, 127 for HCA.
  vLLM uses (position // ratio) * ratio = block-aligned first position.
- Added compress_rope_theta config support (vLLM uses separate theta for compressed)
- Added comp_rope_cos/comp_rope_sin param to forward_layer (not yet wired through)

Only single_shot_inference.py changed — no kernel code touched.
Base commit: 572bdd2
2026-06-03 09:17:54 +00:00
1e77dfcaa0 Fix prompt encoding: remove \n\n before content per official DSV4 spec; add --chat-mode 2026-06-03 08:19:33 +00:00
019a3a34b7 Clean up L0 B1 verify noise (gate on VERBOSE), update FINAL_STRETCH.md
Batched prefill + T>128 chunking now complete. All dangling items in
FINAL_STRETCH.md are marked done.
2026-06-03 08:12:54 +00:00
60309ef124 Batched prefill: replace T=1 token-by-token with chunked T≤128 batch processing
- Process prefill tokens in chunks of up to 128 (FMHA T≤128 constraint)
- Each chunk goes through ALL 61 layers before the next chunk
- KV cache append_swa, compressor, indexer all already support T>1
- FMHA dispatches to dsv4_attention_mixed_fp8_prefill for T>1
- For T>128: splits into multiple launches automatically
- mHC, Router, MoE, Nvfp4Linear all handle M>1 natively
- Eliminates ~N_prefill * 61 per-token overhead from the old loop
2026-06-03 07:39:37 +00:00
75288bd12f Wire prefill FMHA into production.py and single_shot
- Add dsv4_attention_mixed_fp8_prefill to production.py
- _run_production_fmha_mixed now dispatches to prefill kernel for T>1
- Remove decode-only T==1 restriction
- Update FINAL_STRETCH.md: prefill marked DONE, batched prefill TODO noted
2026-06-03 03:49:57 +00:00
af58f2c5b2 Add B1 weight/format verification at L0 in single_shot 2026-06-03 01:52:55 +00:00
b9243fe40a B2: FP8 tensor-core indexer scoring + weighted ReLU + top-k
- New kernel: dsv4/kernels/cuda/indexer_fp8_score_topk.cu
  - Native Blackwell FP8 GEMM via tcgen05.mma.kind::f8f6f4
  - Q (n_ih=64, ihd=128) quantized BF16→FP8, K consumed directly as FP8_E4M3
  - TMEM read using 16x256b.x1 (4-warps parallel, proven from B1 FMHA)
  - On-the-fly: dequant (q_scale*k_scale) → ReLU → weighted sum → top-k
  - No global BF16 staging of indexer keys, no FP32 einsum on CUDA cores
  - Per-thread register heap top-k (same algorithm as indexer_score_topk.cu)

- Modified: single_shot_inference.py
  - Indexer.forward() now takes kv_cache directly (not comp_idx_kv BF16)
  - Consumes FP8 indexer keys from cache without BF16 dequantization
  - Dispatches to B2 FP8 kernel for T=1, n_ih=64, ihd=128 (production decode)
  - FP32 einsum fallback retained only for T>1 (prefill)

- Removed 'Intentional first-pass limits' section from B1 doc
  (those limits ARE the correct production design, not shortcuts)
2026-06-02 23:18:54 +00:00
a9d5e09f4c B1: mixed FP8/BF16 decode FMHA integration
- New: fmha_mixed_fp8_decode.cuh (Blackwell FP8 tensor-core FMHA kernel)
- New: fmha_mixed_fp8_capi.cu (C ABI launcher)
- New: fmha_mixed_fp8_op.py (Python ctypes/nvcc bridge)
- New: fp8_attention_io.cu (Q quantize + mixed KV gather kernels)
- New: fmha_umma_desc.cuh additions (f8f6f4 UMMA + idesc helpers)
- Modified: production.py (dsv4_attention_mixed_fp8_decode API)
- Modified: single_shot_inference.py (B1 gather + FMHA path)
- Modified: __init__.py (export mixed FP8 API)
- New: docs/B1_MIXED_FP8_FMHA.md, FINAL_STRETCH.md

noPE KV stays FP8_E4M3 + per-row scale, RoPE stays BF16.
No global FP8->BF16 KV staging before FMHA.
Decode-only (T==1), specialized HD=512/NOPE=448/ROPE=64.
CUDA compile/runtime validation pending on B200.
2026-06-02 22:53:14 +00:00
9d4a014fad Fix NameError: dequantize_nvfp4 not in scope in forward_attention
The B3 fused q_a_norm path used dequantize_nvfp4 but it was only
imported in forward_layer, not forward_attention. Added local import.
2026-06-02 21:52:29 +00:00
0b6ca0df80 P5 integration + B3 q_a_norm fused + gsa scalar fix
P5: Wire up fused mHC pre_block + RMSNorm + NVFP4 quantize kernel
- Replaces: pre_block bmm + rmsnorm (4+ launches) + quantize (2 launches)
- With: 2 kernel launches (mhc_rmsnorm_amax_gsa + mhc_rmsnorm_quantize_nvfp4)
- Both attn and ffn mHC paths now use P5 fused kernel
- Savings: ~5 launches/site × 2 sites × 61 layers = 610 launches/token

B3: Fused rmsnorm+quant for q_a_norm → q_b path
- q_a output → rmsnorm_quantize_nvfp4 → QuantizedActivation → q_b.run_from_quantized
- Eliminates BF16 round-trip between q_a_norm and q_b GEMM
- Saves: ~6 kernel launches per layer (rmsnorm 4+ + quantize 2 vs fused 2)

gsa scalar fix in Nvfp4Linear.run_from_quantized:
- CuTeDSL NVFP4 GEMM expects global_scale_a as per-expert scalar (shape (1,))
- Per-row gsa from fused kernels must be reduced to scalar (max) for M>1
- For M=1 decode: already scalar, no reduction needed
- Fixes potential correctness issue at prefill (M>1) when using fused paths

Cleanup: Remove --ab-compare flag and A/B comparison code (replaced by P5)
2026-06-02 21:20:34 +00:00
7e42b5e090 A1: Add ◇ (think_start) priming after Assistant token
DSV4 is a reasoning model. The standard prompt format is:
  BOS <|User|> prompt <|Assistant|> ◇
Without the ◇ priming, the model is out-of-distribution — it expects to
be inside a thinking block but never received the sentinel. This causes
degenerate output from step 0 (France instead of Paris, looping on
newlines/repeated tokens).

With ◇, the model will:
1. Generate thinking content (reasoning)
2. Emit ◇ (think_end=128822) to close the thinking block
3. Produce the actual answer
4. Emit EOS (token 1)

This matches the pattern described in the Kimi K2 accuracy blog:
https://vllm.ai/blog/2025-10-28-kimi-k2-accuracy — malformed
prompt formatting is the #1 cause of degenerate output in chat-tuned
reasoning models.
2026-06-02 20:23:47 +00:00
ecd48ab65e A1: Add explicit stop set for DSV4 turn-end tokens
Previously only stopped on tokenizer.eos_token_id. DSV4 uses special
turn-end tokens (<|end_of_sentence|>, USER_TOKEN=128803) that indicate
the assistant turn is complete. Missing these caused decode to continue
past the model's natural stopping point, producing degenerate output.

Also increased diagnostic logging (every step for first 20 steps) to
catch turn-end token emissions.
2026-06-02 19:59:52 +00:00
eb5ef93bf1 Add A/B comparison mode for P4 fused vs unfused RMSNorm+quantize
- Added --ab-compare flag to run both fused and unfused paths for first 3 layers
- Compares x_normed, gsa values, FP4 data, and GEMM outputs (q_a, kv)
- Added --no-fused-rmsnorm to disable P4 and use unfused path
- This will help diagnose the correctness regression introduced by P4
2026-06-02 18:49:30 +00:00
7bb3207347 P4: Integrate fused RMSNorm+quantize into single_shot (attention path)
- forward_layer: use rmsnorm_quantize_nvfp4 for attn_norm
- forward_attention: accept x_quant, use run_from_quantized for q_a/kv
- Dequantize for compressor/indexer (still saves 2+ launches per site)
- FFN path kept unfused — MoE internal quantization needs refactoring (P5)
- _use_fused_rmsnorm_quantize flag to toggle (default True)
2026-06-02 16:38:44 +00:00
82294fc21e Fix nope_dim UnboundLocalError — hoist to function scope 2026-06-02 11:18:58 +00:00
c89762ecdd Fix set_indexer_keys_fp8 None guard + store comp_pos in mixed storage 2026-06-02 10:20:26 +00:00
1f69f61363 Add detailed comment: why compressed KV uses FP8 not NVFP4
We tried NVFP4 (Blackwell native FP4→MMA). Three approaches.
cos=0.995 round-trip seems fine in isolation but 4.5 effective bits
compounds fatally across 61 layers of mHC. FP8_E4M3's 5.3 effective
bits gives cos=0.9997 — that 0.4% difference is the margin between
working and broken. Kernels exist, path is proven, precision isn't.
2026-06-02 10:19:54 +00:00
edc8e7ee8d KV-1/KV-2: Mixed FP8+BF16 compressed KV (DeepSeek V4 paper format)
Architecture matches paper: 'BF16 for RoPE dims, FP8 for remaining dims'
- Non-RoPE dims (448 of 512): FP8_E4M3 storage → dequant to BF16 for FMHA
- RoPE dims (64 of 512): BF16 storage (RoPE applied directly, no conversion)
- Indexer keys: FP8_E4M3 (ihd=128, no RoPE)
- SWA: BF16 (unchanged)

Pipeline:
  Compressor → FP32 → split → [nope: FP32→FP8] + [rope: FP32→BF16→RoPE]
  Gather: [nope: FP8→BF16] + [rope: BF16] → concat → FMHA

No BF16 intermediate for non-RoPE data.
No FP32 intermediate after BF16 RoPE.
BF16 is the final format consumed by FMHA (no further conversion).

KVCache rewritten:
- comp_nope_fp8/scale: FP8 storage for non-RoPE
- comp_rope_bf16: BF16 storage for RoPE
- comp_nope_selective/all: FP8→BF16 dequant
- comp_rope_selective/all: BF16 gather
- set_compressed_mixed: write mixed format
- set_indexer_keys_fp8: write FP8 indexer keys
2026-06-02 10:08:43 +00:00
7ef6402936 KV-1/KV-2/KV-3: NVFP4 compressed KV + FP8 indexer keys
Architecture:
- Compressed KV: stored as NVFP4 (E2M1 + E4M3 + FP32 gsa)
  - Write path: compress→FP32 → FP32 RoPE → quantize FP32→NVFP4
  - Read path: dequant_nvfp4/dequant_nvfp4_selective → BF16 for FMHA
  - No BF16 intermediate in the write path
- Indexer keys: stored as FP8_E4M3 (1 byte + per-row scale)
  - Write path: compress→FP32 → quantize FP32→FP8_E4M3
  - Read path: dequant_fp8_e4m3 → BF16 for scoring
- SWA: remains BF16 (8MB total, fits in L2)

New kernels in kv_quantize.cu:
- compute_amax_gsa_fp32: per-row gsa from FP32 input
- quantize_nvfp4_from_fp32: FP32→NVFP4 with GPU gsa buffer
- quantize_fp8_e4m3_from_fp32: FP32→FP8_E4M3 for indexer keys
- dequant_fp8_e4m3 / dequant_fp8_e4m3_selective: FP8→BF16
- rope_fp32: FP32 GPT-J interleaved RoPE (no BF16)

Proven two-kernel pattern (same as quantize_nvfp4_gpu_fused):
  Kernel 1: amax_gsa (GPU-only)
  Kernel 2: quantize from buffer (GPU gsa)
No shared memory bugs. No cross-CTA race conditions.

KVCache updated:
- comp_kv_fp4/sf/gsa: NVFP4 storage (3.5× smaller than BF16)
- comp_idx_fp8/scale: FP8_E4M3 storage (1.9× smaller than BF16)
- comp_kv property: dequant NVFP4→BF16 on demand
- comp_kv_selective: dequant only top-k entries (bandwidth savings)
- comp_idx_kv property: dequant FP8→BF16 on demand

Removed: compressor_reduce_quant.cu (buggy single-kernel approach)
2026-06-02 10:00:50 +00:00
3c295f225a P3: integrate CUDA RoPE kernel into single_shot — 732 launches/token eliminated
_apply_rope now uses dsv4.ops.rope_cuda (1 CUDA kernel per call)
instead of PyTorch ops (5-6 kernels per call).
Total: 183 RoPE calls × (5-1) = 732 launches saved per token.
With fallback to PyTorch if CUDA kernel fails.
2026-06-02 09:08:07 +00:00
553275d810 feat: P1 — add eager warmup_fused_swiglu_compilation for SharedExpert (1-group) 2026-06-02 08:25:52 +00:00
d8e17d70c1 P0+P1+P2: Enable fused SwiGLU (MoE+SE), fix SE _run_l1_fused, remove per-call gsa fill_
P0: Enable fused SwiGLU for MoE (set_fused_swiglu(True))
  - Saves 240+ unfused BF16 kernel launches per token
  - SiLU + clamp in kernel registers instead of separate launches

P1: Fix shared expert _run_l1_fused + enable fused SwiGLU
  - Fixed: _l1_sf_view -> _l1_scale_b, _l1_gs_view -> _l1_gsb
  - Fixed: expert_offsets dtype int64 -> int32
  - Added proper padded buffer + scale assembly (matching unfused path)
  - Added runtime gsa support (quantize_nvfp4_gpu_fused)

P2: Remove per-call gsa_buf.fill_() in Nvfp4Linear
  - fill_() was H2D transfer every forward pass (~5µs × 244 calls = ~1.2ms/token)
  - _gsa_buf now initialized with _activation_global_scale (not zeros)
  - After warmup_gsa, buffer already has correct value — no fill needed
2026-06-02 07:57:39 +00:00
790f8c350a perf: P2 landed (gsa fill elimination). P0/P1 fused SwiGLU disabled — CuTeDSL kernel arg-binding bug.
P0/P1: The fused SwiGLU kernel's warmup_fused_swiglu_compilation() triggers
'TypeError: too many positional arguments' during cute.compile(). The kernel
signature doesn't match the positional args being passed. This is a kernel-side
fix, not a single_shot fix. Disabled until the fused kernel is debugged.

P2: Landed — Nvfp4Linear skips redundant _gsa_buf.fill_() after warmup.

SE fused SwiGLU infrastructure (set_fused_swiglu, _run_l1_fused, interleaved
weight path) is wired but disabled. Will activate once kernel fix lands.
2026-06-02 07:16:08 +00:00
040b2eb6e7 perf: P0/P1/P2 — fused SwiGLU for MoE+SE, eliminate per-call gsa fill
P0: Enable fused SwiGLU for all MoE instances (moe._fused_swiglu = True).
    Eliminates ~8 BF16 kernel launches per MoE per token (gate/up split,
    SiLU, clamp, elementwise multiply → single fused kernel launch).

P1: Enable fused SwiGLU for shared expert (SE):
    - Added set_fused_swiglu() method to Nvfp4SharedExpert
    - Added _run_l1_fused() using run_fused_swiglu_grouped_gemm (1-group)
    - Interleave L1 weights at finalize time for fused kernel compatibility
    - Fused kernel handles SwiGLU + clamp in registers, outputs BF16

P2: Eliminate per-call _gsa_buf.fill_() in Nvfp4Linear:
    - _activation_global_scale is set once at warmup, never changes after
    - Skip redundant fill_() via _gsa_buf_initialized flag
    - Saves 244 CPU→GPU scalar fills per token (4 linears × 61 layers)

P3: Deferred (in-kernel RoPE fusion — kernel-side change, not single_shot)
2026-06-02 06:59:25 +00:00
e9506e0c20 perf: C1/C2/C3 — per-layer max_comp, pre-allocated gather_buf, SWA views
C1: --max-context CLI flag (default 8192). KVCache.max_comp computed from
    (max_context + compress_ratio - 1) // ratio per layer type.
    CSA at 8192 context → 2048 entries. HCA at 8192 → 64 entries.
    No more hardcoded 65536 that wastes memory on HCA layers.

C2: Pre-allocated gather_buf (indexer_top_k + window_size, hd) in KVCache.
    Gather writes compressed+SWA into this buffer via slice assignment.
    Zero torch.cat allocations on the hot decode path.

C3: get_swa returns views (no .clone()). Ring-buffer wrap returns indexed
    views. Caller copies into gather_buf so no aliasing risk.
2026-06-02 06:18:06 +00:00
617da29a5b fix: assert topk_idx is not None in CSA layers — no silent fallback to SWA-only
The indexer silently returning None caused CSA layers to attend over only the
SWA window (128 tokens), not the compressed sparse KV. This went undetected
because the model still produced plausible output at short context. The assert
makes any future indexer regression immediately visible.
2026-06-02 06:14:23 +00:00
5b4c496512 fix: three indexer bugs — weight path, comp_idx_buf width, scoring einsum
1. Indexer.load: weights at *.indexer.kv_proj not *.indexer.compressor.kv_proj
2. KVCache.comp_idx_buf: width=ihd (128) not head_dim (512); parametric via indexer_key_dim
3. Indexer.forward: stored keys are (n_comp, ihd) not (n_comp, n_ih, ihd);
   einsum changed from 'tnd,cnd->tnc' to 'tnd,cd->tnc' — key shared across indexer heads
   (paper's c_I = ihd = 128, one vector per compressed block)

Also removed probe diagnostics (COMPRESSOR BUFFERING, COMPRESSOR OUT, INDEXER SKIP,
RESHAPE FAILURE, indexer load state) — served their purpose.
2026-06-02 05:53:10 +00:00
8162c586c3 probe: fix comp_idx_buf width to ihd=128 so indexer probe can complete 2026-06-02 05:38:44 +00:00
5be31d8582 fix: indexer compressor weight path — weights are at *.indexer.kv_proj not *.indexer.compressor.kv_proj 2026-06-02 05:25:44 +00:00
fdfcca918c probe: verify indexer compressor load state 2026-06-02 05:17:00 +00:00
fb0ed87626 probe: add indexer compressor early-return and buffering diagnostics 2026-06-02 05:06:18 +00:00
06c92f208f INDEXER PROBE: instrumentation prints for compressed key width investigation 2026-06-02 04:44:47 +00:00
f0dec9f6bd profile: fine-grained attention component timing 2026-06-02 03:08:34 +00:00
7114c48575 fix: parenthesize profile_detail condition 2026-06-02 02:56:13 +00:00
4734e894c7 profile: add per-layer attn vs ffn timing with CUDA sync 2026-06-02 02:46:35 +00:00
4017ef2f16 fix: accurate profile sync + remove paris_tids 129K iteration 2026-06-01 23:55:26 +00:00
73ae9393da FIX: RoPE cache 8192→65536 (original_max_position_embeddings), KVCache max_comp 32768→65536 2026-06-01 23:18:37 +00:00
36f9782bad Add thinking/Paris token logit check on step 0 for quality debugging 2026-06-01 23:14:24 +00:00
ef7e0d63bb Add --warmup-gsa flag: fix attention/router gsa after first decode step to eliminate amax kernel launches 2026-06-01 23:04:44 +00:00
008e59eb90 Add --profile flag: per-component GPU timing with CUDA sync (embed+layers, lm_head, sampling) 2026-06-01 23:03:46 +00:00
e53645654d Reduce hot-path .item() syncs: gate li>=58 diagnostics behind VERBOSE>=2, topk on float 2026-06-01 22:33:03 +00:00
6f4bbc997a Add sync after sampler for step<3 to catch async CUDA errors early 2026-06-01 22:32:40 +00:00
5493a8727e P7: compressor early return + decode buffering (skip GEMMs when n_complete=0); sampler SMEM fix (LK=24 fits 48KB default); topk on float not bf16 2026-06-01 22:29:56 +00:00
583ad6cfe6 P0 complete: Kill .item() in grouped_linear, reduce hot-path syncs
- grouped_linear.py: Replace .item() gsa + Python quantize with
  quantize_nvfp4_gpu_fused (zero CPU syncs). Flatten all groups
  into (G*T, D), single fused kernel launch, GPU-only gsa copy.
- single_shot_inference.py: Reduce torch.cuda.synchronize() to
  every 20 steps instead of every step. Gate per-layer diagnostics
  to li<3 or li>=58 (avoid 61 .item() calls per decode step).
2026-06-01 22:21:12 +00:00
8767c263ab Add cuda.synchronize + better logits validation after lm_head
Catch CUDA errors at the source instead of seeing them
surfaced at torch.topk. Print logits stats every step.
2026-06-01 22:06:41 +00:00
2a6f9a10b1 lm_head: fall back to BF16 F.linear for stability
NVFP4 quantize_from_buffer produces CUDA error on large-magnitude
inputs (|X|>500 at L60 output). BF16 lm_head is correct and only
runs once per decode step — not a bottleneck.

TODO: debug the NVFP4 path for large activations and re-enable.
2026-06-01 22:05:22 +00:00