diff --git a/README.md b/README.md
index 6726849e..b0a335be 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,73 @@
 # DSV4 Inference Kernel
 
+## Architecture
+
+DSV4 is **not MLA**. It uses **CSA (Compressed Sparse Attention, m=4)** and **HCA (Heavily Compressed Attention, m′=128)**. KV latent is (T, 512) shared across all 128 heads. Sink weights merge sparse + SWA attention. vLLM misnames this as "MLA" — it is not. The architecture is fundamentally different.
+
+```
+DSV4 inference pipeline — component status
+==========================================
+
+Legend:
+  [✓] built and tested
+  [~] partial — reference or seam exists, native pending
+  [✗] to build
+
+
+                  ┌────────────────────────────────────┐
+                  │  [✗] Embedding + mHC init          │
+                  │      token embed + n_hc=4 streams  │
+                  └────────────────┬───────────────────┘
+                                   │
+                                   ▼
+┌─ Transformer layer × L ──────────────────────────────────────────────┐
+│  HCA on layers 0–1 of Pro, alternating CSA / HCA after               │
+│                                                                      │
+│  ┌─ Attention sub-block ──────────────────────────────────────────┐  │
+│  │ [✓] Residual            mHC pre + post mix                     │  │
+│  │ [~] Norms + RoPE        RMSNorm + partial RoPE                 │  │
+│  │ [✓] Q / KV projection   NVFP4 linears + LoRA                   │  │
+│  │ [~] Token compressor    CSA m=4 / HCA m′=128                   │  │
+│  │ [✗] Indexer + top-k     CSA only, FP4 QK                       │  │
+│  │ [~] FMHA core           QK → online softmax → PV               │  │
+│  │                         + SWA branch + sink merge              │  │
+│  │ [✓] Output projection   inv RoPE + wo_a grouped + wo_b         │  │
+│  └────────────────────────────────────────────────────────────────┘  │
+│                                                                      │
+│  ┌─ FFN sub-block ────────────────────────────────────────────────┐  │
+│  │ [✓] Residual            mHC pre + post mix                     │  │
+│  │ [~] Pre-FFN norm        RMSNorm                                │  │
+│  │ [✗] Router              sqrt(softplus) + topk + hash           │  │
+│  │ [✓] Routed MoE          fused SwiGLU L1 + L2                   │  │
+│  │ [✓] Shared expert       NVFP4 single-group GEMM                │  │
+│  └────────────────────────────────────────────────────────────────┘  │
+└──────────────────────────────────┬───────────────────────────────────┘
+                                   │
+                                   ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│  [✗] Final RMSNorm → [✗] LM head → [✗] MTP (depth=1) → [✗] Sampler   │
+└──────────────────────────────────────────────────────────────────────┘
+
+┌─ Supporting infrastructure ──────────────────────────────────────────┐
+│  [✗] KV cache management                                             │
+│      • state cache: SWA window + uncompressed tail per layer         │
+│      • classical paged cache: lcm(m, m′) = 128 tokens per block      │
+│      • heterogeneous layout per layer                                │
+└──────────────────────────────────────────────────────────────────────┘
+
+
+Summary
+-------
+  Built    [✓] : 6 — mHC ×2, Q/KV proj, output proj, routed MoE,
+                     shared expert
+  Partial  [~] : 4 — norms+RoPE, token compressor, FMHA core,
+                     pre-FFN norm
+  To build [✗] : 8 — embedding+init, indexer+top-k, router,
+                     final norm, LM head, MTP, sampler, KV cache
+```
+
+---
+
 ## Status (May 21, 2026 — 17:30 UTC)
 
 | Stage | Status | Description |