README: add full DSV4 pipeline architecture diagram (CSA/HCA, not MLA)

2026-05-21 17:40:25 +00:00
parent 364d9edcd3
commit e8485b9cf5
1 changed file with 68 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -1,5 +1,73 @@
 # DSV4 Inference Kernel

+## Architecture
+
+DSV4 is **not MLA**, although vLLM labels it as such. It uses **CSA (Compressed Sparse Attention, m=4)** and **HCA (Heavily Compressed Attention, m′=128)**. The KV latent has shape (T, 512) and is shared across all 128 heads. Sink weights merge the sparse and SWA attention branches.
+
+```
+DSV4 inference pipeline — component status
+==========================================
+
+Legend:
+ [✓] built and tested
+ [~] partial — reference or seam exists, native pending
+ [✗] to build
+
+
+ ┌────────────────────────────────────┐
+ │ [✗] Embedding + mHC init          │
+ │ token embed + n_hc=4 streams      │
+ └────────────────┬───────────────────┘
+                  │
+                  ▼
+┌─ Transformer layer × L ──────────────────────────────────────────────┐
+│ HCA on layers 0–1 of Pro, then alternating CSA / HCA               │
+│                                                                      │
+│ ┌─ Attention sub-block ──────────────────────────────────────────┐  │
+│ │ [✓] Residual mHC pre + post mix                               │  │
+│ │ [~] Norms + RoPE             RMSNorm + partial RoPE           │  │
+│ │ [✓] Q / KV projection        NVFP4 linears + LoRA             │  │
+│ │ [~] Token compressor         CSA m=4 / HCA m′=128             │  │
+│ │ [✗] Indexer + top-k          CSA only, FP4 QK                 │  │
+│ │ [~] FMHA core                QK → online softmax → PV         │  │
+│ │                              + SWA branch + sink merge         │  │
+│ │ [✓] Output projection        inv RoPE + wo_a grouped + wo_b   │  │
+│ └────────────────────────────────────────────────────────────────┘  │
+│                                                                      │
+│ ┌─ FFN sub-block ────────────────────────────────────────────────┐  │
+│ │ [✓] Residual mHC pre + post mix                               │  │
+│ │ [~] Pre-FFN norm              RMSNorm                          │  │
+│ │ [✗] Router                    sqrt(softplus) + topk + hash     │  │
+│ │ [✓] Routed MoE               fused SwiGLU L1 + L2             │  │
+│ │ [✓] Shared expert            NVFP4 single-group GEMM          │  │
+│ └────────────────────────────────────────────────────────────────┘  │
+└──────────────────────────────────┬───────────────────────────────────┘
+                                  │
+                                  ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│ [✗] Final RMSNorm → [✗] LM head → [✗] MTP (depth=1) → [✗] Sampler │
+└──────────────────────────────────────────────────────────────────────┘
+
+┌─ Supporting infrastructure ──────────────────────────────────────────┐
+│ [✗] KV cache management                                             │
+│ • state cache: SWA window + uncompressed tail per layer             │
+│ • classical paged cache: lcm(m, m′) = 128 tokens per block         │
+│ • heterogeneous layout per layer                                    │
+└──────────────────────────────────────────────────────────────────────┘
+
+
+Summary
+-------
+ Built  [✓] : 6 — mHC ×2, Q/KV proj, output proj, routed MoE,
+               shared expert
+ Partial [~] : 4 — norms+RoPE, token compressor, FMHA core,
+               pre-FFN norm
+ To build [✗] : 8 — embedding+init, indexer+top-k, router,
+               final norm, LM head, MTP, sampler, KV cache
+```
+
+---
+
 ## Status (May 21, 2026 — 17:30 UTC)

 | Stage | Status | Description |