README: add full DSV4 pipeline architecture diagram (CSA/HCA, not MLA)
README.md
@@ -1,5 +1,73 @@
# DSV4 Inference Kernel

## Architecture

DSV4 is **not MLA**. It uses **CSA (Compressed Sparse Attention, m=4)** and **HCA (Heavily Compressed Attention, m′=128)**. The KV latent has shape (T, 512) and is shared across all 128 heads. Sink weights merge the sparse and SWA attention branches. vLLM labels this architecture "MLA", but it is fundamentally different.

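To illustrate the sink merge mentioned above, here is a minimal sketch of one common way to combine two partial attention results with a sink term. It is an assumption, not the kernel's confirmed formulation: it treats the sink as a per-head logit that adds to the softmax denominator only, and it assumes each branch returns a normalized output plus its log-sum-exp. The function name `merge_with_sink` and all parameter names are hypothetical.

```python
import math


def merge_with_sink(out_sparse, lse_sparse, out_swa, lse_swa, sink_logit):
    """Merge sparse-branch and SWA-branch outputs for one head and one query.

    out_*      : branch output, already normalized by that branch's softmax
    lse_*      : log-sum-exp of that branch's attention logits
    sink_logit : hypothetical sink logit; contributes to the denominator only
    """
    # Shift by the max so the exponentials cannot overflow.
    shift = max(lse_sparse, lse_swa, sink_logit)
    w_sparse = math.exp(lse_sparse - shift)
    w_swa = math.exp(lse_swa - shift)
    w_sink = math.exp(sink_logit - shift)
    denom = w_sparse + w_swa + w_sink
    # The sink has no value vector, so it only scales the result down.
    return [(w_sparse * a + w_swa * b) / denom
            for a, b in zip(out_sparse, out_swa)]


# With no sink (logit = -inf) and equal branch mass, the result is the mean.
no_sink = merge_with_sink([1.0], 0.0, [3.0], 0.0, float("-inf"))
# A sink with the same mass as each branch shrinks the output by a third.
with_sink = merge_with_sink([1.0], 0.0, [3.0], 0.0, 0.0)
```

Working from log-sum-exp values lets each branch run its own online softmax independently, which is why this form is often used to join partial attention results.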
```
DSV4 inference pipeline - component status
==========================================

Legend:
  [✓]  built and tested
  [~]  partial - reference or seam exists, native pending
  [✗]  to build


                  ┌────────────────────────────────────┐
                  │  [✗] Embedding + mHC init          │
                  │      token embed + n_hc=4 streams  │
                  └────────────────┬───────────────────┘
                                   │
                                   ▼
┌─ Transformer layer × L ──────────────────────────────────────────────┐
│  HCA on layers 0–1 of Pro, alternating CSA / HCA after               │
│                                                                      │
│  ┌─ Attention sub-block ──────────────────────────────────────────┐  │
│  │  [✓] Residual           mHC pre + post mix                     │  │
│  │  [~] Norms + RoPE       RMSNorm + partial RoPE                 │  │
│  │  [✓] Q / KV projection  NVFP4 linears + LoRA                   │  │
│  │  [~] Token compressor   CSA m=4 / HCA m′=128                   │  │
│  │  [✗] Indexer + top-k    CSA only, FP4 QK                       │  │
│  │  [~] FMHA core          QK → online softmax → PV               │  │
│  │                         + SWA branch + sink merge              │  │
│  │  [✓] Output projection  inv RoPE + wo_a grouped + wo_b         │  │
│  └────────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌─ FFN sub-block ────────────────────────────────────────────────┐  │
│  │  [✓] Residual           mHC pre + post mix                     │  │
│  │  [~] Pre-FFN norm       RMSNorm                                │  │
│  │  [✗] Router             sqrt(softplus) + topk + hash           │  │
│  │  [✓] Routed MoE         fused SwiGLU L1 + L2                   │  │
│  │  [✓] Shared expert      NVFP4 single-group GEMM                │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────┬───────────────────────────────────┘
                                   │
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│  [✗] Final RMSNorm → [✗] LM head → [✗] MTP (depth=1) → [✗] Sampler   │
└──────────────────────────────────────────────────────────────────────┘

┌─ Supporting infrastructure ──────────────────────────────────────────┐
│  [✗] KV cache management                                             │
│      • state cache: SWA window + uncompressed tail per layer         │
│      • classical paged cache: lcm(m, m′) = 128 tokens per block      │
│      • heterogeneous layout per layer                                │
└──────────────────────────────────────────────────────────────────────┘


Summary
-------
Built    [✓] : 6 - mHC ×2, Q/KV proj, output proj, routed MoE,
                   shared expert
Partial  [~] : 4 - norms+RoPE, token compressor, FMHA core,
                   pre-FFN norm
To build [✗] : 8 - embedding+init, indexer+top-k, router,
                   final norm, LM head, MTP, sampler, KV cache
```
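
The paged-cache block size in the diagram follows from the two compression ratios. The sketch below works through that arithmetic. It assumes each compressor emits one cache entry per m (or m′) tokens, which the names suggest but this README does not state outright. The helper names `entries_per_block` and `blocks_needed` are hypothetical and are not part of any existing cache manager.

```python
import math

M_CSA = 4      # CSA compression ratio m
M_HCA = 128    # HCA compression ratio m′

# A block spans lcm(m, m′) tokens so that both layer types fill whole entries.
BLOCK_TOKENS = math.lcm(M_CSA, M_HCA)


def entries_per_block(ratio: int) -> int:
    """Compressed entries stored per block, assuming one entry per `ratio` tokens."""
    return BLOCK_TOKENS // ratio


def blocks_needed(num_tokens: int) -> int:
    """Blocks required to cover `num_tokens` tokens (ceiling division)."""
    return -(-num_tokens // BLOCK_TOKENS)


csa_entries = entries_per_block(M_CSA)   # entries per block on a CSA layer
hca_entries = entries_per_block(M_HCA)   # entries per block on an HCA layer
```

Under these assumptions a CSA layer stores 32 entries per block and an HCA layer stores 1, which is one reason the per-layer layout ends up heterogeneous.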

---

## Status (May 21, 2026, 17:30 UTC)

| Stage | Status | Description |