diff --git a/README.md b/README.md
index 87ac4bc5..0897dd55 100644
--- a/README.md
+++ b/README.md
@@ -137,57 +137,43 @@ One pass, one kernel. No two-loop epilogue, no LSE arithmetic in the merge. This
 
 ```
 dsv4/
-├── kernels/                Pure GPU code (CuTeDSL @cute.jit, .cu files)
-│   ├── attention/          FMHA — FmhaKernel (hd=64/128/256 proven, hd=512 MLIR-blocked)
+├── kernels/                Pure GPU code
+│   ├── attention/          Production FMHA — 6-warp TMA multi-tile (.cuh + C-API .cu + op.py + production.py)
+│   │                       production.py is the entry point used by single_shot_inference.py
 │   ├── gemm/               NVFP4 MoE GEMM (grouped, fused_swiglu, dense, scheduler)
-│   ├── compressor/         CSA/HCA token-level compressor (CuTeDSL)
-│   ├── indexer/            CSA indexer score+topk (FP32 scalar today; tensor-core FP4 on roadmap)
-│   ├── router/             Dense router decode kernel (warp-specialized persistent GEMM)
-│   ├── cache/              append_swa (writes KV to state cache)
-│   ├── decode/             Decode-time attention (future)
-│   └── cuda/               Raw .cu (deinterleave_quantize, sparse_topk_metadata, etc.)
+│   ├── compressor/         CSA/HCA production compressor (production_compress.py → compressor_reduce.cu)
+│   ├── indexer/            CSA indexer (stub; live path is inline in single_shot_inference.py)
+│   ├── router/             Dense router decode + activation_topk
+│   ├── cuda/               Raw .cu kernels (loader.py compiles on demand)
+│   └── cache/              (stub; SWA/flush kernels are in cuda/)
 ├── ops/                    PyTorch ↔ kernel bridges
-│   ├── quantize.py         BF16 ↔ NVFP4, scale factor handling
+│   ├── quantize.py         BF16 ↔ NVFP4, scale factor handling, QuantizedActivation
 │   ├── layouts.py          Scale swizzle, gate/up interleave, K-major, offsets
 │   ├── gemm_runner.py      Warmup, compile, run grouped/fused GEMMs
 │   ├── custom_ops.py       torch.library.custom_op registrations
-│   ├── decode_sparse.py    native_sparse_decode dispatcher
-│   ├── rope.py             Forward + inverse RoPE (partial, last 64 dims)
-│   ├── topk.py             Sparse top-k metadata wrapper
-│   └── router.py           Router op bridge
-├── layers/                 nn.Module-style components
+│   ├── rope_cuda.py        Forward + inverse RoPE (partial, last 64 dims)
+│   └── router.py           Router op bridge (dense + hash dispatch)
+├── layers/                 nn.Module-style components (used by single_shot_inference.py)
 │   ├── linear.py           Nvfp4Linear
 │   ├── grouped_linear.py   Nvfp4GroupedLinear (output projection)
 │   ├── moe.py              Nvfp4MoE (routed experts)
 │   ├── shared_expert.py    Nvfp4SharedExpert
 │   ├── mhc.py              mHCLayer (Sinkhorn-Knopp, residual mixing)
-│   ├── attention.py        AttentionSubBlock (CSA/HCA/SWA variants by LayerSpec)
-│   ├── norm.py             RMSNorm
-│   ├── router.py           Router (dense + hash modes)
-│   ├── embedding.py        Token embedding + mHC init
-│   └── ffn.py              FFN sub-block
-├── model/                  Model assembly
+│   └── router.py           Router (dense + hash modes)
+├── model/
 │   ├── config.py           DSV4Config
-│   ├── layer.py            TransformerLayer
-│   ├── layer_schedule.py   LayerSpec, AttentionType, build_schedule, validate_schedule
-│   ├── mtp.py              Multi-token prediction
-│   ├── sampler.py          Token sampler
-│   └── dsv4.py             Full model
-├── cache/                  KV cache infra
-│   ├── allocator.py        Memory allocator
-│   ├── block_table.py      Paged cache block table
-│   ├── manager.py          Cache manager
-│   ├── paged_cache.py      Classical paged cache (CSA/HCA)
-│   ├── state_cache.py      State cache (SWA + uncompressed tail)
-│   ├── schema.py, handle.py, flush.py, prepare_forward.py
-├── loader/                 Checkpoint I/O
-│   ├── hf_checkpoint.py
-│   └── layout_convert.py
-└── reference/              Slow PyTorch oracles (never imported by production code)
-    ├── attention.py, csa_attention.py, compressor.py, moe_pipeline.py
+│   └── sampler.py          CUDASampler
+├── reference/
+│   └── single_shot_PYTORCH_REFERENCE.py  PyTorch oracle for layer comparison tests
+└── _archive/               Dead Lineage P code (model/dsv4.py, cache/*, layers/{attention,ffn,norm,embedding}, etc.)
+                            Kept for reference; never imported by live code
 ```
 
-**Dependency arrow:** `kernels/` → `ops/` → `layers/` → `model/`. `reference/` and `loader/` are sidecars.
+**Live path:** `single_shot_inference.py` → `dsv4/layers/*` → `dsv4/ops/*` → `dsv4/kernels/**`
+
+**Attention path:** `production.py` → `fmha_multitile_op.py` → `fmha_multitile_capi.cu` → `fmha_6warp_tma_multirow_multitile.cuh`
+
+**Archived (Lineage P):** `dsv4/model/dsv4.py`, `dsv4/cache/*`, `dsv4/layers/{attention,ffn,norm,embedding}` — these were the vLLM/sglang integration surface but have 0 importers. See `_archive/` if needed.
 
 ---
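
Review note: the new README text claims that `_archive/` is "never imported by live code" and that the archived modules "have 0 importers". A check along the following lines could back that claim in CI. This is only a sketch, not part of the patch: the function name `find_archive_importers` is hypothetical, and it assumes the archive would be importable as `dsv4._archive`, which the diff does not confirm. It parses files with `ast` rather than importing them, so it needs neither a GPU nor PyTorch. Relative imports are deliberately not resolved.

```python
import ast
from pathlib import Path


def find_archive_importers(root: Path, archived_prefix: str = "dsv4._archive") -> list[str]:
    """Return 'relative/path.py:lineno' for each absolute import of archived_prefix.

    Hypothetical helper; the dotted name 'dsv4._archive' is an assumption
    based on the directory tree in the README, not a confirmed package path.
    """
    hits = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root)
        # Files inside the archive may import each other; only live code counts.
        if "_archive" in rel.parts:
            continue
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError:
            continue  # unparsable file: skipped here, could be reported instead
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                # Covers both 'from dsv4._archive.x import y' and 'from dsv4 import _archive'.
                names = [node.module] + [f"{node.module}.{a.name}" for a in node.names]
            else:
                continue  # not an import, or a relative import (not resolved in this sketch)
            if any(n == archived_prefix or n.startswith(archived_prefix + ".") for n in names):
                hits.append(f"{rel.as_posix()}:{node.lineno}")
    return hits
```

Run from the repository root, `find_archive_importers(Path("."))` should return an empty list if the "0 importers" statement holds for absolute imports; any entry names the file and line to fix.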