Cleanup Part 2: Fix docs, stale references, dead code
- Update README.md package structure to match actual file tree
- Remove references to nonexistent fmha.py, fmha_smem_acc, kernels/decode/
- Document live attention path: production.py → fmha_multitile_op → capi.cu → .cuh
- Add _archive/ section
- Fix loader.py docstring: fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer
- Remove preload_all() (dead, referenced nonexistent compressor_reduce_quant.cu)
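The first two items come down to checking that every path the README names still exists in the tree. A minimal sketch of such a check follows. It assumes paths appear in backticks and that the list of tracked files is supplied by the caller; `find_stale_refs` and `PATH_TOKEN` are hypothetical names for illustration, not part of this repository.

```python
import re
from pathlib import PurePosixPath

# Hypothetical helper, not part of the repository.
# Matches backticked tokens that look like source files (.py/.cu/.cuh)
# or directories (trailing slash).
PATH_TOKEN = re.compile(r"`([\w./-]+\.(?:py|cuh|cu)|[\w./-]+/)`")

def find_stale_refs(readme_text, tracked_files):
    """Return README path tokens that match no tracked file or directory."""
    files = set(tracked_files)
    # Derive every parent directory of every tracked file, with a trailing slash.
    dirs = {
        str(parent) + "/"
        for f in files
        for parent in PurePosixPath(f).parents
        if str(parent) != "."
    }
    known = files | dirs
    return sorted({tok for tok in PATH_TOKEN.findall(readme_text) if tok not in known})

readme = "See `dsv4/kernels/attention/production.py`, `dsv4/kernels/decode/` and `dsv4/ops/`."
tracked = ["dsv4/kernels/attention/production.py", "dsv4/ops/quantize.py"]
stale = find_stale_refs(readme, tracked)  # only the decode/ directory is missing
```

In practice the tracked-file list could come from the output of `git ls-files`. The check only sees paths written relative to the repository root, so tree diagrams with bare names such as `decode/` would need a separate pass.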
62
README.md
@@ -137,57 +137,43 @@ One pass, one kernel. No two-loop epilogue, no LSE arithmetic in the merge. This
 
 ```
 dsv4/
-├── kernels/                Pure GPU code (CuTeDSL @cute.jit, .cu files)
-│   ├── attention/          FMHA — FmhaKernel (hd=64/128/256 proven, hd=512 MLIR-blocked)
+├── kernels/                Pure GPU code
+│   ├── attention/          Production FMHA — 6-warp TMA multi-tile (.cuh + C-API .cu + op.py + production.py)
+│   │                       production.py is the entry point used by single_shot_inference.py
 │   ├── gemm/               NVFP4 MoE GEMM (grouped, fused_swiglu, dense, scheduler)
-│   ├── compressor/         CSA/HCA token-level compressor (CuTeDSL)
-│   ├── indexer/            CSA indexer score+topk (FP32 scalar today; tensor-core FP4 on roadmap)
-│   ├── router/             Dense router decode kernel (warp-specialized persistent GEMM)
-│   ├── cache/              append_swa (writes KV to state cache)
-│   ├── decode/             Decode-time attention (future)
-│   └── cuda/               Raw .cu (deinterleave_quantize, sparse_topk_metadata, etc.)
+│   ├── compressor/         CSA/HCA production compressor (production_compress.py → compressor_reduce.cu)
+│   ├── indexer/            CSA indexer (stub; live path is inline in single_shot_inference.py)
+│   ├── router/             Dense router decode + activation_topk
+│   ├── cuda/               Raw .cu kernels (loader.py compiles on demand)
+│   └── cache/              (stub; SWA/flush kernels are in cuda/)
 ├── ops/                    PyTorch ↔ kernel bridges
-│   ├── quantize.py         BF16 ↔ NVFP4, scale factor handling
+│   ├── quantize.py         BF16 ↔ NVFP4, scale factor handling, QuantizedActivation
 │   ├── layouts.py          Scale swizzle, gate/up interleave, K-major, offsets
 │   ├── gemm_runner.py      Warmup, compile, run grouped/fused GEMMs
 │   ├── custom_ops.py       torch.library.custom_op registrations
-│   ├── decode_sparse.py    native_sparse_decode dispatcher
-│   ├── rope.py             Forward + inverse RoPE (partial, last 64 dims)
-│   ├── topk.py             Sparse top-k metadata wrapper
-│   └── router.py           Router op bridge
-├── layers/                 nn.Module-style components
+│   ├── rope_cuda.py        Forward + inverse RoPE (partial, last 64 dims)
+│   └── router.py           Router op bridge (dense + hash dispatch)
+├── layers/                 nn.Module-style components (used by single_shot_inference.py)
 │   ├── linear.py           Nvfp4Linear
 │   ├── grouped_linear.py   Nvfp4GroupedLinear (output projection)
 │   ├── moe.py              Nvfp4MoE (routed experts)
 │   ├── shared_expert.py    Nvfp4SharedExpert
 │   ├── mhc.py              mHCLayer (Sinkhorn-Knopp, residual mixing)
-│   ├── attention.py        AttentionSubBlock (CSA/HCA/SWA variants by LayerSpec)
-│   ├── norm.py             RMSNorm
-│   ├── router.py           Router (dense + hash modes)
-│   ├── embedding.py        Token embedding + mHC init
-│   └── ffn.py              FFN sub-block
-├── model/                  Model assembly
+│   └── router.py           Router (dense + hash modes)
+├── model/
 │   ├── config.py           DSV4Config
-│   ├── layer.py            TransformerLayer
-│   ├── layer_schedule.py   LayerSpec, AttentionType, build_schedule, validate_schedule
-│   ├── mtp.py              Multi-token prediction
-│   ├── sampler.py          Token sampler
-│   └── dsv4.py             Full model
-├── cache/                  KV cache infra
-│   ├── allocator.py        Memory allocator
-│   ├── block_table.py      Paged cache block table
-│   ├── manager.py          Cache manager
-│   ├── paged_cache.py      Classical paged cache (CSA/HCA)
-│   ├── state_cache.py      State cache (SWA + uncompressed tail)
-│   ├── schema.py, handle.py, flush.py, prepare_forward.py
-├── loader/                 Checkpoint I/O
-│   ├── hf_checkpoint.py
-│   └── layout_convert.py
-└── reference/              Slow PyTorch oracles (never imported by production code)
-    ├── attention.py, csa_attention.py, compressor.py, moe_pipeline.py
+│   └── sampler.py          CUDASampler
+├── reference/
+│   └── single_shot_PYTORCH_REFERENCE.py  PyTorch oracle for layer comparison tests
+└── _archive/               Dead Lineage P code (model/dsv4.py, cache/*, layers/{attention,ffn,norm,embedding}, etc.)
+                            Kept for reference; never imported by live code
 ```
 
-**Dependency arrow:** `kernels/` → `ops/` → `layers/` → `model/`. `reference/` and `loader/` are sidecars.
+**Live path:** `single_shot_inference.py` → `dsv4/layers/*` → `dsv4/ops/*` → `dsv4/kernels/**`
+
+**Attention path:** `production.py` → `fmha_multitile_op.py` → `fmha_multitile_capi.cu` → `fmha_6warp_tma_multirow_multitile.cuh`
+
+**Archived (Lineage P):** `dsv4/model/dsv4.py`, `dsv4/cache/*`, `dsv4/layers/{attention,ffn,norm,embedding}` — these were the vLLM/sglang integration surface but have 0 importers. See `_archive/` if needed.
 
 ---
 
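The "0 importers" statement for the archived modules can be checked mechanically. The sketch below uses Python's standard `ast` module and assumes the sources are passed in as a mapping from file name to text. `imported_modules` and `count_importers` are hypothetical helpers, not code from this repository, and they only see absolute, static imports (no relative imports, no `importlib` calls).

```python
import ast

def imported_modules(source):
    """Return the dotted module names that one Python source file imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module)
            # `from pkg import name` may load the submodule pkg.name, so count it too.
            found.update(f"{node.module}.{alias.name}" for alias in node.names)
    return found

def count_importers(target, sources):
    """Count files that import `target` or any of its submodules."""
    def hits(name):
        return name == target or name.startswith(target + ".")
    return sum(
        any(hits(mod) for mod in imported_modules(text))
        for text in sources.values()
    )

sources = {
    "single_shot_inference.py": "from dsv4.layers import moe\nimport dsv4.ops.quantize\n",
    "other.py": "import os\n",
}
live = count_importers("dsv4.layers.moe", sources)      # imported by one file
archived = count_importers("dsv4.model.dsv4", sources)  # imported by none
```

Because `from pkg import name` is counted as a possible submodule import, the helper can overcount but should not undercount static absolute imports, which is the safer direction when deciding whether a module can be archived.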