Cleanup Part 2: Fix docs, stale references, dead code

- Update README.md package structure to match actual file tree
  - Remove references to nonexistent fmha.py, fmha_smem_acc, kernels/decode/
  - Document live attention path: production.py → fmha_multitile_op → capi.cu → .cuh
  - Add _archive/ section
- Fix loader.py docstring: fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer
- Remove preload_all() (dead, referenced nonexistent compressor_reduce_quant.cu)
2026-06-02 19:27:28 +00:00
parent f3b551956d
commit 35dbb8d12b


@@ -137,57 +137,43 @@ One pass, one kernel. No two-loop epilogue, no LSE arithmetic in the merge. This
```
dsv4/
├── kernels/ Pure GPU code (CuTeDSL @cute.jit, .cu files)
│ ├── attention/ FMHA — FmhaKernel (hd=64/128/256 proven, hd=512 MLIR-blocked)
├── kernels/ Pure GPU code
│ ├── attention/ Production FMHA — 6-warp TMA multi-tile (.cuh + C-API .cu + op.py + production.py)
│ │ production.py is the entry point used by single_shot_inference.py
│ ├── gemm/ NVFP4 MoE GEMM (grouped, fused_swiglu, dense, scheduler)
│ ├── compressor/ CSA/HCA token-level compressor (CuTeDSL)
│ ├── indexer/ CSA indexer score+topk (FP32 scalar today; tensor-core FP4 on roadmap)
│ ├── router/ Dense router decode kernel (warp-specialized persistent GEMM)
│ ├── cache/ append_swa (writes KV to state cache)
│ ├── decode/ Decode-time attention (future)
│ └── cuda/ Raw .cu (deinterleave_quantize, sparse_topk_metadata, etc.)
│ ├── compressor/ CSA/HCA production compressor (production_compress.py → compressor_reduce.cu)
│ ├── indexer/ CSA indexer (stub; live path is inline in single_shot_inference.py)
│ ├── router/ Dense router decode + activation_topk
│ ├── cuda/ Raw .cu kernels (loader.py compiles on demand)
│ └── cache/ (stub; SWA/flush kernels are in cuda/)
├── ops/ PyTorch ↔ kernel bridges
│ ├── quantize.py BF16 ↔ NVFP4, scale factor handling
│ ├── quantize.py BF16 ↔ NVFP4, scale factor handling, QuantizedActivation
│ ├── layouts.py Scale swizzle, gate/up interleave, K-major, offsets
│ ├── gemm_runner.py Warmup, compile, run grouped/fused GEMMs
│ ├── custom_ops.py torch.library.custom_op registrations
│ ├── decode_sparse.py native_sparse_decode dispatcher
│ ├── rope.py Forward + inverse RoPE (partial, last 64 dims)
│ ├── topk.py Sparse top-k metadata wrapper
│ └── router.py Router op bridge
├── layers/ nn.Module-style components
│ ├── rope_cuda.py Forward + inverse RoPE (partial, last 64 dims)
│ └── router.py Router op bridge (dense + hash dispatch)
├── layers/ nn.Module-style components (used by single_shot_inference.py)
│ ├── linear.py Nvfp4Linear
│ ├── grouped_linear.py Nvfp4GroupedLinear (output projection)
│ ├── moe.py Nvfp4MoE (routed experts)
│ ├── shared_expert.py Nvfp4SharedExpert
│ ├── mhc.py mHCLayer (Sinkhorn-Knopp, residual mixing)
│ ├── attention.py AttentionSubBlock (CSA/HCA/SWA variants by LayerSpec)
│ ├── norm.py RMSNorm
│ ├── router.py Router (dense + hash modes)
│ ├── embedding.py Token embedding + mHC init
│ └── ffn.py FFN sub-block
├── model/ Model assembly
│ └── router.py Router (dense + hash modes)
├── model/
│ ├── config.py DSV4Config
│ ├── layer.py TransformerLayer
│ ├── layer_schedule.py LayerSpec, AttentionType, build_schedule, validate_schedule
│ ├── mtp.py Multi-token prediction
│ ├── sampler.py Token sampler
│ └── dsv4.py Full model
├── cache/ KV cache infra
│ ├── allocator.py Memory allocator
│ ├── block_table.py Paged cache block table
│ ├── manager.py Cache manager
│ ├── paged_cache.py Classical paged cache (CSA/HCA)
│ ├── state_cache.py State cache (SWA + uncompressed tail)
│ ├── schema.py, handle.py, flush.py, prepare_forward.py
├── loader/ Checkpoint I/O
│ ├── hf_checkpoint.py
│ └── layout_convert.py
└── reference/ Slow PyTorch oracles (never imported by production code)
├── attention.py, csa_attention.py, compressor.py, moe_pipeline.py
│ └── sampler.py CUDASampler
├── reference/
│ └── single_shot_PYTORCH_REFERENCE.py PyTorch oracle for layer comparison tests
└── _archive/ Dead Lineage P code (model/dsv4.py, cache/*, layers/{attention,ffn,norm,embedding}, etc.)
Kept for reference; never imported by live code
```
**Dependency arrow:** `kernels/` → `ops/` → `layers/` → `model/`. `reference/` and `loader/` are sidecars.
**Live path:** `single_shot_inference.py` → `dsv4/layers/*` → `dsv4/ops/*` → `dsv4/kernels/**`
**Attention path:** `production.py` → `fmha_multitile_op.py` → `fmha_multitile_capi.cu` → `fmha_6warp_tma_multirow_multitile.cuh`
**Archived (Lineage P):** `dsv4/model/dsv4.py`, `dsv4/cache/*`, `dsv4/layers/{attention,ffn,norm,embedding}` — these were the vLLM/sglang integration surface but have 0 importers. See `_archive/` if needed.
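The "0 importers" claim can be re-checked before anything else is moved to `_archive/`. The sketch below is one way to do that and is not part of the repository: `imported_modules` and `find_importers` are hypothetical helper names, and it assumes absolute imports only (relative imports such as `from . import x` are ignored).

```python
import ast
from pathlib import Path


def imported_modules(source: str) -> set[str]:
    """Return dotted module names referenced by absolute imports in `source`."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module)
            # `from pkg import sub` may bind the submodule pkg.sub, so record it too.
            names.update(f"{node.module}.{alias.name}" for alias in node.names)
    return names


def find_importers(root: Path, target: str, skip_dirs=("_archive",)) -> list[Path]:
    """List .py files under `root` that import `target` or one of its submodules."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        # Archived code is allowed to import archived code, so skip it.
        if any(part in skip_dirs for part in path.relative_to(root).parts):
            continue
        try:
            mods = imported_modules(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # unparsable file: ignore rather than abort the scan
        if any(m == target or m.startswith(target + ".") for m in mods):
            hits.append(path)
    return hits
```

Running `find_importers(Path("."), "dsv4.cache")` from the repository root would list any live file that still imports the archived cache package; an empty list is consistent with the claim above. Dynamic imports (for example via `importlib`) are not detected.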
---