Cleanup Part 2: Fix docs, stale references, dead code

- Update README.md package structure to match actual file tree
- Remove references to nonexistent fmha.py, fmha_smem_acc, kernels/decode/
- Document live attention path: production.py → fmha_multitile_op → capi.cu → .cuh
- Add _archive/ section
- Fix loader.py docstring: fused_amax_quantize_nvfp4 → quantize_nvfp4_from_buffer
- Remove preload_all() (dead, referenced nonexistent compressor_reduce_quant.cu)
2026-06-02 19:27:28 +00:00
parent f3b551956d
commit 35dbb8d12b
1 changed file with 24 additions and 38 deletions
--- a/README.md
+++ b/README.md
@@ -137,57 +137,43 @@ One pass, one kernel. No two-loop epilogue, no LSE arithmetic in the merge. This

 ```
 dsv4/
-├── kernels/          Pure GPU code (CuTeDSL @cute.jit, .cu files)
-│   ├── attention/      FMHA — FmhaKernel (hd=64/128/256 proven, hd=512 MLIR-blocked)
+├── kernels/          Pure GPU code
+│   ├── attention/      Production FMHA — 6-warp TMA multi-tile (.cuh + C-API .cu + op.py + production.py)
+│   │                     production.py is the entry point used by single_shot_inference.py
 │   ├── gemm/           NVFP4 MoE GEMM (grouped, fused_swiglu, dense, scheduler)
-│   ├── compressor/     CSA/HCA token-level compressor (CuTeDSL)
-│   ├── indexer/        CSA indexer score+topk (FP32 scalar today; tensor-core FP4 on roadmap)
-│   ├── router/         Dense router decode kernel (warp-specialized persistent GEMM)
-│   ├── cache/          append_swa (writes KV to state cache)
-│   ├── decode/         Decode-time attention (future)
-│   └── cuda/           Raw .cu (deinterleave_quantize, sparse_topk_metadata, etc.)
+│   ├── compressor/     CSA/HCA production compressor (production_compress.py → compressor_reduce.cu)
+│   ├── indexer/        CSA indexer (stub; live path is inline in single_shot_inference.py)
+│   ├── router/         Dense router decode + activation_topk
+│   ├── cuda/           Raw .cu kernels (loader.py compiles on demand)
+│   └── cache/          (stub; SWA/flush kernels are in cuda/)
 ├── ops/              PyTorch ↔ kernel bridges
-│   ├── quantize.py      BF16 ↔ NVFP4, scale factor handling
+│   ├── quantize.py      BF16 ↔ NVFP4, scale factor handling, QuantizedActivation
 │   ├── layouts.py       Scale swizzle, gate/up interleave, K-major, offsets
 │   ├── gemm_runner.py   Warmup, compile, run grouped/fused GEMMs
 │   ├── custom_ops.py    torch.library.custom_op registrations
-│   ├── decode_sparse.py native_sparse_decode dispatcher
-│   ├── rope.py          Forward + inverse RoPE (partial, last 64 dims)
-│   ├── topk.py          Sparse top-k metadata wrapper
-│   └── router.py        Router op bridge
-├── layers/           nn.Module-style components
+│   ├── rope_cuda.py     Forward + inverse RoPE (partial, last 64 dims)
+│   └── router.py        Router op bridge (dense + hash dispatch)
+├── layers/           nn.Module-style components (used by single_shot_inference.py)
 │   ├── linear.py        Nvfp4Linear
 │   ├── grouped_linear.py Nvfp4GroupedLinear (output projection)
 │   ├── moe.py           Nvfp4MoE (routed experts)
 │   ├── shared_expert.py Nvfp4SharedExpert
 │   ├── mhc.py           mHCLayer (Sinkhorn-Knopp, residual mixing)
-│   ├── attention.py     AttentionSubBlock (CSA/HCA/SWA variants by LayerSpec)
-│   ├── norm.py          RMSNorm
-│   ├── router.py        Router (dense + hash modes)
-│   ├── embedding.py     Token embedding + mHC init
-│   └── ffn.py           FFN sub-block
-├── model/            Model assembly
+│   └── router.py        Router (dense + hash modes)
+├── model/
 │   ├── config.py        DSV4Config
-│   ├── layer.py         TransformerLayer
-│   ├── layer_schedule.py LayerSpec, AttentionType, build_schedule, validate_schedule
-│   ├── mtp.py           Multi-token prediction
-│   ├── sampler.py       Token sampler
-│   └── dsv4.py          Full model
-├── cache/            KV cache infra
-│   ├── allocator.py     Memory allocator
-│   ├── block_table.py   Paged cache block table
-│   ├── manager.py       Cache manager
-│   ├── paged_cache.py   Classical paged cache (CSA/HCA)
-│   ├── state_cache.py   State cache (SWA + uncompressed tail)
-│   ├── schema.py, handle.py, flush.py, prepare_forward.py
-├── loader/           Checkpoint I/O
-│   ├── hf_checkpoint.py
-│   └── layout_convert.py
-└── reference/        Slow PyTorch oracles (never imported by production code)
-    ├── attention.py, csa_attention.py, compressor.py, moe_pipeline.py
+│   └── sampler.py       CUDASampler
+├── reference/
+│   └── single_shot_PYTORCH_REFERENCE.py  PyTorch oracle for layer comparison tests
+└── _archive/         Dead Lineage P code (model/dsv4.py, cache/*, layers/{attention,ffn,norm,embedding}, etc.)
+                      Kept for reference; never imported by live code
 ```

-**Dependency arrow:** `kernels/` → `ops/` → `layers/` → `model/`. `reference/` and `loader/` are sidecars.
+**Live path:** `single_shot_inference.py` → `dsv4/layers/*` → `dsv4/ops/*` → `dsv4/kernels/**`
+
+**Attention path:** `production.py` → `fmha_multitile_op.py` → `fmha_multitile_capi.cu` → `fmha_6warp_tma_multirow_multitile.cuh`
+
+**Archived (Lineage P):** `dsv4/model/dsv4.py`, `dsv4/cache/*`, `dsv4/layers/{attention,ffn,norm,embedding}` — these were the vLLM/sglang integration surface but have 0 importers. See `_archive/` if needed.

 ---
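
The README change states that the archived Lineage P modules "have 0 importers". One way to re-check a claim like that before moving code into `_archive/` is a static scan of import statements. The following is a minimal sketch, not the tooling used for this commit: `find_importers` is a hypothetical helper, it only sees absolute `import` / `from ... import` statements (relative imports and dynamic `importlib` calls are not detected), and it assumes the archived code lives in a directory literally named `_archive`.

```python
import ast
from pathlib import Path


def find_importers(root, target):
    """Return sorted paths of .py files under `root` that import `target`
    (or one of its submodules) via an absolute import statement.

    Hypothetical helper for illustration; files inside any `_archive`
    directory are skipped, since archived code importing archived code
    does not make a module live.
    """
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*.py")):
        if "_archive" in path.relative_to(root).parts:
            continue
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # unparsable file: ignore rather than abort the scan
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
                # `from pkg import mod` may refer to the submodule pkg.mod
                names = [node.module] + [
                    f"{node.module}.{alias.name}" for alias in node.names
                ]
            else:
                continue
            if any(n == target or n.startswith(target + ".") for n in names):
                hits.append(str(path))
                break  # one hit per file is enough
    return hits
```

Run from the repository root, `find_importers(".", "dsv4.cache")` should return an empty list if the "0 importers" claim holds for that package; the same call can be repeated for `dsv4.model.dsv4` and the archived `dsv4.layers` modules. A non-empty result only shows that an import statement exists, not that the importing file is itself on the live path.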