diff --git a/README.md b/README.md
index c1358dee..d43557ce 100644
--- a/README.md
+++ b/README.md
@@ -146,7 +146,7 @@ Summary
 | B | ✅ COMPLETE | QK → identity softmax → P@V pipeline (TMEM alias, KV-tile interleaving) |
 | C | ✅ COMPLETE | Real online softmax. Kernel outputs un-norm O + LSE (no TMEM round-trip). Migrated to `dsv4/kernels/attention/fmha.py` as `FmhaKernel`. |
 | D1 | 🟡 hd≤256 DONE | Parameterized HEAD_DIM. qk_mma_tiler fix (hd=64/128/256 cos 0.999998). hd=512 SMEM fits but MLIR compilation hangs (>3hr). External k_sub merge proven impossible. |
-| D2 | TODO | Multi-query grid with head packing (128 Q heads, MQA) |
+| D2 | 🟡 Per-head DONE | Multi-query grid. Per-head launch works (cos 0.999998, n_h=64 hd=64). Multi-CTA grid deferred (requires tma_partition refactor). |
 | D3 | TODO | SWA sequence length mask (swa_lens per batch) |
 | D4 | TODO | Causal mask on SWA branch only |
 | D5 | 🟢 D5a+D5b DONE | D5a: normalize flag + LSE output (err=0.0). D5b: Python SWA+sink merge (cos 0.961). D5c/D5d: fused kernel merge TODO. |