update README: D3 and D4 status DONE

2026-05-26 10:56:57 +00:00
parent 24993428a2
commit 4656fa81f9
1 changed file with 2 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -147,8 +147,8 @@ Summary
 | C | ✅ COMPLETE | Real online softmax. Kernel outputs un-norm O + LSE (no TMEM round-trip). Migrated to `dsv4/kernels/attention/fmha.py` as `FmhaKernel`. |
 | D1 | 🟡 hd≤256 DONE | Parameterized HEAD_DIM. qk_mma_tiler fix (hd=64/128/256 cos 0.999998). hd=512 SMEM fits but MLIR compilation hangs (>3hr). External k_sub merge proven impossible. O rescale TMEM round-trip BROKEN (Ld32x32bOp/St32x32bOp corrupt data). Python KV merge workaround works. |
 | D2 | 🟡 Per-head DONE | Multi-query grid. Per-head launch works (cos 0.999998, n_h=1-64 hd=64, n_h=2-8 hd=128, n_h=2 hd=256). Multi-CTA grid blocked: `flat_divide` + `epilogue_tma_store` layout mismatch. Requires full tma_partition refactor into kernel. |
-| D3 | TODO | SWA sequence length mask (swa_lens per batch) |
-| D4 | TODO | Causal mask on SWA branch only |
+| D3 | ✅ DONE | SWA sequence length mask (in-kernel post-QK via tTMEM_LOADcS coordinates, swa_len Int32 scalar) |
+| D4 | ✅ DONE | Causal mask on SWA branch (k_coord > m_coord → -inf, combined with D3 via OR logic) |
 | D5 | 🟢 D5a+D5b DONE | D5a: normalize flag + LSE output (err=0.0). D5b: Python SWA+sink merge (cos 0.961). D5c/D5d: fused kernel merge TODO. |
 | E1-E7 | TODO | Production extraction (class, custom op, cache, cleanup) |
 | NVFP4-3 | ✅ DONE | `use_2cta_instrs` conditional in gemm_runner.py. 1.7-1.9× throughput at prefill shapes. |
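
The D3 and D4 rows describe two masks applied to the QK scores and combined with OR. The following is a minimal host-side sketch of that logic in plain Python, not the kernel code. It assumes the length mask fires when `k_coord >= swa_len` and that query and key coordinates share the same origin; the commit states neither. The names `apply_swa_masks` and `softmax_row` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def apply_swa_masks(scores, swa_len, causal=True):
    """Return a copy of scores with masked positions set to -inf.

    scores[m_coord][k_coord] is the QK logit for one query row and key column.
    """
    masked = []
    for m_coord, row in enumerate(scores):
        out_row = []
        for k_coord, value in enumerate(row):
            # Length mask (D3): assumed convention, keys at or past swa_len are invalid.
            beyond_len = k_coord >= swa_len
            # Causal mask (D4): a key after the query position is masked.
            future_key = causal and k_coord > m_coord
            # The two masks are combined with OR, as the D4 row states.
            out_row.append(NEG_INF if (beyond_len or future_key) else value)
        masked.append(out_row)
    return masked


def softmax_row(row):
    """Numerically stable softmax over one row; a fully masked row yields zeros."""
    row_max = max(row)
    if row_max == NEG_INF:
        return [0.0] * len(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    scores = [[0.0] * 4 for _ in range(3)]
    for row in apply_swa_masks(scores, swa_len=3):
        print(softmax_row(row))
```

Setting masked logits to `-inf` before the softmax gives those keys exactly zero weight, which is why the two conditions can be merged into a single predicate.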