74dba6ab9d
auto: pre-test commit
2026-05-28 04:40:20 +00:00
43f0b5d1e8
D1.5: Fix O rescale with paired atoms (incremental approach)
Keep epilogue_tma_store for final output (proven path).
Only fix the multi-KV-tile O rescale using paired atoms from
epilogue_tmem_copy_and_partition. The paired atoms share addressing,
making the TMEM->REGS->modify->TMEM cycle lossless.
Guarded by const_expr(n_kv_tiles > 1) so single-tile path (n=128)
is completely unaffected — zero regression risk.
Full correction epilogue (one-way TMEM->REGS->SMEM->GMEM) deferred
until we can address the MLIR compilation time issue.
2026-05-26 19:34:26 +00:00
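Note on the D1.5 commit above: the O rescale it refers to is the usual online-softmax correction applied when a query row sees more than one KV tile. The sketch below is a plain-Python reference for a single query row, not the kernel code. The function names, the `tile` parameter, and the list-based data layout are hypothetical and chosen only for illustration; TMEM, paired atoms, and `epilogue_tmem_copy_and_partition` are not modeled.

```python
import math

def attention_row_tiled(scores, values, tile):
    """Online-softmax attention for one query row, one KV tile at a time.

    Hypothetical reference helper: `scores` holds the row's logits and
    `values` the matching V rows (lists of floats).
    """
    head_dim = len(values[0])
    m = -math.inf          # running row maximum
    l = 0.0                # running softmax denominator
    o = [0.0] * head_dim   # unnormalized output accumulator

    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        m_new = max(m, max(s_tile))

        # Rescale what was accumulated under the old maximum.
        # On the first tile nothing has been accumulated, so the factor is 0.
        alpha = 0.0 if m == -math.inf else math.exp(m - m_new)
        l *= alpha
        o = [alpha * x for x in o]

        # Accumulate this tile under the new maximum.
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - m_new)
            l += p
            o = [acc + p * vi for acc, vi in zip(o, v)]
        m = m_new

    return [x / l for x in o]

def attention_row_reference(scores, values):
    """Direct softmax(scores) @ values for the same row, for comparison."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    z = sum(p)
    head_dim = len(values[0])
    return [sum(pi * v[d] for pi, v in zip(p, values)) / z
            for d in range(head_dim)]
```

With a single tile the rescale step is a no-op on an empty accumulator, which matches the commit's point that the single-tile path does not need it. With several tiles, a lossy read-modify-write of `o` would show up as a mismatch against `attention_row_reference`.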
f97aee6eed
plan update
2026-05-26 19:00:22 +00:00
32850f6974
Update README, STAGE_D, STAGE_D2 with D1 rescale findings and D2 status
2026-05-25 01:18:48 +00:00
9435bf9653
Restore NVFP4 Precision Roadmap + add O rescale gap to D1.5
2026-05-24 21:48:58 +00:00
dadfad8f89
Docs: Update STAGE_D.md, README.md with hd=512 compilation blocker, lessons learned
2026-05-24 21:35:25 +00:00
6be7690011
Docs: Update STAGE_D.md, README.md status for D1 hd≤256 milestone
2026-05-24 04:32:43 +00:00
53efb0c95e
Update STAGE_D.md with D5b results: merge cos 0.961, LSE err=0.0
2026-05-23 21:45:22 +00:00
4ed2b46020
D5b MILESTONE: SWA+sink merge works! cos 0.969
- Run FMHA twice (compressed KV + SWA KV) with normalized O + LSE
- Merge with sink weights in Python
- LSE err=0.0, merge cos=0.969 PASS
- Update STAGE_D.md: D5b done, D5c/D5d are optimizations
2026-05-23 21:36:26 +00:00
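Note on the D5b commit above: merging two normalized attention outputs by their LSE values can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's merge code. It assumes the sink is a single extra logit that enlarges the softmax denominator without contributing a value vector; the function name and argument names are hypothetical.

```python
import math

def merge_with_sink(o_a, lse_a, o_b, lse_b, sink_logit):
    """Merge two normalized partial outputs for one query row.

    o_a, o_b   : normalized outputs of the two attention passes
    lse_a/b    : log-sum-exp of each pass's logits
    sink_logit : assumed sink term; it adds to the denominator only
    Returns (merged output, merged LSE including the sink).
    """
    # Subtract the largest exponent for numerical stability.
    m = max(lse_a, lse_b, sink_logit)
    w_a = math.exp(lse_a - m)
    w_b = math.exp(lse_b - m)
    w_sink = math.exp(sink_logit - m)
    denom = w_a + w_b + w_sink

    merged = [(w_a * a + w_b * b) / denom for a, b in zip(o_a, o_b)]
    return merged, m + math.log(denom)
```

Passing `sink_logit=-math.inf` disables the sink and reduces this to the plain two-way LSE merge, which is a convenient check against a single attention pass over the concatenated KV.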
a629babb6a
Update STAGE_D.md: D5a done, CG-2/CG-3 status updated, tOrP0 offset rule added
2026-05-23 21:16:52 +00:00
ee969d4c46
Update STAGE_D.md: manual SMEM addressing blocked on layout mapping
2026-05-23 19:22:28 +00:00
841db091f7
auto: pre-test commit
2026-05-23 19:20:42 +00:00
4d6acaeef0
auto: pre-test commit
2026-05-23 19:14:02 +00:00
dc2c9ffb92
Update STAGE_D.md with current action plan - starting NVFP4-0 verification and D1.3 validation on B200
2026-05-23 19:09:56 +00:00
f0f78b804c
📋 Update STAGE_D.md: D1.3 ✅ SOLVED, D1.4 ✅ IMPLEMENTED, D1.5 🟡 complex refactor, checklist updated
2026-05-23 18:37:53 +00:00
d995cd0c5c
🎉 Mark D1.3 as SOLVED! SMEM-P rank mismatch fixed, enables hd>64 support
2026-05-23 18:26:15 +00:00
a3659c581d
Update STAGE_D.md checklist with current progress and lessons learned
2026-05-23 09:27:48 +00:00
241b49b1ee
docs: add NVFP4 precision roadmap to STAGE_D.md (3 honest buckets + speculative bucket)
2026-05-23 07:39:09 +00:00
73fa8a2b70
shit carmine left dangling
2026-05-23 06:55:22 +00:00
bd2da14ca6
D1.2: TMEM budget verified on B200. Split-PV mandatory at hd=512 (MMA max N=256)
2026-05-23 06:43:01 +00:00
580d2f6999
STAGE_D.md: restructure with correctness gaps, TMEM budget, execution order
2026-05-23 06:31:37 +00:00
249a581d8a
Add STAGE_D.md: step-by-step runbook and todo list for D1-D5
2026-05-23 05:52:03 +00:00