What changed:
- Moved fmha_backup_pre_epilog.py, fmha_backup_v2.py, fmha_smem_acc.py to archive/
- Deleted fmha.py.backup (git has history)
- Added detailed heredoc headers to ALL files documenting:
  * WHAT WORKS and WHAT'S BROKEN
  * WHY each limitation exists (CuTeDSL toolchain gaps)
  * KEY INSIGHTS FOR NVIDIA (what CuTeDSL is missing)
  * What each file unblocks if fixed

File status:
  fmha.py                 - CuTeDSL FMHA, cosine similarity 0.999998, D1.5 workaround
  fmha_common.cuh         - Raw CUDA shared definitions (BF16, TMEM ops)
  fmha_sm100.cuh          - Raw CUDA reference, cosine similarity 0.999999
  fmha_epilogue_sm100.cuh - Raw CUDA TMEM epilogue, HANGS (needs debugging)
  fmha_sm100_launch.cu    - PyTorch binding (JIT build broken, nvcc build works)
  production.py           - CuTeDSL production wrapper (partial)
  archive/                - Historical backups with explanation headers
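
The cosine-similarity figures (cos) in the file status above presumably compare a kernel's output against a reference output. A minimal sketch of such a check, assuming both outputs have been flattened to equal-length lists of floats; the function name and sample values are hypothetical and not taken from the repository:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical values standing in for a reference output and a kernel output.
reference = [0.10, -0.25, 0.40, 0.05]
kernel_out = [0.10, -0.25, 0.40, 0.0501]
print(cosine_similarity(reference, kernel_out))
```

A value very close to 1.0 only shows that the two outputs point in the same direction; it does not catch a uniform scaling error, so a max-absolute-error check is a useful companion.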