nvfp4-megamoe-kernel

Files

biondizzle 4336de9372 attention/: Clean up folder, archive backups, add detailed status headers

What changed:
- Moved fmha_backup_pre_epilog.py, fmha_backup_v2.py, fmha_smem_acc.py to archive/
- Deleted fmha.py.backup (git has history)
- Added detailed heredoc headers to ALL files documenting:
  * WHAT WORKS and WHAT'S BROKEN
  * WHY each limitation exists (CuTeDSL toolchain gaps)
  * KEY INSIGHTS FOR NVIDIA (what CuTeDSL is missing)
  * What each file unblocks if fixed

File status:
  fmha.py                 — CuTeDSL FMHA, cos 0.999998, D1.5 workaround
  fmha_common.cuh         — Raw CUDA shared defs (BF16, TMEM ops)
  fmha_sm100.cuh          — Raw CUDA reference, cos 0.999999
  fmha_epilogue_sm100.cuh — Raw CUDA TMEM epilogue, HANGS (needs debug)
  fmha_sm100_launch.cu    — PyTorch binding (JIT broken, nvcc works)
  production.py           — CuTeDSL production wrapper (partial)
  archive/                — Historical backups with explanation headers
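
Note on the "cos" figures above: they are presumably the cosine similarity between a kernel's output and a reference output, though the commit message does not spell this out. A minimal sketch of that metric under this assumption, in plain Python, follows. The function name and the sample values are hypothetical and do not come from the repository.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for a flattened reference output and a kernel output.
reference = [0.25, -1.5, 3.0, 0.125]
kernel_out = [0.2501, -1.4998, 3.0004, 0.1249]
score = cosine_similarity(reference, kernel_out)
print(f"cos = {score:.6f}")
```

A score very close to 1.0 only shows that the two outputs point in the same direction. It is insensitive to a uniform scale error, so a check like this is usually paired with a maximum-absolute-error comparison.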

2026-05-28 07:01:33 +00:00

attention

attention/: Clean up folder, archive backups, add detailed status headers

2026-05-28 07:01:33 +00:00

cache

KV Cache: schema, allocator, pools, manager, append_swa kernel

2026-05-22 00:08:38 +00:00

compressor

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

cuda

fix quantize_nvfp4 kernel: use proven single-thread-per-CTA pattern from deinterleave_quantize.cu