nvfp4-megamoe-kernel/tests
biondizzle 696462f07a feat: TMA async load infrastructure for FMHA kernel
- fmha_tma.cuh: TMA descriptor creation, mbarrier helpers, cp.async.bulk.tensor.2d wrappers
- fmha_6warp_tma.cuh: TMA-integrated multirow kernel with async GMEM→SMEM loads
  - TMA loads Q, K, V tiles to row-major SMEM
  - Transposes to canonical K-major layout for MMA
  - Same softmax/epilogue as non-TMA kernel
- test_fmha_tma.cu: Test harness for TMA FMHA (HD=64 first)
2026-05-29 04:36:52 +00:00