- fmha_multihead_capi.cu: pure C API wrapper, no ATen/pybind11 deps - fmha_multihead_op.py: nvcc precompile + ctypes load (sm_100a) - Removed fmha_multihead_launch.cu (ATen approach didn't work) - Updated test to call kernel directly via ctypes API
- fmha_multihead_capi.cu: pure C API wrapper, no ATen/pybind11 deps - fmha_multihead_op.py: nvcc precompile + ctypes load (sm_100a) - Removed fmha_multihead_launch.cu (ATen approach didn't work) - Updated test to call kernel directly via ctypes API