- Add BF16 support for SM90 and SM100
- Refactor Python APIs
- Other fixes and code refactoring

- Add support for legacy CUDA versions; now compatible with CUDA 12.3 and newer
- Add support for NVRTC compilation
- Other fixes and code refactoring