75f1c8544b
fix: remove smem_inner_dim doubling for packed FP4 TMA — must match MMA row width (BLOCK_K/2)
2026-05-12 17:14:44 +00:00
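The reasoning behind the commit above can be sketched in a few lines: packed FP4 stores two 4-bit values per byte, so a tile row of BLOCK_K logical elements spans BLOCK_K/2 stored bytes, and doubling that width no longer matches what the MMA reads. The helper names below (`packed_fp4_row_bytes`, `pack_fp4_pair`) and the low-nibble-first order are assumptions made for illustration; they are not taken from the repository.

```cpp
#include <cassert>
#include <cstdint>

// Two 4-bit FP4 codes share one byte in the packed layout.
constexpr uint32_t kFp4PerByte = 2;

// Hypothetical helper: stored bytes spanned by one row of BLOCK_K logical
// FP4 elements. This is the BLOCK_K/2 width the commit subject refers to.
constexpr uint32_t packed_fp4_row_bytes(uint32_t block_k) {
    return block_k / kFp4PerByte;
}

// Hypothetical helper: pack two 4-bit codes into one byte.
// Low-nibble-first ordering is an assumption of this sketch.
constexpr uint8_t pack_fp4_pair(uint8_t lo, uint8_t hi) {
    return static_cast<uint8_t>((lo & 0x0F) | ((hi & 0x0F) << 4));
}

static_assert(packed_fp4_row_bytes(128) == 64, "128 FP4 values occupy 64 bytes");
```

With BLOCK_K = 128, the row is 64 bytes wide; a doubled inner dimension would describe 128 bytes, which is twice the span of the packed row.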
49e5646b42
fix: remove duplicate kInt8 case — kPackedFP4 is already kInt8
kPackedFP4 = torch::kInt8, so the kInt8 case was a duplicate.
The real fix was in mega_nvfp4.hpp: changing kUInt8→kInt8 so
tensors match the existing kPackedFP4 path in the TMA switch.
2026-05-11 22:55:28 +00:00
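Why the extra case had to go can be shown with a small self-contained sketch: when kPackedFP4 is an alias for kInt8, a switch that lists both labels contains the same case value twice, which C++ rejects at compile time. The enums `ScalarType` and `TmaDataType` and the function `to_tma_dtype` are stand-ins invented for this sketch (for torch's scalar types and the CUDA `CU_TENSOR_MAP_DATA_TYPE_*` values); the actual switch in runtime_utils.hpp may look different.

```cpp
#include <cassert>
#include <stdexcept>

// Stand-in enums (assumptions) so the sketch compiles without torch or CUDA.
enum class ScalarType { kUInt8, kInt8, kBFloat16, kFloat };
enum class TmaDataType { UINT8, BFLOAT16, FLOAT32 };

// Per the commit message, kPackedFP4 is simply kInt8.
constexpr ScalarType kPackedFP4 = ScalarType::kInt8;

// Hypothetical dtype switch: kInt8 tensors are already covered by kPackedFP4.
TmaDataType to_tma_dtype(ScalarType dtype) {
    switch (dtype) {
        case kPackedFP4:            return TmaDataType::UINT8;
        // case ScalarType::kInt8:  // duplicate case value, would not compile
        case ScalarType::kBFloat16: return TmaDataType::BFLOAT16;
        case ScalarType::kFloat:    return TmaDataType::FLOAT32;
        default: throw std::invalid_argument("dtype not handled by this sketch");
    }
}
```

In this sketch a kUInt8 tensor falls through to the default branch, which mirrors the motivation given in the commit message for switching the tensors in mega_nvfp4.hpp to kInt8.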
80df24a641
fix: add kInt8 dtype support to TMA descriptor + change activation tensors to kInt8
- runtime_utils.hpp: added kInt8 -> CU_TENSOR_MAP_DATA_TYPE_UINT8 mapping
- mega_nvfp4.hpp: changed activation tensor dtypes from kUInt8 to kInt8
(same byte layout, but kInt8 is recognized by the TMA dtype switch)
2026-05-11 22:54:47 +00:00
Chenggang Zhao
7f2a703ed5
[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304)
* Merge with private repo
* Update README
* Update README
* Update README
* Add PyTorch requirements
* Fix sync scopes for MQA logits (#256)
* Update README
2026-04-17 09:45:14 +08:00
Zhean Xu
0f5f266202
Multiple updates and refactorings (#280)
2026-01-16 17:06:52 +08:00
Ray Wang
38f8ef73a4
Multiple updates and refactorings (#231)
2025-11-21 17:49:47 +08:00
Chenggang Zhao
07b82fb8cd
Fix old CUDA compatibility
2025-10-01 20:29:15 +08:00
Ray Wang
3f71de7aa9
Make various updates and fixes (#198)
2025-09-25 16:19:07 +08:00
Ray Wang
9da4a23561
Add more GPU architectures support (#112)
* Add more GPU architectures support
* Update layout.py
* Optimize performance, Add SM90 support, Add 1D2D SM100 support
* Add fmtlib submodule at commit 553ec11
---------
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-07-18 11:32:22 +08:00