|
|
36b439ee26
|
feat: NVFP4 mega MoE kernel (scale_vec::4X, UE4M3 block scales)
- New CUDA kernel: sm100_fp8_nvfp4_mega_moe_impl
- kGranK=16 (NVFP4 group_size=16, vs MXFP4's 32)
- kind::mxf4nvf4.block_scale.scale_vec::4X PTX instruction
- float_ue4m3_t scale factor type in instruction descriptor
- SF layout: scale_vec::4X (4 TMEM sub-columns per UMMA atom)
- UTCCP column stride: i*8 (vs MXFP4's i*4) for 4X layout
- L1 epilogue: UE4M3 activation scales (float→cutlass::float_e4m3_t)
- SF loading: kNumSFUint32 = kHidden/64 (4 UE4M3 per int32)
- New PTX wrappers: SM100_MMA_MXF4NVF4_2x1SM_SS, SM100_MMA_MXF4NVF4_SS
- Python API:
  - fp8_nvfp4_mega_moe() with recipe=(1,1,16)
  - transform_nvfp4_weights_for_mega_moe() for UE4M3→int32 UTCCP packing
  - _pack_nvfp4_sf_for_utccp() helper
- C++ bindings:
  - mega_nvfp4.hpp with NVFP4-specific SymmBuffer (SF stride K/16)
  - JIT kernel header with kGranK=16 TMA descriptors
  - Registered in python_api.cpp
NOTE: Both SFA and SFB must use UE4M3 (scale_format_ is 1-bit, shared).
The L1 epilogue converts float→UE4M3 for activation scales.
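The scale-factor arithmetic above (one UE4M3 scale per 16 elements, four scales per int32, hence kHidden/64 words per row) can be sketched as follows. This is a minimal illustration, not the code from this commit: `num_sf_words` and `pack_sf_bytes` are hypothetical names, the little-endian byte order within each word is an assumption, and the UTCCP-specific reordering done by `_pack_nvfp4_sf_for_utccp()` is not reproduced.

```python
import numpy as np

GROUP_SIZE = 16   # NVFP4 group size (kGranK) stated in the commit message
SF_PER_WORD = 4   # four 1-byte UE4M3 scale factors per 32-bit word


def num_sf_words(hidden: int) -> int:
    """32-bit words needed for one row's scale factors: hidden / 64."""
    assert hidden % (GROUP_SIZE * SF_PER_WORD) == 0
    return hidden // (GROUP_SIZE * SF_PER_WORD)


def pack_sf_bytes(sf_bytes: np.ndarray) -> np.ndarray:
    """Pack raw 1-byte scale encodings, shape (rows, hidden/16), into uint32.

    Assumption: byte i of each group of four lands in bits [8*i, 8*i+8).
    """
    rows, n = sf_bytes.shape
    assert sf_bytes.dtype == np.uint8 and n % SF_PER_WORD == 0
    grouped = sf_bytes.reshape(rows, n // SF_PER_WORD, SF_PER_WORD)
    shifts = np.arange(SF_PER_WORD, dtype=np.uint32) * 8
    # The four bytes occupy disjoint bit ranges, so OR-ing them is lossless.
    return np.bitwise_or.reduce(grouped.astype(np.uint32) << shifts, axis=-1)
```

For example, a row with kHidden=128 has 128/16 = 8 scale bytes, which pack into 128/64 = 2 words.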
|
2026-05-11 05:41:08 +00:00 |
|
Zhean Xu
|
891d57b4db
|
Add various optimizations and Mega MoE benchmarks (#316)
* Merge with private repo
* Add Mega MoE Benchmark
* Minor fix
* Update
---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
|
2026-04-24 18:41:37 +08:00 |
|
Chenggang Zhao
|
7f2a703ed5
|
[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304)
* Merge with private repo
* Update README
* Update README
* Update README
* Add PyTorch requirements
* Fix sync scopes for MQA logits (#256)
* Update README
|
2026-04-17 09:45:14 +08:00 |
|
Zhean Xu
|
0f5f266202
|
Multiple updates and refactorings (#280)
|
2026-01-16 17:06:52 +08:00 |
|
Ray Wang
|
38f8ef73a4
|
Multiple updates and refactorings (#231)
|
2025-11-21 17:49:47 +08:00 |
|
Chenggang Zhao
|
8da33d6bd9
|
Clean up
|
2025-11-19 11:00:55 +08:00 |
|
Guoteng
|
f63d7f24d6
|
fix: prevent int32 overflow in k-grouped GEMM size calculations (#226)
|
2025-11-19 10:52:08 +08:00 |
|
Simon Mo
|
59f2c07cf2
|
Add SM100 kernels (#201)
Signed-off-by: simon-mo <simon.mo@hey.com>
|
2025-09-29 17:07:28 +08:00 |
|
Chenggang Zhao
|
80ceeb2c76
|
Add SM90 kernels (#200)
|
2025-09-29 17:00:23 +08:00 |
|
Ray Wang
|
3f71de7aa9
|
Make various updates and fixes (#198)
|
2025-09-25 16:19:07 +08:00 |
|
Ray Wang
|
f85ec649d7
|
Make various updates and fixes: (#164)
- Add BF16 support for SM90 and SM100
- Refactor Python APIs
- Other fixes and code refactoring
|
2025-08-15 18:32:35 +08:00 |
|