DeepGEMM

Author	SHA1	Message	Date
biondizzle	86a1263f44	fix: gran_k=16 in transform_sf + sm_100a arch for NVFP4 mega_moe - transform_sf_into_required_layout: add gran_k=16 branch for NVFP4 UE4M3 scales (4 per int32, group_size=16). Previously only handled 32/128. - get_arch: always return '100a' for SM100, never '100f'. The family variant lacks mxf4nvf4 (NVFP4 block-scaled MMA) support, causing 'scale_vec::4X not supported on sm_100f' errors. - transform_nvfp4_weights_for_mega_moe: fold weight_scale_2 into block scales, pack UE4M3→int32, transpose MN-major, call transform_sf_into_required_layout with gran_k=16.	2026-05-11 16:11:11 +00:00
biondizzle	fbdddaccf4	revert: restore mxf4nvf4/block16 code (correct path for sm_100a) Reverted to commit 36b439e's NVFP4 kernel code: - kGranK=16, mxf4nvf4.block_scale.scale_vec::4X - float_ue4m3_t instruction descriptor - Block16 SF layout (4X TMEM) - UE4M3 L1 epilogue - No UE4M3→UE8M0 conversion, no block16→block32 merge The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully on both sm_100 and sm_100f with CUDA 13.0. The previous build 17 error was likely from a different cause, not the arch flag. Python: reverted transform_nvfp4_weights_for_mega_moe to use pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.	2026-05-11 15:02:47 +00:00
biondizzle	dcebe033e2	fix: use scale_vec::2X (block32) for SM100 B200 compatibility scale_vec::4X (block16) requires SM103/SM120 (B300/GB300), not SM100 (B200). Revert to block32 with UE4M3 scales. Same TMEM layout as MXFP4 but with UE4M3 scale format instead of UE8M0. Changes: - kGranK: 16 → 32 - PTX: scale_vec::4X → scale_vec::2X - SF layout: same as MXFP4 (K/32, K/128 for int32 packed) - UTCCP: i8 → i4 (2X layout, same as MXFP4) - TMEM columns: same as MXFP4 (SF_BLOCK_M/32, SF_BLOCK_N/32) - Python: merge NVFP4 block16→block32 scales (max of adjacent pairs) - recipe: (1,1,16) → (1,1,32)	2026-05-11 08:36:59 +00:00
biondizzle	f98c1f7fd5	fix: add gran_k=16 (NVFP4) support to transform_sf_into_required_layout The C++ function only handled gran_k=32 and 128 (MXFP4/FP8). Added gran_k=16 for NVFP4 group_size=16 support.	2026-05-11 07:13:00 +00:00
Chenggang Zhao	7f2a703ed5	[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304) * Merge with private repo * Update README * Update README * Update README * Add PyTorch requirements * Fix sync scopes for MQA logits (#256) * Update README	2026-04-17 09:45:14 +08:00
Zhean Xu	0f5f266202	Multiple updates and refactorings (#280)	2026-01-16 17:06:52 +08:00
Ray Wang	38f8ef73a4	Multiple updates and refactorings (#231)	2025-11-21 17:49:47 +08:00
Ray Wang	3f71de7aa9	Make various updates and fixes (#198)	2025-09-25 16:19:07 +08:00
Ray Wang	f85ec649d7	Make various updates and fixes: (#164) - Add BF16 support for SM90 and SM100 - Refactor Python APIs - Other fixes and code refactoring	2025-08-15 18:32:35 +08:00

9 Commits
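
The NVFP4 commits above mention two scale-factor manipulations: packing four 8-bit UE4M3 scales into one int32, and merging block16 scales into block32 scales by taking the max of adjacent pairs. The following is a minimal pure-Python sketch of that arithmetic only, not the repository's code. The function names are hypothetical, the little-endian byte order within each word is an assumption, and the merge is shown on already-decoded float scales rather than on raw UE4M3 bit patterns.

```python
from typing import List


def num_packed_words(k: int, gran_k: int) -> int:
    """Number of int32 words needed to hold the scales for K elements.

    One scale covers gran_k elements along K, and four 8-bit scales
    share one 32-bit word.
    """
    if k % (gran_k * 4) != 0:
        raise ValueError("K must be a multiple of 4 * gran_k")
    return k // gran_k // 4


def pack_scales_sketch(scale_bytes: List[int]) -> List[int]:
    """Pack groups of four 8-bit scale codes into 32-bit words.

    Assumption: the first scale lands in the least significant byte.
    Values are returned as unsigned Python ints.
    """
    if len(scale_bytes) % 4 != 0:
        raise ValueError("number of scales must be a multiple of 4")
    words = []
    for i in range(0, len(scale_bytes), 4):
        word = 0
        for j, b in enumerate(scale_bytes[i:i + 4]):
            if not 0 <= b <= 0xFF:
                raise ValueError("each scale code must fit in 8 bits")
            word |= b << (8 * j)
        words.append(word)
    return words


def merge_block16_to_block32(scales: List[float]) -> List[float]:
    """Merge per-16-element scales into per-32-element scales.

    Each output is the max of one adjacent pair of inputs.
    """
    if len(scales) % 2 != 0:
        raise ValueError("number of block16 scales must be even")
    return [max(scales[i], scales[i + 1]) for i in range(0, len(scales), 2)]


if __name__ == "__main__":
    print(num_packed_words(128, 16))
    print([hex(w) for w in pack_scales_sketch([0x01, 0x02, 0x03, 0x04])])
    print(merge_block16_to_block32([1.0, 2.0, 0.5, 0.25]))
```

With K = 128, a granularity of 16 gives eight scales in two words, while a granularity of 32 gives four scales in one word, which matches the "K/32, K/128 for int32 packed" note in commit dcebe033e2.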