DeepGEMM

Files

biondizzle fbdddaccf4 revert: restore mxf4nvf4/block16 code (correct path for sm_100a)

Reverted to commit 36b439e's NVFP4 kernel code:
- kGranK=16, mxf4nvf4.block_scale.scale_vec::4X
- float_ue4m3_t instruction descriptor
- Block16 SF layout (4X TMEM)
- UE4M3 L1 epilogue
- No UE4M3→UE8M0 conversion, no block16→block32 merge

The mxf4nvf4.scale_vec::4X PTX instruction compiles successfully
on both sm_100 and sm_100f with CUDA 13.0. The error in build 17
therefore likely had a different cause than the arch flag.

Python: reverted transform_nvfp4_weights_for_mega_moe to use
pack_ue4m3_to_int32 with gran_k=16, no UE8M0 conversion.
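
The packing step described above can be sketched as follows. This is a minimal illustration, not the repository's pack_ue4m3_to_int32: the helper names, the little-endian byte order (first scale in the lowest byte), and the plain-list inputs are all assumptions made for the example. It only shows how gran_k=16 determines the number of scale factors per row and how four 8-bit UE4M3 scale bytes could share one int32 word.

```python
from typing import List

GRAN_K = 16  # one scale factor per 16 elements along K (from the commit message)


def num_scales_per_row(k: int, gran_k: int = GRAN_K) -> int:
    """Number of scale factors covering a row of length k (hypothetical helper)."""
    if k % gran_k != 0:
        raise ValueError(f"k={k} is not a multiple of gran_k={gran_k}")
    return k // gran_k


def pack_scale_bytes_sketch(scale_bytes: List[int]) -> List[int]:
    """Pack groups of four raw 8-bit scale values into signed int32 words.

    Assumption: byte j of each group lands in bits [8*j, 8*j+8) of the word.
    The real kernel layout may order or interleave the bytes differently.
    """
    if len(scale_bytes) % 4 != 0:
        raise ValueError("number of scale bytes must be a multiple of 4")
    packed = []
    for i in range(0, len(scale_bytes), 4):
        word = 0
        for j, b in enumerate(scale_bytes[i:i + 4]):
            if not 0 <= b <= 0xFF:
                raise ValueError(f"scale byte out of range: {b}")
            word |= b << (8 * j)
        # Reinterpret the unsigned 32-bit pattern as a signed int32.
        if word >= 1 << 31:
            word -= 1 << 32
        packed.append(word)
    return packed


if __name__ == "__main__":
    k = 128
    n = num_scales_per_row(k)       # scale factors for one row of length 128
    raw = [0x38] * n                # 0x38 is the E4M3 bit pattern for 1.0
    print(n, [hex(w) for w in pack_scale_bytes_sketch(raw)])
```

With gran_k=16 a row of length 128 carries 8 scale bytes, which pack into 2 int32 words under this assumed layout. A block32 layout would halve both counts.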

2026-05-11 15:02:47 +00:00

ld_st.cuh

Add various optimizations and Mega MoE benchmarks (#316)

2026-04-24 18:41:37 +08:00

tcgen05.cuh

revert: restore mxf4nvf4/block16 code (correct path for sm_100a)

2026-05-11 15:02:47 +00:00

tma.cuh

[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304)

2026-04-17 09:45:14 +08:00

utils.cuh

[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304)

2026-04-17 09:45:14 +08:00

wgmma.cuh

[Public release 26/04] Introducing Mega MoE, FP4 Indexer and other features/fixes (#304)

2026-04-17 09:45:14 +08:00