Copied from CUTLASS examples (no more runtime dependency on
/root/cutlass/examples/). Fixed all imports to use cutedsl.kernel.*
instead of blackwell.kernel.*.
Structure:
cutedsl/__init__.py
cutedsl/kernel/__init__.py
cutedsl/kernel/moe/ (the MoE scaled grouped GEMM)
cutedsl/kernel/blockscaled_gemm/ (dense blockscaled GEMM)
test_cutedsl.py updated to import from our local copy.