Copied the kernel sources from the CUTLASS examples, so there is no
longer a runtime dependency on /root/cutlass/examples/. Rewrote all
imports to use cutedsl.kernel.* instead of blackwell.kernel.*.
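The import rewrite can be sketched roughly as below. This is an illustration only, not the script actually used for this change: the helper names rewrite_import and rewrite_source are hypothetical, and it assumes every affected line is a plain "import" or "from ... import" statement.

```python
import re

# Match the old package prefix only when it starts the module path of an
# import statement, so strings and comments elsewhere are left untouched.
_OLD_PREFIX = re.compile(r"^(\s*(?:from|import)\s+)blackwell\.kernel\b")


def rewrite_import(line: str) -> str:
    """Rewrite one line from blackwell.kernel.* to cutedsl.kernel.*."""
    return _OLD_PREFIX.sub(r"\1cutedsl.kernel", line)


def rewrite_source(text: str) -> str:
    """Apply rewrite_import to every line, preserving line endings."""
    return "".join(rewrite_import(line) for line in text.splitlines(keepends=True))
```

Anchoring the pattern at the start of the statement keeps the rewrite from touching unrelated text that happens to mention the old package name.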
Structure:
  cutedsl/__init__.py
  cutedsl/kernel/__init__.py
  cutedsl/kernel/moe/               (the MoE scaled grouped GEMM)
  cutedsl/kernel/blockscaled_gemm/  (dense blockscaled GEMM)
Updated test_cutedsl.py to import from the local copy.
The test runs the NVIDIA reference kernel with our quantization pipeline:
1. Quantize BF16 → NVFP4 (our stage_activation logic)
2. Pad and swizzle scale factors (to_blocked)
3. Run ScaledGroupedGemmKernel (2Dx3D scenario)
4. Compare against BF16 matmul reference
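For readers unfamiliar with the format, the numeric side of steps 1, 2 and 4 can be sketched in plain Python. This is not the stage_activation or to_blocked code from this repository. It assumes the usual NVFP4 conventions (E2M1 values, one scale per 16 elements) and a 128x4 scale-factor tile, keeps scales as ordinary floats instead of rounding them to FP8, and skips the swizzle itself. All function names are hypothetical.

```python
# Magnitudes representable in FP4 E2M1.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed: one scale factor per 16 elements


def quantize_block(values):
    """Return (E2M1 values, scale) for one block; the scale maps amax to 6."""
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    quantized = []
    for v in values:
        # Pick the nearest representable magnitude, then restore the sign.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        quantized.append(-mag if v < 0 else mag)
    return quantized, scale


def fake_quantize(row):
    """Quantize then dequantize a row, block by block."""
    out = []
    for start in range(0, len(row), BLOCK):
        q, scale = quantize_block(row[start:start + BLOCK])
        out.extend(x * scale for x in q)
    return out


def padded_scale_shape(rows, cols, tile_rows=128, tile_cols=4):
    """Scale-matrix shape after padding to whole tiles (assumed 128x4)."""
    def round_up(n, m):
        return -(-n // m) * m
    return round_up(rows, tile_rows), round_up(cols, tile_cols)


def max_rel_error(out, ref):
    """Largest absolute error relative to the largest reference magnitude."""
    denom = max(abs(r) for r in ref) or 1.0
    return max(abs(o - r) for o, r in zip(out, ref)) / denom
```

With this grid the widest gap is between 4 and 6, so the dequantized value differs from the input by at most one sixth of the block's largest magnitude. A comparison against a BF16 reference therefore needs a loose tolerance rather than an exact match.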
Also adds a cutedsl/moe.py module for future pipeline integration.