Tests the NVIDIA reference kernel with our quantization pipeline: 1. Quantize BF16 → NVFP4 (our stage_activation logic) 2. Pad and swizzle scale factors (to_blocked) 3. Run ScaledGroupedGemmKernel (2Dx3D scenario) 4. Compare against BF16 matmul reference Also adds cutedsl/moe.py module for the future pipeline integration.