nvfp4-megamoe-kernel

Author	SHA1	Message	Date
biondizzle	2ef71dc21a	fix: B tensor K-major strides, scale_b axis swap. Two fixes: (1) B tensor: permute(0,2,1).contiguous().permute(0,2,1) gives K-major stride (16384,1,128), matching the reference; (2) scale_b: transpose to (N, K_sf) before swizzling, since the reference uses (intermediate, hidden//16), not (hidden//16, intermediate).	2026-05-16 03:04:31 +00:00
biondizzle	6294b84213	fix: B tensor must be K-major (transpose last 2 dims). Reference shows B stride=(16384,1,128): K is stride-1 (K-major). Our stack produces N-major stride=(16384,128,1). Added .T.contiguous().	2026-05-16 03:03:00 +00:00
biondizzle	7c882fe2e0	fix: correct weight quantization for CuTeDSL kernel. Weight K dimension (hidden) must be the packed dimension, not N. Block scales are computed along the K dim. FP4 packing is along K.	2026-05-16 02:58:55 +00:00
biondizzle	ca28f1335d	refactor: copy CuTeDSL kernel into repo with local imports. Copied from CUTLASS examples (no more runtime dependency on /root/cutlass/examples/). Fixed all imports to use cutedsl.kernel.* instead of blackwell.kernel.*. Structure: cutedsl/__init__.py; cutedsl/kernel/__init__.py; cutedsl/kernel/moe/ (the MoE scaled grouped GEMM); cutedsl/kernel/blockscaled_gemm/ (dense blockscaled GEMM). test_cutedsl.py updated to import from our local copy.	2026-05-16 02:57:54 +00:00
biondizzle	a3aa2d201e	fix: clarify import path setup for CuTeDSL	2026-05-16 02:55:25 +00:00
biondizzle	f951d284e7	test: add CuTeDSL NVFP4 GEMM test using reference ScaledGroupedGemmKernel. Tests the NVIDIA reference kernel with our quantization pipeline: (1) quantize BF16 → NVFP4 (our stage_activation logic); (2) pad and swizzle scale factors (to_blocked); (3) run ScaledGroupedGemmKernel (2Dx3D scenario); (4) compare against BF16 matmul reference. Also adds cutedsl/moe.py module for the future pipeline integration.	2026-05-16 02:55:04 +00:00

6 Commits
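
The stride change described in commits 6294b84213 and 2ef71dc21a can be checked with plain stride arithmetic. The sketch below is only a model: it applies permute and contiguous to (shape, stride) tuples instead of calling PyTorch. The helper names are hypothetical, the expert count of 4 is an arbitrary assumption, and the two trailing dimensions of 128 are inferred from the strides quoted in the commit messages.

```python
def contiguous_strides(shape):
    """Row-major (last dim stride-1) element strides for a dense tensor of this shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def permute(shape, strides, order):
    """Reorder dims without moving data, the way a permuted view does."""
    return tuple(shape[i] for i in order), tuple(strides[i] for i in order)


def k_major_layout(shape):
    """Model permute(0,2,1).contiguous().permute(0,2,1) on an (E, K, N) tensor."""
    # Step 1: view the tensor as (E, N, K); only the shape matters for step 2.
    swapped_shape, _ = permute(shape, contiguous_strides(shape), (0, 2, 1))
    # Step 2: contiguous() repacks the data row-major for the swapped shape,
    # so K (now the last dim) becomes stride-1.
    packed_strides = contiguous_strides(swapped_shape)
    # Step 3: permute back to the logical (E, K, N) shape, keeping the new strides.
    return permute(swapped_shape, packed_strides, (0, 2, 1))


# Assumed shape: 4 experts (arbitrary), K = N = 128 (inferred from the strides).
b_shape = (4, 128, 128)
n_major = contiguous_strides(b_shape)
k_major_shape, k_major = k_major_layout(b_shape)

print(n_major)  # (16384, 128, 1): N is stride-1, the layout the stack produced
print(k_major)  # (16384, 1, 128): K is stride-1, the layout the reference shows
```

The logical shape is unchanged by the round trip; only the memory order differs, which is why the fix needs the second permute rather than a bare transpose plus contiguous.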