Commit Graph

6 Commits

Author SHA1 Message Date
2ef71dc21a fix: B tensor K-major strides, scale_b axis swap
Two fixes:
1. B tensor: permute(0,2,1).contiguous().permute(0,2,1) gives K-major
   stride (16384,1,128) matching reference
2. scale_b: transpose to (N, K_sf) before swizzling — reference uses
   (intermediate, hidden//16) not (hidden//16, intermediate)
2026-05-16 03:04:31 +00:00
6294b84213 fix: B tensor must be K-major (transpose last 2 dims)
Reference shows B stride=(16384,1,128) — K is stride-1 (K-major).
Our stack produces N-major stride=(16384,128,1). Added .T.contiguous().
2026-05-16 03:03:00 +00:00
7c882fe2e0 fix: correct weight quantization for CuTeDSL kernel
Weight K dimension (hidden) must be the packed dimension, not N.
Block scales computed along K dim. FP4 packing along K.
2026-05-16 02:58:55 +00:00
ca28f1335d refactor: copy CuTeDSL kernel into repo with local imports
Copied from CUTLASS examples (no more runtime dependency on
/root/cutlass/examples/). Fixed all imports to use cutedsl.kernel.*
instead of blackwell.kernel.*.

Structure:
  cutedsl/__init__.py
  cutedsl/kernel/__init__.py
  cutedsl/kernel/moe/  (the MoE scaled grouped GEMM)
  cutedsl/kernel/blockscaled_gemm/  (dense blockscaled GEMM)

test_cutedsl.py updated to import from our local copy.
2026-05-16 02:57:54 +00:00
a3aa2d201e fix: clarify import path setup for CuTeDSL 2026-05-16 02:55:25 +00:00
f951d284e7 test: add CuTeDSL NVFP4 GEMM test using reference ScaledGroupedGemmKernel
Tests the NVIDIA reference kernel with our quantization pipeline:
1. Quantize BF16 → NVFP4 (our stage_activation logic)
2. Pad and swizzle scale factors (to_blocked)
3. Run ScaledGroupedGemmKernel (2Dx3D scenario)
4. Compare against BF16 matmul reference

Also adds cutedsl/moe.py module for the future pipeline integration.
2026-05-16 02:55:04 +00:00