nvfp4-megamoe-kernel

Files

biondizzle e0f60b9f05 Fix fused router: plain ints for mma_tiler + @cute.jit pattern

Root cause of the previous crash: wrapping mma_inst_shape_mn in
cutlass.Int32(128) caused _unpack_x_tuple to fail in
cute.size(tiled_mma.shape_mnk, mode=[2]).

The fused_swiglu kernel uses plain Python ints for mma_tiler_mnk and
mma_inst_shape_mn, NOT cutlass.Int32. Inside @cute.jit, CuTeDSL
auto-converts plain ints to MLIR values, so the Int32 wrapping was
unnecessary and actively harmful.

Pattern: same as fused_swiglu.py __call__:
- @cute.jit compiled_fn takes CuTe tensors
- _setup_attributes called inside JIT (needs MLIR context)
- cute.compile at the end
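
The control flow of the pattern above can be sketched in plain Python.
This is an illustration only, not the kernel's code: `jit` and
`compile_fn` are hypothetical stand-ins for @cute.jit and cute.compile
so the sketch runs without CuTeDSL installed, `FusedRouterSketch` is a
made-up class name, and the tile sizes other than 128 are assumed.

```python
def jit(fn):
    # Hypothetical stand-in for @cute.jit: only marks the function.
    fn.is_jit = True
    return fn


def compile_fn(fn, *args):
    # Hypothetical stand-in for cute.compile: trace once, return a callable.
    assert getattr(fn, "is_jit", False), "expected a jit-decorated function"
    fn(*args)
    return fn


class FusedRouterSketch:
    def __init__(self):
        # Plain Python ints, not cutlass.Int32(...) wrappers.
        # The concrete values are illustrative assumptions.
        self.mma_inst_shape_mn = (128, 128)
        self.mma_tiler_mnk = (128, 128, 64)
        self.setup_calls = 0

    def _setup_attributes(self):
        # In the real kernel this step needs an active MLIR context,
        # which is why it is called from inside the JIT function.
        assert all(type(v) is int for v in self.mma_tiler_mnk)
        assert all(type(v) is int for v in self.mma_inst_shape_mn)
        self.setup_calls += 1

    def __call__(self, a, b):
        @jit
        def compiled_fn(a, b):
            # Setup happens inside the JIT-decorated function.
            self._setup_attributes()
            return (a, b)

        # Compilation is the last step of __call__.
        return compile_fn(compiled_fn, a, b)
```

Calling `FusedRouterSketch()(a, b)` runs setup once during the stand-in
"compile" and returns the callable, mirroring the order of the three
bullets above.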

2026-06-01 10:37:15 +00:00

cache

E1: Wire LayerCacheHandle gather methods + CUDA gather kernels

2026-05-30 21:09:21 +00:00

kernels

Fix fused router: plain ints for mma_tiler + @cute.jit pattern

2026-06-01 10:37:15 +00:00

layers

Wire NVFP4 fused router kernel into e2e single-shot pipeline

2026-06-01 09:47:48 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering