nvfp4-megamoe-kernel

Files

biondizzle 5c746bbdf2 fix: TensorSSA-compatible clamp in fused SwiGLU kernel

cute.arch.fmin/fmax take scalar Float32, not TensorSSA.
Replace with cute.where() and arithmetic for TensorSSA compatibility.
Also changed subtile loop to unroll=1 for cute.where() compatibility.

2026-06-02 08:15:46 +00:00
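The clamp change described in the commit message above can be illustrated with a minimal sketch. This is not the kernel code: plain Python lists stand in for TensorSSA values, `where` and `clamp_via_where` are hypothetical helpers written for this example (they are not the CuTe DSL API), and the bounds of -7.0 and 7.0 are arbitrary illustrative values. It only shows how a min/max-style clamp can be rewritten as elementwise comparisons plus selects, which is the general idea behind swapping scalar-only `fmin`/`fmax` for a `where`-style select.

```python
def where(mask, a, b):
    """Elementwise select: take a[i] where mask[i] is true, otherwise b[i]."""
    return [x if m else y for m, x, y in zip(mask, a, b)]


def clamp_via_where(values, lo, hi):
    """Clamp each element to [lo, hi] using only comparisons and selects."""
    n = len(values)
    lo_vec = [lo] * n  # broadcast the scalar bounds to vector shape
    hi_vec = [hi] * n
    # Upper bound: keep x where x < hi, otherwise take hi.
    capped = where([x < hi for x in values], values, hi_vec)
    # Lower bound: keep x where x > lo, otherwise take lo.
    return where([x > lo for x in capped], capped, lo_vec)


# Hypothetical activations; the bounds are illustrative only.
clamped = clamp_via_where([-9.0, -1.5, 0.0, 2.5, 11.0], lo=-7.0, hi=7.0)
print(clamped)
```

Each step operates on the whole vector at once, so no per-element scalar call is needed; the same shape of computation is what a vector-valued select allows inside a kernel.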

cache

E1: Wire LayerCacheHandle gather methods + CUDA gather kernels

2026-05-30 21:09:21 +00:00

kernels

fix: TensorSSA-compatible clamp in fused SwiGLU kernel

2026-06-02 08:15:46 +00:00

layers

Add set_fused_swiglu() method to Nvfp4MoE

2026-06-02 07:59:57 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

model

P0 COMPLETE: Eliminate ALL .item() CPU-GPU syncs from NVFP4 activation path

2026-06-01 21:05:03 +00:00

ops

Fix gsa_buffer shape mismatch for MoE (M>1 rows)

2026-06-01 21:33:59 +00:00

reference

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00