cute.arch.fmin/fmax take scalar Float32 operands, not TensorSSA. Replace them with cute.where() and arithmetic so the code works on TensorSSA values. Also change the subtile loop to unroll=1 for cute.where() compatibility.
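
For illustration, here is a minimal plain-Python sketch of the select-based rewrite, not the kernel code itself. The `where` helper is a hypothetical stand-in that only mirrors a `where(cond, x, y)` calling shape on Python lists; `elementwise_min`, `elementwise_max`, and `clamp` are likewise names made up for this example. It assumes the replaced fmin/fmax calls amount to element-wise min/max.

```python
# Hypothetical stand-in for an element-wise select. It mirrors the
# where(cond, x, y) calling shape only; it is NOT the CuTe DSL implementation.
def where(cond, x, y):
    return [xi if c else yi for c, xi, yi in zip(cond, x, y)]


def elementwise_min(a, b):
    # min(a, b): pick a where a < b, otherwise b.
    return where([ai < bi for ai, bi in zip(a, b)], a, b)


def elementwise_max(a, b):
    # max(a, b): pick a where a > b, otherwise b.
    return where([ai > bi for ai, bi in zip(a, b)], a, b)


def clamp(x, lo, hi):
    # Broadcast the scalar bounds by hand, then clamp x to [lo, hi]
    # using only compare-and-select operations.
    lo_v = [lo] * len(x)
    hi_v = [hi] * len(x)
    return elementwise_min(elementwise_max(x, lo_v), hi_v)


vals = [-2.0, 0.5, 3.0]
clamped = clamp(vals, 0.0, 1.0)
print(clamped)
```

One caveat with this pattern: a comparison-based select returns the second operand whenever the comparison is false, so a NaN in the second operand propagates, whereas C-style fmin/fmax return the non-NaN operand. Whether that difference matters depends on the inputs the kernel sees.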