Reference shows B stride=(16384,1,128) — K is stride-1 (K-major). Our stack produces N-major stride=(16384,128,1). Added .T.contiguous().
Reference shows B stride=(16384,1,128) — K is stride-1 (K-major). Our stack produces N-major stride=(16384,128,1). Added .T.contiguous().