DeepGEMM/deep_gemm/include
biondizzle 091b974736 fix: L1 epilogue uses STSM with XOR swizzle for E2M1 FP4 output
Keep STSM (rather than a naive SMEM write) so that TMA reads the correct bank layout.
Pack 4 E2M1 nibbles into uint32 per STSM atom with XOR swizzle.
Known perf note: the L1 output uses a 32B swizzle zone (acceptable to land for v1).
2026-05-11 20:57:34 +00:00
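
The nibble packing and XOR swizzle named in the commit message can be illustrated with a small host-side sketch. This is not the commit's code: `pack_e2m1x4` and `swizzle_32b` are hypothetical helper names, the nibble order (nibble 0 in the lowest bits) is an assumption, and the swizzle assumes the common 32-byte pattern in which byte-address bit 7 is XORed into bit 4. The actual STSM atom layout in the kernel may differ.

```cpp
#include <cstdint>

// Hypothetical helper: pack four 4-bit E2M1 codes into one 32-bit word.
// Assumption: nibble i lands in bits [4*i, 4*i+3]; the upper 16 bits stay zero.
constexpr uint32_t pack_e2m1x4(uint8_t n0, uint8_t n1, uint8_t n2, uint8_t n3) {
    return  (uint32_t(n0) & 0xFu)
         | ((uint32_t(n1) & 0xFu) << 4)
         | ((uint32_t(n2) & 0xFu) << 8)
         | ((uint32_t(n3) & 0xFu) << 12);
}

// Hypothetical helper: assumed 32B swizzle on a shared-memory byte offset.
// Bit 7 of the offset selects whether the 16-byte chunk (bit 4) is flipped,
// so consecutive 128-byte groups alternate which half of each 32B row they hit.
constexpr uint32_t swizzle_32b(uint32_t byte_offset) {
    return byte_offset ^ ((byte_offset & 0x80u) >> 3);
}
```

Because the XOR only modifies bit 4 and leaves bit 7 untouched, applying `swizzle_32b` twice returns the original offset, which is what lets a reader that uses the same pattern recover the logical layout.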