DeepGEMM/deep_gemm/include
biondizzle 680874d067 NVFP4 L1 epilogue: group_size=16 SF layout
- Single amax per warp (16 N-elements = 1 SF group, no warp-pair reduction)
- Single sf_val instead of sf.x/sf.y split
- All 4 warps write SF (k_idx = n_block_idx*4 + warp_idx_in_wg)
- Remove dead SMEM amax storage, reclaim barrier offset space
- Remove dead __syncwarp after register-local amax
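The layout described in the bullets above can be sketched on the host as follows. This is a minimal illustration, not the kernel code from this commit: the helper names (group_amax, group_sf, sf_k_idx) are hypothetical, and the scale rule amax / 6.0 (6.0 being the largest finite FP4 E2M1 magnitude) is an assumption, since the commit message does not state how sf_val is derived from amax. Only the index formula k_idx = n_block_idx*4 + warp_idx_in_wg and the group size of 16 are taken from the commit message.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// One scale-factor (SF) group covers 16 N-elements (group_size=16).
constexpr int kGroupSize = 16;
// Four warps per warpgroup; per the commit message, each one writes an SF.
constexpr int kWarpsPerWarpgroup = 4;
// Largest finite magnitude representable in FP4 E2M1.
constexpr float kFp4E2M1Max = 6.0f;

// Hypothetical helper: absolute maximum over one 16-element group.
// In the kernel this value stays in registers of a single warp, so no
// warp-pair reduction and no shared-memory staging are needed.
inline float group_amax(const std::array<float, kGroupSize>& values) {
    float amax = 0.0f;
    for (float v : values) {
        amax = std::max(amax, std::fabs(v));
    }
    return amax;
}

// Hypothetical helper: a single scale value per group (one sf_val rather
// than an sf.x/sf.y pair). ASSUMPTION: the scale maps amax onto the FP4 range.
inline float group_sf(float amax) {
    return amax / kFp4E2M1Max;
}

// SF index along K as given in the commit message:
// k_idx = n_block_idx * 4 + warp_idx_in_wg.
inline int sf_k_idx(int n_block_idx, int warp_idx_in_wg) {
    return n_block_idx * kWarpsPerWarpgroup + warp_idx_in_wg;
}
```

With this indexing, the four warps of a warpgroup write four consecutive SF entries for each N block, so for example N block 2 covers k_idx 8 through 11.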
2026-05-12 07:08:08 +00:00