DeepGEMM/deep_gemm
biondizzle 26a8ab75a1 NVFP4: fix SF pipeline — 2 K-cols per BLOCK_K for group=16
- TMA: issue two tma::copy calls per K-block (K_box=1, 2 SF K-columns)
- UTCCP: double loop for 2 K-columns, correct SMEM offsets
- TMEM: double SFA/SFB column counts (SF_BLOCK_M/32 * 2)
- Heuristic: fix smem_size (2× SF, packed FP4 A/B, packed send buffers, no amax)
- Staging kernel: fix double-count bug in packed_k_mask
2026-05-12 08:08:17 +00:00
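
The "2 K-cols per BLOCK_K" figure in the commit subject and the "SF_BLOCK_M/32 * 2" TMEM count can be checked with a little arithmetic. The sketch below is illustrative only and is not code from this repository. It assumes BLOCK_K = 128, one-byte scale factors, and four scale factors packed into each 32-bit SF column; none of these values is stated in the commit message, and the function names are hypothetical.

```python
# Illustrative arithmetic only; not taken from the DeepGEMM sources.
# Assumptions (not stated in the commit): BLOCK_K = 128, 1-byte scale
# factors, and 4 scale factors packed per 32-bit SF K-column.

def sf_k_columns(block_k: int, group_size: int,
                 sf_bytes: int = 1, word_bytes: int = 4) -> int:
    """Packed 32-bit SF columns needed to cover one K-block."""
    if block_k % group_size != 0:
        raise ValueError("block_k must be a multiple of group_size")
    num_scales = block_k // group_size      # one scale per quantization group
    total_bytes = num_scales * sf_bytes
    return -(-total_bytes // word_bytes)    # ceiling division


def sf_tmem_columns(sf_block_m: int, k_cols: int) -> int:
    """TMEM column count following the commit's SF_BLOCK_M/32 * k_cols formula."""
    return sf_block_m // 32 * k_cols


if __name__ == "__main__":
    k_cols = sf_k_columns(block_k=128, group_size=16)
    print("SF K-columns per K-block:", k_cols)
    print("TMEM SF columns for SF_BLOCK_M=128:", sf_tmem_columns(128, k_cols))
```

Under these assumptions a group size of 32 fits in a single SF K-column per K-block, while a group size of 16 needs two, which would explain the second tma::copy call and the doubled UTCCP loop and TMEM column counts.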