Added detailed documentation of the packed FP4 architecture: - mxf4nvf4 reads packed (2 per byte), NOT unpacked like mxf8f6f4 - SMEM layout: float_e2m1_t, BLOCK_K/2 swizzle, UMMA desc byte math - L1 epilogue: st.shared.u16, no swizzle, kWarpBytesPerRow - Host TMA: hidden/2 K-dim, block_k/2 inner, fp4_unpacked_smem=false - Build history through Build 35
The build 17-18 'scale_vec not supported on sm_100f' error was because we targeted sm_100 instead of sm_100a. The 'a' suffix is required for FP4 block-scaled MMA instructions. Reverting to mxf4nvf4 with correct arch target is the path forward.