**Status:** The scale-factor (SF) remap is CORRECT, and the GEMM is mathematically correct. The 0.2 cosine against the BF16 reference is a **red herring**: our Python dequantization reference is wrong, not the GEMM. The vLLM pipeline still produces garbage, so the bug is elsewhere (candidates: A/B packing, activation staging, weight transform, or the BF16 reference itself).
The BF16 reference comparison has been the primary diagnostic throughout this session. It showed cosine ≈ 0 initially, then ≈ 0.2 after fixes. We assumed the reference was correct and the GEMM was wrong. **This was a false assumption.**
3. The cosine gap did not move across multiple SF remap rewrites: it stayed at ≈0.2 whether we used reverse mapping, forward mapping, hierarchical coords, or flat coords
4. The GEMM's internal math is provably correct when SF values are placed correctly (test #1)
- The reference manually unpacks E2M1 nibbles, looks up `_E2M1_MAGNITUDES`, multiplies by block scales and global scales
- The CUTLASS kernel uses the same E2M1 values and scale factors but may apply them in a different order or with different precision semantics (e.g., the per-element multiply order is `A_fp4 * SFA_fp8 * B_fp4 * SFB_fp8`)
- The reference doesn't account for how CUTLASS internally handles the stride-0 SF aliasing (16 K elements sharing one SF byte)
- **The 0.2 cosine is a systematic error in the reference, not the GEMM**
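
To make the reference's steps concrete, here is a minimal sketch of the dequantization path described above: unpack nibbles, look up the E2M1 magnitude, then multiply by the block scale and the global scale. It is not our actual reference code. The names `E2M1_MAGNITUDES`, `decode_e2m1`, `unpack_nibbles`, and `dequantize_row` are hypothetical, the block scales are assumed to be already decoded to floats, and the nibble order within a byte is an assumption exposed as the `low_first` flag.

```python
# E2M1 magnitudes indexed by the low 3 bits of the 4-bit code.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

BLOCK_SIZE = 16  # K elements sharing one block scale, as in the notes above


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def unpack_nibbles(packed, low_first=True):
    """Split each byte into two 4-bit codes. The nibble order is an assumption."""
    codes = []
    for byte in packed:
        lo, hi = byte & 0xF, (byte >> 4) & 0xF
        codes.extend((lo, hi) if low_first else (hi, lo))
    return codes


def dequantize_row(packed, block_scales, global_scale, low_first=True):
    """Dequantize one row along K: decoded value * block scale * global scale."""
    codes = unpack_nibbles(packed, low_first)
    if len(codes) != BLOCK_SIZE * len(block_scales):
        raise ValueError("expected one block scale per 16 K elements")
    return [
        decode_e2m1(code) * block_scales[i // BLOCK_SIZE] * global_scale
        for i, code in enumerate(codes)
    ]
```

Flipping `low_first` on a known input is a cheap way to check whether nibble order alone explains a mismatch, before suspecting anything on the kernel side.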
**Lesson:** A wrong reference is worse than no reference. It sends you chasing ghosts. The SF remap went through 8+ iterations that all produced the same 0.2 cosine — because the 0.2 was never about the remap.
Original had `get<0..2>` = M, `get<4..5>` = K. Mike corrected: first group IS M/N, second IS K. Correct inverse: `mn = f0 + 32*f1 + 128*f2`, `k_sf = f4 + 4*f5` (f3 is stride-0, ignored).
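
The inverse formula can be checked on the host with a small round-trip sketch. The function names are hypothetical, and the forward split is only what the multipliers 32, 128, and 4 imply (extent 32 for `f0`, 4 for `f1`, 4 for `f4`); it is an assumption, not a copy of the CUTLASS layout definition.

```python
def sf_coords_to_logical(f0, f1, f2, f3, f4, f5):
    """Inverse map from the note above; f3 is the stride-0 mode and is ignored."""
    mn = f0 + 32 * f1 + 128 * f2
    k_sf = f4 + 4 * f5
    return mn, k_sf


def logical_to_sf_coords(mn, k_sf, f3=0):
    """Forward split implied by the multipliers 32, 128 and 4 (assumed extents)."""
    f0 = mn % 32
    f1 = (mn // 32) % 4
    f2 = mn // 128
    f4 = k_sf % 4
    f5 = k_sf // 4
    return f0, f1, f2, f3, f4, f5


# Round trip over a small grid: every (mn, k_sf) must map back to itself.
for mn in range(256):
    for k_sf in range(16):
        assert sf_coords_to_logical(*logical_to_sf_coords(mn, k_sf)) == (mn, k_sf)
```

Because `f3` never enters the formula, every one of the K elements that share an SF byte resolves to the same `(mn, k_sf)` pair.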
The index was declared as `int dst_idx = 0;` and assigned inside `if (LayoutRank == 2) {...} else if (LayoutRank == 3) {...}`. When neither branch matched, `dst_idx` stayed 0 and every thread wrote to `dst[0]`. Fix: compute the index branchlessly with `layout_sf(make_coord(...))`.
The element count came from `cute::cosize(layout)`, which includes padding. CUTLASS examples use `cute::size(cute::filter_zeros(layout))`, which gives the actual number of stored elements.
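
To show how the three counts can differ, here is a small Python model of them for flat (non-hierarchical) layouts with non-negative strides. It is a sketch of the definitions, not CuTe itself, and the helper names and example shapes are made up for illustration.

```python
from math import prod


def size(shape):
    """Number of logical coordinates: the product of all mode extents."""
    return prod(shape)


def cosize(shape, stride):
    """Largest offset the layout can produce, plus one (non-negative strides)."""
    return 1 + sum((s - 1) * d for s, d in zip(shape, stride))


def filtered_size(shape, stride):
    """Size after dropping stride-0 modes, modelling size(filter_zeros(layout))."""
    return prod(s for s, d in zip(shape, stride) if d != 0)


# Padded layout: 3 columns of 4 elements, each column starting 8 apart.
padded_shape, padded_stride = (4, 3), (1, 8)
print(size(padded_shape), cosize(padded_shape, padded_stride),
      filtered_size(padded_shape, padded_stride))

# Broadcast layout: the middle mode has stride 0, so 16 coordinates alias one slot.
bcast_shape, bcast_stride = (4, 16, 3), (1, 0, 4)
print(size(bcast_shape), cosize(bcast_shape, bcast_stride),
      filtered_size(bcast_shape, bcast_stride))
```

In the padded case the cosize (20) exceeds the 12 elements actually stored. In the broadcast case the plain size (192) counts every aliased coordinate, while the filtered size (12) counts each stored element once.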