**Root cause:** The SF (scale factor) remap kernel in `cutlass_nvfp4_gemm.cu` used `cute::size(layout_sf)` as the iteration bound instead of `cute::cosize(layout_sf)`. `size` returns the logical element count; `cosize` returns the physical extent, including tile padding. The destination buffer was correctly allocated with `cosize` elements and zero-initialized, but the kernel iterated over only `size` elements, leaving the tile-padding positions at zero instead of their actual SF values.
**Why it was invisible in the all-ones test:** When all SF values are identical (uniform data), missing writes don't matter — every position should have the same value, and the ones that got written have the right one. The standalone test from the previous session used a single global scale for all blocks, producing uniform SF, which is why it showed cosine 1.0.
**Why it broke with real data:** Different blocks have different SF values. The tile-padding positions in the CUTLASS interleaved SF layout need specific SF values, but they were left as zero. CUTLASS reads those positions during the GEMM, getting zero scales instead of the correct values, which scrambles the output direction while preserving approximate magnitude.
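To make the uniform versus non-uniform distinction concrete, here is a minimal pure-Python sketch of per-block scale computation. It assumes the usual NVFP4 convention of 16-element blocks and a maximum FP4 (E2M1) magnitude of 6.0. The name `block_scales` is hypothetical, and the scales are kept as plain floats rather than encoded the way the real kernel stores them.

```python
BLOCK = 16      # assumed NVFP4 block size (elements sharing one scale factor)
FP4_MAX = 6.0   # largest magnitude representable in FP4 E2M1


def block_scales(values, block=BLOCK):
    """Return one scale per block so the block's max |value| maps to FP4_MAX."""
    if len(values) % block != 0:
        raise ValueError("length must be a multiple of the block size")
    scales = []
    for start in range(0, len(values), block):
        amax = max(abs(v) for v in values[start:start + block])
        scales.append(amax / FP4_MAX)
    return scales


uniform = [1.0] * 32               # all-ones style input: every block looks alike
varied = [0.1] * 16 + [3.0] * 16   # two blocks with different dynamic ranges

print(block_scales(uniform))  # both blocks share one scale
print(block_scales(varied))   # each block gets its own scale
```

With uniform input every block ends up with the same scale, so a test built on it cannot distinguish one SF position from another. Only inputs like `varied` exercise the per-position placement of scale factors.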
- Deterministic prompt "The capital of France is" → `-W'MSG173 ~SB…abych` instead of "Paris"
- No NaN/Inf, magnitudes reasonable, but cosine similarity ≈ 0 between NVFP4 GEMM and BF16 reference
## How We Found It
### Step 1: Pipeline trace
Added debug prints at every stage (L1 GEMM, SiLU, L2 GEMM, scatter). All magnitudes reasonable, no NaN. The signal was present but buried.
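A per-stage check of this kind might look like the sketch below. It is a pure-Python stand-in, not the actual debug code: `stage_stats` and the stage name are hypothetical, and the real pipeline operates on GPU tensors rather than lists.

```python
import math


def stage_stats(name, values):
    """Summarize one pipeline stage: non-finite count and max finite magnitude."""
    finite = [v for v in values if math.isfinite(v)]
    return {
        "stage": name,
        "non_finite": len(values) - len(finite),
        "max_abs": max((abs(v) for v in finite), default=0.0),
    }


print(stage_stats("l1_gemm", [0.5, -2.0, 1.25]))
```

Statistics like these catch NaN/Inf and blown-up magnitudes, but they say nothing about whether the values point in the right direction, which is why this step alone did not localize the fault.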
### Step 2: BF16 reference comparison
Built a reference path that dequantizes FP4→BF16 and runs a standard matmul. Compared to the CUTLASS GEMM output. **Result: cosine ≈ 0** across all 8 TP ranks — the GEMM output was essentially uncorrelated with the correct answer.
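For reference, the comparison metric can be sketched in a few lines of pure Python. This is an illustrative helper only (`cosine_similarity` is a hypothetical name); the actual comparison ran on flattened GEMM outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero inputs
    return dot / (norm_a * norm_b)


reference = [3.0, 4.0]
rotated = [-4.0, 3.0]  # same magnitude as the reference, orthogonal direction

print(cosine_similarity(reference, reference))
print(cosine_similarity(reference, rotated))
```

The second pair shows why magnitude checks passed while the output was garbage: two vectors can have identical norms and still be completely uncorrelated.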
The all-ones test passing proved the GEMM math and data layout were correct. Random data failing proved the SF handling was broken for non-uniform values.
### Step 4: Found the bug
The `.cu` file had a comment on lines 114-115 explicitly warning: "Allocation must use cute::cosize() (physical size including tile padding), not cute::size() (logical size)." All allocation sites used `cosize` correctly, but the **iteration bound** in the remap kernel (line 128) still used `size`. It was the one line we missed in the earlier `size`→`cosize` audit.
Previously identified (May 13). The original DeepGEMM kernel crashed and was replaced with a CUTLASS-based implementation. The standalone test showed cosine = 1.0, but only with uniform SF data.
Investigated but ruled out. Both `stage_activation` and checkpoint weights use the same packing (even→low nibble, odd→high nibble). The all-ones test proved packing is correct.
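The packing convention described above (even index in the low nibble, odd index in the high nibble) can be sketched as follows. The helper names `pack_fp4` and `unpack_fp4` are hypothetical and work on raw 4-bit codes; they are not the `stage_activation` implementation.

```python
def pack_fp4(codes):
    """Pack 4-bit codes two per byte: even index -> low nibble, odd -> high nibble."""
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 0xF for c in codes):
        raise ValueError("each code must fit in 4 bits")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_fp4(packed):
    """Inverse of pack_fp4: low nibble first, then high nibble."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)
        codes.append(byte >> 4)
    return codes


codes = [0x1, 0x2, 0xA, 0xF]
packed = pack_fp4(codes)
print(packed.hex())
print(unpack_fp4(packed) == codes)
```

A round trip like this only shows that one side is self-consistent. Ruling packing out also requires that both producers agree on the nibble order, which is what the comparison above established.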
**Status: Deferred.** The checkpoint has `o_a_proj.weight` as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time. This is a potential source of quality loss but is NOT the cause of the garbage output (the GEMM bug was). May revisit for quality optimization after the kernel fix is confirmed.
**Status: CONFIRMED.** Cosine similarity ≈ 0 between NVFP4 GEMM and BF16 dequantized reference across all 8 TP ranks. This proved the problem was in the CUTLASS GEMM itself, not upstream.
### 8. ✅ CUTLASS SF remap `size` vs `cosize` bug — ROOT CAUSE
**Status: FIXED (commit `c384198`).** The SF remap kernel iterated over `cute::size()` (logical) instead of `cute::cosize()` (physical with tile padding). Tile-padding positions in the CUTLASS interleaved SF layout were never written and stayed zero. With uniform SF (all-ones test) the bug was invisible. With non-uniform SF (real data) it produced cosine ≈ 0.
**Bug:** In `cutlass_nvfp4_gemm.cu` line 128, the SF remap kernel used `cute::size(layout_sf)` as the iteration bound instead of `cute::cosize(layout_sf)`. `size` returns the logical element count; `cosize` returns the physical size including tile padding. The destination buffer was correctly allocated with `cosize` elements and zero-initialized, but the kernel only wrote to `size` positions, leaving the tile-padding positions at zero.
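The gap between the two quantities can be modeled without CUTLASS. The sketch below is a simplified Python model, not the kernel: the layout (shape `(4, 3)`, strides `(1, 8)`) is a made-up padded layout, `size` and `cosize` are reimplemented here for non-negative strides, and the loop assumes the kernel walks flat destination offsets, which the notes above do not spell out.

```python
from math import prod


def size(shape):
    """Logical element count: product of the extents."""
    return prod(shape)


def cosize(shape, stride):
    """Physical extent: largest reachable offset plus one (non-negative strides)."""
    return 1 + sum((s - 1) * d for s, d in zip(shape, stride))


# Hypothetical padded layout: 3 columns of 4 elements, columns spaced 8 apart.
shape, stride = (4, 3), (1, 8)

n_logical = size(shape)             # 4 * 3
n_physical = cosize(shape, stride)  # 1 + 3*1 + 2*8

dst = [0] * n_physical              # allocated with cosize and zero-initialized
for offset in range(n_logical):     # buggy bound: size instead of cosize
    dst[offset] = 1                 # stand-in for writing a remapped SF value

unwritten = [i for i, v in enumerate(dst) if v == 0]
print(n_logical, n_physical, unwritten)
```

In this model the buffer is large enough, so nothing crashes or reads out of bounds. The only symptom is a tail of offsets that silently keeps its initial zeros.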
**Why it was missed in the previous audit:** We changed all *allocation* sites from `size` to `cosize` (lines 179, 180, 232, 246, 287). The comment on lines 114-115 explicitly warned about this. But the *iteration bound* in the remap kernel itself (line 128) was overlooked — it was a different context (kernel launch parameter, not buffer allocation).
**Why the standalone test passed:** The previous standalone test used a single global scale for all blocks, producing uniform SF values. When all SF values are identical, missing writes don't matter — every position gets the same value regardless of which positions are written. The all-ones test in this session (M=1, N=32, K=32, cosine=1.0) confirmed this.