Added detailed SF remap section with the empirical coordinate dump table
showing flat_rank=8 decomposition. Documented all 5 bugs found/fixed,
the diagnostic trail (constant-scale test, single-element probes), and
the 6 verification probes confirming the extraction formula.
  m    = f0 + f1*32 + f2*128   (CuTe: first sub-index varies fastest)
  k_sf = f4 + f5*4
f3 is the Step<2> stride (degenerate, always = total), NOT a coordinate.
The previous term (f3*2 + f2)*128 was catastrophically wrong: it mapped
everything to m=0 or to a huge m.
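The corrected extraction can be written as a small host-side reference for checking printf dumps. This is a sketch only: the function name is hypothetical, the flattened coordinate is assumed to arrive as an 8-tuple (f0..f7), and f6/f7 are left unused because the formula above does not reference them.

```python
def extract_logical(flat):
    """Map a flattened rank-8 SF coordinate to logical (m, k_sf).

    Hypothetical helper mirroring the formula in this commit:
      m    = f0 + f1*32 + f2*128
      k_sf = f4 + f5*4
    f3 is skipped because it is the Step<2> stride, not a coordinate.
    """
    if len(flat) != 8:
        raise ValueError(f"expected flat_rank=8, got {len(flat)}")
    f0, f1, f2, _f3, f4, f5 = flat[:6]
    m = f0 + f1 * 32 + f2 * 128
    k_sf = f4 + f5 * 4
    return m, k_sf
```

Feeding it the tuples printed by the kernel gives a quick way to compare the device-side extraction against the formula.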
Previous approach assumed rank 2-6, but the actual flat rank is 8.
For R==8:
  4 M sub-indices (inner_32, inner_4, tile_interleave, tile_m)
  4 K sub-indices (inner_16, inner_4_k, tile_k_interleave, tile_k)
  m    = (f3*2 + f2)*128 + f0*4 + f1
  k_sf = f5 + f6*4 (tentative, needs printf verification)
Added printf of all 8 flat values for the first 3 indices.
Going back to the idx2crd approach, which compiles and runs. It handles
ranks 2-6 with logical (m, k_sf) extraction.
Added printf for flat_rank, MN, K_sf, and the first coordinate extraction.
This will tell us the actual flat_rank and whether our extraction is correct.
layout_sf(m, k_elem) with flat ints fails with "Mismatched Ranks" because
the layout shape is ((32,4), K_padded), not (M, K).
Decompose m into (inner_m, sub_m) = (m/4, m%4) to match the (32,4)
sub-shape, and pass as make_tuple(make_tuple(inner, sub), k_elem).
Removed dead code from old idx2crd approach. File is now clean:
- Source-iterating SF remap kernel with layout_sf(m, k_elem)
- Zero-init dest buffers before remap
- Proper extern "C" wrapping
Ripped out idx2crd + flatten + get<> approach entirely. New kernel
iterates over source indices (m, k_group) and uses layout_sf(m, k_elem)
to compute the CUTLASS destination offset. CuTe handles nested shape
decomposition internally — no rank inspection needed.
K coordinate is in element-space (k_group * SFVecSize) as the layout
expects. Iterates over groups (not every element) since all 16 elements
within a group share one SF byte — avoids 16x redundant writes.
Grid size based on source count (MN * K_sf), not dest buffer size.
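The index math of the source-iterating kernel can be sketched on the host as follows. The function names and the 256-thread block size are assumptions for illustration, and the source buffer is assumed to be row-major over (m, k_group).

```python
def remap_launch_params(mn, k_sf, threads_per_block=256):
    """Return (total work items, grid size) based on the SOURCE count."""
    total = mn * k_sf
    blocks = (total + threads_per_block - 1) // threads_per_block
    return total, blocks


def source_coord(idx, k_sf, sf_vec_size=16):
    """Decode a linear source index into (m, k_group, k_elem).

    k_elem is the element-space K coordinate the layout expects:
    one work item per SF group, not per element.
    """
    m, k_group = divmod(idx, k_sf)
    k_elem = k_group * sf_vec_size
    return m, k_group, k_elem
```

Each work item then writes a single SF byte to the offset computed from (m, k_elem).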
The forward-map approach (src -> layout_sf(m, k)) failed because CuTe's
layout operator requires coordinates matching the nested shape rank, and
passing flat (int, int) to a ((32,4),K) shape triggers Mismatched Ranks.
New approach: iterate over CUTLASS dest indices, use idx2crd to get the
hierarchical coordinate, flatten it, then extract logical (m, k_sf) by
interpreting the flattened sub-coordinates correctly:
flat[0..2] = (inner_M, sub_M, tile_M) -> m = tile_M*128 + inner_M*4 + sub_M
flat[3..5] = (inner_K, sub_K, tile_K) -> k_sf = tile_K*4 + sub_K
(inner_K is within one SF group — same byte, so ignored for k_sf)
Previous bug: get<0> and get<1> of flatten gave (inner_M, sub_M) — both
M sub-indices. K information was never extracted, so only k_group=0 worked.
Dest buffer is zero-initialized so padding slots (where m >= MN or
k_sf >= K_sf) stay zero.
Two fixes:
1. CuTe layout uses element-space K, not group-space. k_group=3 with
SFVecSize=16 maps to k_elem=48 in the layout, not k=3.
Added SFVecSize param to remap kernel, multiply k_sf * SFVecSize
before passing to layout_sf().
2. Zero-init CUTLASS dest buffer before remap. The layout pads to
tile boundaries (128x64), so dest is larger than M*K_sf. Unmapped
padding slots left holding garbage cause sporadic wrong results.
Also fixed grid size to use source count (M*K_sf), not dest size.
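A host-side sketch of the two fixes, for reasoning about buffer sizes. It assumes the 128x64 tile is measured in (rows, K elements), that K is a multiple of SFVecSize, and that the destination offset comes from a caller-supplied function; `dst_offset` here is a hypothetical stand-in for layout_sf, not the real CUTLASS layout.

```python
def ceil_to(x, multiple):
    """Round x up to the next multiple."""
    return -(-x // multiple) * multiple


def padded_sf_count(m, k, sf_vec_size=16, tile_m=128, tile_k=64):
    """Number of SF slots in the padded destination buffer (assumed tiling)."""
    return ceil_to(m, tile_m) * (ceil_to(k, tile_k) // sf_vec_size)


def remap_zero_init(src, m, k, dst_offset, sf_vec_size=16):
    """Scatter row-major source SF bytes into a zero-initialized dest."""
    k_sf = k // sf_vec_size
    dst = bytearray(padded_sf_count(m, k, sf_vec_size))  # all zeros
    for idx in range(m * k_sf):  # iterate over the source count
        row, k_group = divmod(idx, k_sf)
        # Fix 1: pass element-space K (k_group * SFVecSize), not k_group.
        dst[dst_offset(row, k_group * sf_vec_size)] = src[idx]
    return dst
```

With M=130 and K=96 the source holds 780 scale factors while the padded destination holds 2048 slots under these assumptions, which is why uninitialized padding matters.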
The remap kernel iterated over CUTLASS linear indices and tried to
reverse-map with idx2crd + flatten. But flatten() on the nested CuTe
coordinate (from tile_to_shape(SfAtom{}, ...)) gives atom-level
sub-indices, not logical (m, k). This caused all K-groups > 0 in SFA
to map to m*K_sf+0, losing K-group information entirely.
Proof: setting SFA[0,0]=2.0 changed row 0, but SFA[0,3]=2.0 produced
zero change. Only K-group 0 was being read.
Fix: iterate over SOURCE indices (row-major m, k) and use the CuTe
layout forward: layout_sf(make_coord(m, k)) -> CUTLASS dst index.
This is the correct forward direction that CuTe handles natively.
Constant-scale test (all SF=1.0) gave cosine=1.0, confirming the FP4
data path is correct. The bug was purely in the SF remap.
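The diagnostic logic can be reproduced with a toy model. This sketch only models the symptom (every K-group reading the scale of K-group 0); it is not the kernel, and all names are hypothetical.

```python
def row_output(block_sums, sf, drop_k_group):
    """Per-row sum of scale * block_sum; optionally emulate the lost K-group."""
    out = []
    for m, sums in enumerate(block_sums):
        acc = 0.0
        for g, s in enumerate(sums):
            scale = sf[m][0] if drop_k_group else sf[m][g]
            acc += scale * s
        out.append(acc)
    return out


def probe(block_sums, m, g, drop_k_group):
    """Return True if setting SF[m][g]=2.0 changes row m versus all-ones."""
    rows, groups = len(block_sums), len(block_sums[0])
    ones = [[1.0] * groups for _ in range(rows)]
    bumped = [row[:] for row in ones]
    bumped[m][g] = 2.0
    base = row_output(block_sums, ones, drop_k_group)[m]
    return row_output(block_sums, bumped, drop_k_group)[m] != base
```

In this model the all-ones case gives identical results with and without the bug, which matches why the constant-scale test passed while the single-element probe at K-group 3 showed no change.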
The logical_widths branch took the global scales of experts 0 and 1 and
applied them to ALL experts. For L1 with logical_widths=[3072,3072],
every expert got expert-0's scale on its gate half and expert-1's
scale on its up half. All other experts' global scales were discarded.
The else branch correctly broadcasts each expert's own (E,1) global
scale across (E, N, K//16). Removed the dead logical_widths code.
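The broadcast in the else branch can be pictured with nested lists. This is a sketch under assumptions: the function name is hypothetical, and combining the global scale with the block scales by multiplication is assumed, since the commit only says the scale is broadcast.

```python
def apply_global_scale(block_scales, global_scales):
    """Broadcast each expert's own (E,1) scale across its (N, K//16) block scales.

    block_scales:  nested list shaped [E][N][K//16]
    global_scales: nested list shaped [E][1]
    Multiplication is an assumption made for this illustration.
    """
    if len(block_scales) != len(global_scales):
        raise ValueError("expert count mismatch")
    return [
        [[s * g[0] for s in row] for row in expert]
        for expert, g in zip(block_scales, global_scales)
    ]
```

The removed branch would instead have applied the scales of experts 0 and 1 to every expert.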
The interleave assumed gate/up were pre-interleaved in groups of 16
and that we needed 2CTA UMMA layout. Both wrong:
1. vLLM w13_weight is plain concat [gate; up] along output dim
2. Our CUTLASS kernel uses ClusterShape 1x1x1, not 2CTA
The interleave was shuffling weights into nonsense, making the L1 GEMM
compute the wrong thing, and chunk(2) then split the wrong halves.
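A toy illustration of why the interleave broke chunk(2). Rows are represented by labels, and both helper names are hypothetical; `chunk2` stands in for splitting the output dimension in half.

```python
def chunk2(rows):
    """Split a list of rows into two equal halves along the output dim."""
    half = len(rows) // 2
    return rows[:half], rows[half:]


def interleave_groups(gate, up, group=16):
    """Interleave gate/up rows in groups, as the removed code assumed."""
    out = []
    for i in range(0, len(gate), group):
        out += gate[i:i + group] + up[i:i + group]
    return out


gate = [f"g{i}" for i in range(32)]
up = [f"u{i}" for i in range(32)]

# Plain concat [gate; up]: the halves come back intact.
plain_gate, plain_up = chunk2(gate + up)

# Interleaved rows: the first half now mixes gate and up rows.
mixed_first, mixed_second = chunk2(interleave_groups(gate, up))
```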
sf_layout.py was a no-op (return sf) but the actual remap happens
in remap_sf_to_cutlass_kernel in cutlass_nvfp4_gemm.cu. Updated
sf_layout.py to pure reference docs so nobody gets confused again.
Three bugs fixed:
1. clamp(0,15) was destroying sign bits — E2M1 is sign-magnitude 4-bit
nibbles, not unsigned. Half the activation was zeroed.
2. Scale stored block_max but divided by block_max/6, so stored scale was
6× too large. Now correctly stores block_max/6 (the actual dequant factor).
3. Uniform 0.5 step doesn't match E2M1 values {0,0.5,1,1.5,2,3,4,6}.
Now snaps to nearest E2M1 representable magnitude.
New _quantize_to_e2m1 helper handles all three correctly:
- Sign-magnitude 4-bit nibble packing (bit3=sign, bits2:0=mag index)
- Correct block scale (block_max / 6.0)
- Nearest-neighbor to actual E2M1 values
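A pure-Python sketch of the three behaviors listed above. It is not the `_quantize_to_e2m1` helper itself: the names are hypothetical, ties go to the lower magnitude (a simplification), byte packing of nibble pairs is omitted, and the scale is kept as a Python float rather than stored as FP8.

```python
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_e2m1(block):
    """Quantize one block to sign-magnitude E2M1 nibbles plus a block scale."""
    block_max = max((abs(x) for x in block), default=0.0)
    scale = block_max / 6.0  # the dequant factor, not block_max itself
    codes = []
    for x in block:
        if scale == 0.0:
            codes.append(0)  # all-zero block
            continue
        target = abs(x) / scale
        # Snap to the nearest representable magnitude (index 0..7).
        mag_index = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - target))
        sign = 1 if x < 0 else 0
        codes.append((sign << 3) | mag_index)  # bit3=sign, bits2:0=mag index
    return codes, scale


def dequantize_block_e2m1(codes, scale):
    """Inverse mapping, useful for round-trip checks."""
    return [
        (-1.0 if c & 0x8 else 1.0) * E2M1_MAGNITUDES[c & 0x7] * scale
        for c in codes
    ]
```

A round trip through both functions is a cheap check that the sign bit survives and that the stored scale is the one dequantization needs.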
Byte 0x3F was becoming float8(63.0) instead of the float8 whose bit
pattern IS 0x3F (1.875 if the dtype is float8_e4m3fn). Pack uses .view()
(correct), unpack used .to() (wrong), so they were not inverses. This
corrupted every activation scale fed to the L1 GEMM while weight scales
were fine.
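The difference between reinterpreting bits and converting a numeric value can be shown without torch. This sketch assumes the scale dtype is float8_e4m3fn (4 exponent bits, bias 7, 3 mantissa bits, no infinities); both function names are hypothetical.

```python
import math


def decode_e4m3fn(byte):
    """Reinterpret a raw byte as float8_e4m3fn (what a .view() unpack yields)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan  # the only NaN encoding in e4m3fn
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def unpack_by_value(byte):
    """Treat the byte as the integer it holds (what a .to() unpack starts from)."""
    return float(byte)
```

Comparing `decode_e4m3fn(0x3F)` with `unpack_by_value(0x3F)` reproduces the mismatch: the two unpack paths agree only for byte 0.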