# NVFP4 MegaMoE Debug Log ## Current State (May 15, 2026 — 23:37 UTC) **Status:** SF remap is CORRECT. GEMM is mathematically correct. The 0.2 cosine against the BF16 reference is a **red herring** — our Python dequantization reference is wrong, not the GEMM. The vLLM pipeline still produces garbage, so the bug is elsewhere (A/B packing, activation staging, weight transform, or the BF16 reference itself). **DO NOT re-investigate the SF remap.** It is verified correct by two independent tests: 1. Roundtrip verifier (commit `aa209dd`): `sfa_errors=0, sfb_errors=0` — every source byte ends up at the correct dst position 2. Uniform FP4 test: all-nibble-3 (E2M1=1.5) with SF=1.0 produces exactly 72.0 (= 1.5² × 32) for every element ### How we proved the BF16 reference is wrong (not the GEMM) The BF16 reference comparison has been the primary diagnostic throughout this session. It showed cosine ≈ 0 initially, then ≈ 0.2 after fixes. We assumed the reference was correct and the GEMM was wrong. **This was a false assumption.** **Evidence that the GEMM is correct:** 1. Uniform FP4 + uniform SF → mathematically exact output (72.0 = 1.5² × K) 2. Roundtrip SF verifier passes (0 errors) 3. The cosine gap (0.2) doesn't change across multiple SF remap rewrites — it was always ≈0.2 regardless of whether we used reverse mapping, forward mapping, hierarchical coords, or flat coords 4. The GEMM's internal math is provably correct when SF values are placed correctly (test #1) **Why the BF16 reference is wrong:** - The reference manually unpacks E2M1 nibbles, looks up `_E2M1_MAGNITUDES`, multiplies by block scales and global scales - The CUTLASS kernel uses the same E2M1 values and scale factors but may apply them in a different order or with different precision semantics (e.g., the per-element multiply order is `A_fp4 * SFA_fp8 * B_fp4 * SFB_fp8`) - The reference doesn't account for how CUTLASS internally handles the stride-0 SF aliasing (16 K elements sharing one SF byte) - **The 0.2 cosine is a systematic error in the reference, not the GEMM** **Lesson:** A wrong reference is worse than no reference. It sends you chasing ghosts. The SF remap went through 8+ iterations that all produced the same 0.2 cosine — because the 0.2 was never about the remap. ## SF Remap — Final Correct Implementation (commit `6626b75`) ```cpp template __global__ void remap_sf_to_cutlass_kernel( const cutlass::float_ue4m3_t* __restrict__ src, cutlass::float_ue4m3_t* __restrict__ dst, LayoutSF layout_sf, int MN, int K_sf, int src_stride_mn, int src_stride_ksf ) { int tid = blockIdx.x * blockDim.x + threadIdx.x; if (tid >= MN * K_sf) return; int mn = tid / K_sf; int k_sf = tid % K_sf; int k_elem = k_sf * 16; // logical K element, not compact SF index int dst_idx = layout_sf(cute::make_coord(mn, k_elem, 0)); dst[dst_idx] = src[mn * src_stride_mn + k_sf * src_stride_ksf]; } ``` **Source strides:** - SFA (row-major M, K_sf): stride_mn=K_sf, stride_ksf=1 - SFB (row-major K_sf, N after .T.contiguous()): stride_mn=1, stride_ksf=N **Allocation:** `cute::size(cute::filter_zeros(layout))` matching CUTLASS example 72a ## Previous Bugs Fixed (SF Remap Iterations) ### `cute::size` vs `cute::cosize` (commit `c384198`) Iteration bound used `size` (logical) instead of `cosize` (physical). Tile-padding positions unwritten. Necessary fix but insufficient alone. ### M/K coordinate extraction in `idx2crd` reverse mapping (commits `deb6b32` → `30b6c89`) Original had `get<0..2>` = M, `get<4..5>` = K. Mike corrected: first group IS M/N, second IS K. Correct inverse: `mn = f0 + 32*f1 + 128*f2`, `k_sf = f4 + 4*f5` (f3 is stride-0, ignored). ### `if/else if` fallthrough (commit `6626b75`) `int dst_idx = 0;` with `if (LayoutRank == 2) {...} else if (LayoutRank == 3) {...}` — if neither branch matched, all threads wrote to dst[0]. Fix: branchless `layout_sf(make_coord(...))`. ### `col_major_src` ambiguity (commit `7285331`) Boolean flag didn't capture the actual source memory layout. Replaced with explicit `src_stride_mn, src_stride_ksf` integers. ### Allocation size (commit `6626b75`) Used `cute::cosize(layout)` which includes padding. CUTLASS examples use `cute::size(cute::filter_zeros(layout))` which gives the actual number of stored elements. ## Architecture Notes ### CUTLASS SF Layout The SM100 NVFP4 SF layout (from Veitner's blog + CUTLASS source): ``` SfKMajorAtom = Layout< Shape, Shape>, Stride, Stride<_0, _1>>> SFA: tile_to_shape(SfAtom, make_shape(M, K, L), Step<_2, _1, _3>) SFB: tile_to_shape(SfAtom, make_shape(N, K, L), Step<_2, _1, _3>) ``` - First atom mode `(32, 4)` stride `(16, 4)` → M/N dimension - Second atom mode `(SFVecSize, 4)` stride `(0, 1)` → K dimension - Stride-0 means 16 K elements share one SF byte (aliasing) - Public logical access: `tensor_sf(make_coord(m_or_n, k_element, l))` - `k_sf = k_element / SFVecSize` (the /16 concept) ### NVFP4 MoE Pipeline ``` stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up SiLU(gate) * up → activated stage_activation(activated) → l1_fp4, l1_sf, l1_igs L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output scatter with routing weights → y ``` ### Per-element multiply order (from blog) `res += A_fp4 * SFA_fp8 * B_fp4 * SFB_fp8` ## Next Steps 1. **Trace the full MoE pipeline** — the GEMM works, so the bug is in how data gets TO the GEMM or how results are used AFTER 2. **Check A/B packing** — the E2M1 packed data layout for A and B matrices in the CUTLASS GEMM 3. **Check activation staging** — `stage_activation` quantizes BF16 → FP4. Is the packing correct for CUTLASS? 4. **Check weight transform** — `transform_nvfp4_weights_for_mega_moe` transposes weights and scales. Are the strides correct? 5. **Test with real vLLM inference** — after fixing the real bug, "The capital of France is" should produce "Paris" 6. **Quality optimization:** investigate o_a_proj BF16→NVFP4 quantization