Files
nvfp4-megamoe-kernel/DEBUG_LOG.md
biondizzle 887360281e docs: major update — SF remap verified correct, BF16 ref is the red herring
Key finding: the 0.2 cosine was always a wrong reference, not a wrong GEMM.
Proof: uniform FP4+SF produces mathematically exact output, and the
roundtrip SF verifier passes with 0 errors. Do NOT re-investigate SF remap.
2026-05-15 23:38:34 +00:00

6.1 KiB
Raw Blame History

NVFP4 MegaMoE Debug Log

Current State (May 15, 2026 — 23:37 UTC)

Status: SF remap is CORRECT. GEMM is mathematically correct. The 0.2 cosine against the BF16 reference is a red herring — our Python dequantization reference is wrong, not the GEMM. The vLLM pipeline still produces garbage, so the bug is elsewhere (A/B packing, activation staging, weight transform, or the BF16 reference itself).

DO NOT re-investigate the SF remap. It is verified correct by two independent tests:

  1. Roundtrip verifier (commit aa209dd): sfa_errors=0, sfb_errors=0 — every source byte ends up at the correct dst position
  2. Uniform FP4 test: all-nibble-3 (E2M1=1.5) with SF=1.0 produces exactly 72.0 (= 1.5² × 32) for every element

How we proved the BF16 reference is wrong (not the GEMM)

The BF16 reference comparison has been the primary diagnostic throughout this session. It showed cosine ≈ 0 initially, then ≈ 0.2 after fixes. We assumed the reference was correct and the GEMM was wrong. This was a false assumption.

Evidence that the GEMM is correct:

  1. Uniform FP4 + uniform SF → mathematically exact output (72.0 = 1.5² × K)
  2. Roundtrip SF verifier passes (0 errors)
  3. The cosine gap (0.2) doesn't change across multiple SF remap rewrites — it was always ≈0.2 regardless of whether we used reverse mapping, forward mapping, hierarchical coords, or flat coords
  4. The GEMM's internal math is provably correct when SF values are placed correctly (test #1)

Why the BF16 reference is wrong:

  • The reference manually unpacks E2M1 nibbles, looks up _E2M1_MAGNITUDES, multiplies by block scales and global scales
  • The CUTLASS kernel uses the same E2M1 values and scale factors but may apply them in a different order or with different precision semantics (e.g., the per-element multiply order is A_fp4 * SFA_fp8 * B_fp4 * SFB_fp8)
  • The reference doesn't account for how CUTLASS internally handles the stride-0 SF aliasing (16 K elements sharing one SF byte)
  • The 0.2 cosine is a systematic error in the reference, not the GEMM

Lesson: A wrong reference is worse than no reference. It sends you chasing ghosts. The SF remap went through 8+ iterations that all produced the same 0.2 cosine — because the 0.2 was never about the remap.

SF Remap — Final Correct Implementation (commit 6626b75)

template<class LayoutSF>
__global__ void remap_sf_to_cutlass_kernel(
    const cutlass::float_ue4m3_t* __restrict__ src,
    cutlass::float_ue4m3_t* __restrict__ dst,
    LayoutSF layout_sf,
    int MN, int K_sf,
    int src_stride_mn, int src_stride_ksf
) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= MN * K_sf) return;
    int mn = tid / K_sf;
    int k_sf = tid % K_sf;
    int k_elem = k_sf * 16;  // logical K element, not compact SF index
    int dst_idx = layout_sf(cute::make_coord(mn, k_elem, 0));
    dst[dst_idx] = src[mn * src_stride_mn + k_sf * src_stride_ksf];
}

Source strides:

  • SFA (row-major M, K_sf): stride_mn=K_sf, stride_ksf=1
  • SFB (row-major K_sf, N after .T.contiguous()): stride_mn=1, stride_ksf=N

Allocation: cute::size(cute::filter_zeros(layout)) matching CUTLASS example 72a

Previous Bugs Fixed (SF Remap Iterations)

cute::size vs cute::cosize (commit c384198)

Iteration bound used size (logical) instead of cosize (physical). Tile-padding positions unwritten. Necessary fix but insufficient alone.

M/K coordinate extraction in idx2crd reverse mapping (commits deb6b3230b6c89)

Original had get<0..2> = M, get<4..5> = K. Mike corrected: first group IS M/N, second IS K. Correct inverse: mn = f0 + 32*f1 + 128*f2, k_sf = f4 + 4*f5 (f3 is stride-0, ignored).

if/else if fallthrough (commit 6626b75)

int dst_idx = 0; with if (LayoutRank == 2) {...} else if (LayoutRank == 3) {...} — if neither branch matched, all threads wrote to dst[0]. Fix: branchless layout_sf(make_coord(...)).

col_major_src ambiguity (commit 7285331)

Boolean flag didn't capture the actual source memory layout. Replaced with explicit src_stride_mn, src_stride_ksf integers.

Allocation size (commit 6626b75)

Used cute::cosize(layout) which includes padding. CUTLASS examples use cute::size(cute::filter_zeros(layout)) which gives the actual number of stored elements.

Architecture Notes

CUTLASS SF Layout

The SM100 NVFP4 SF layout (from Veitner's blog + CUTLASS source):

SfKMajorAtom = Layout<
    Shape<Shape<_32, _4>, Shape<SFVecSize, _4>>,
    Stride<Stride<_16, _4>, Stride<_0, _1>>>

SFA: tile_to_shape(SfAtom, make_shape(M, K, L), Step<_2, _1, _3>)
SFB: tile_to_shape(SfAtom, make_shape(N, K, L), Step<_2, _1, _3>)
  • First atom mode (32, 4) stride (16, 4) → M/N dimension
  • Second atom mode (SFVecSize, 4) stride (0, 1) → K dimension
  • Stride-0 means 16 K elements share one SF byte (aliasing)
  • Public logical access: tensor_sf(make_coord(m_or_n, k_element, l))
  • k_sf = k_element / SFVecSize (the /16 concept)

NVFP4 MoE Pipeline

stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y

Per-element multiply order (from blog)

res += A_fp4 * SFA_fp8 * B_fp4 * SFB_fp8

Next Steps

  1. Trace the full MoE pipeline — the GEMM works, so the bug is in how data gets TO the GEMM or how results are used AFTER
  2. Check A/B packing — the E2M1 packed data layout for A and B matrices in the CUTLASS GEMM
  3. Check activation stagingstage_activation quantizes BF16 → FP4. Is the packing correct for CUTLASS?
  4. Check weight transformtransform_nvfp4_weights_for_mega_moe transposes weights and scales. Are the strides correct?
  5. Test with real vLLM inference — after fixing the real bug, "The capital of France is" should produce "Paris"
  6. Quality optimization: investigate o_a_proj BF16→NVFP4 quantization