Files
nvfp4-megamoe-kernel/DEBUG_LOG.md

7.4 KiB

NVFP4 MegaMoE Debug Log

Current State (May 15, 2026 — 23:00 UTC)

Status: SF remap rewritten as forward mapping with flat logical coordinates. Roundtrip verifier added. Awaiting rebuild and verification.

Key evolution of the SF remap fix:

  1. Original bug: idx2crd reverse mapping with wrong coordinate extraction (M/K swapped, stride-0 dimension mishandled)
  2. First fix attempt: Swap M/K groups — still cosine ≈ 0.2
  3. Second fix attempt: Forward mapping with layout_sf(make_coord(mn, k_sf * 16, 0)) — flat logical coordinates
  4. Critical bug found: if constexpr (LayoutRank == 2) { ... } else if (LayoutRank == 3) { ... } with int dst_idx = 0; fallthrough — when LayoutRank didn't match 2 or 3, all threads wrote to dst[0], leaving the rest zero. This broke even uniform SF.
  5. Current approach: Branchless layout_sf(make_coord(mn, k_sf * 16, 0)), explicit source strides, filter_zeros allocation, roundtrip verifier

What the forward mapping does

int mn = tid / K_sf;       // logical M/N index
int k_sf = tid % K_sf;     // compact SF group index
int k_elem = k_sf * 16;    // logical K element coordinate
int dst_idx = layout_sf(cute::make_coord(mn, k_elem, 0));
int src_idx = mn * src_stride_mn + k_sf * src_stride_ksf;
dst[dst_idx] = src[src_idx];
  • CuTe's layout_sf() accepts flat logical coordinates and internally maps them to the hierarchical natural coordinate before applying strides
  • k_sf * 16 converts from compact SF group index to logical K element index (the layout expects K element coords, not SF group coords)
  • make_coord(mn, k_elem, 0) is a 3-tuple matching the SFA/SFB layout shape (M/N, K, L)

Source strides

Tensor Physical layout src_stride_mn src_stride_ksf
SFA row-major (M, K_sf) K_sf 1
SFB row-major (K_sf, N) after .T.contiguous() 1 N

Allocation

Uses cute::size(cute::filter_zeros(layout)) matching CUTLASS example 72a, with cudaMemsetAsync zero-init before remap.

Roundtrip verifier (commit aa209dd)

Added check_sf_forward_kernel that compares src[src_idx] against dst[dst_idx] byte-by-byte. If this passes, the remap is correct. If it fails, the forward mapping is wrong. This is the key diagnostic — it separates remap correctness from other issues (A/B packing, FP8 type interpretation, stale code).

How We Got Here

Phase 1: Pipeline debugging

Added prints at every MoE stage. All magnitudes reasonable, no NaN. Signal present but buried. BF16 reference showed cosine ≈ 0 — the GEMM was wrong.

Phase 2: Isolating the GEMM

Standalone tests proved: all-ones → cosine 1.0, random → cosine ≈ 0. The SF remap was the culprit.

Phase 3: Blog post analysis

Veitner's blog (veitner.bearblog.dev) explained the CUTLASS SF layout construction in detail:

  • K-major atom: Shape<Shape<_32,_4>, Shape<SFVecSize,_4>>, Stride<Stride<_16,_4>, Stride<_0,_1>>
  • SFA tiled as make_shape(M, K, L), SFB as make_shape(N, K, L) with Step<_2,_1,_3>
  • First atom mode is M/N, second is K
  • f3 (stride-0 inner K) is within one SF group — always maps to offset 0
  • k_sf = logical_k / 16 (the /16 concept)
  • Public logical access: tensor_sfa(make_coord(m, k, 0))

Phase 4: Multiple failed coordinate extraction attempts

  • Original: get<0..2> = M, get<4..5> = K (wrong)
  • Swapped: get<0..2> = K, get<3..5> = M (still wrong — stride-0 ambiguity)
  • Mike's correction: mn = f0 + 32*f1 + 128*f2, k_sf = f4 + 4*f5 (correct inverse, but still ≈0.2 cosine — source stride issue)

Phase 5: Forward mapping

Switched from reverse (idx2crd over dst) to forward (iterate over src, compute dst via layout_sf()). Multiple bugs:

  • Hierarchical coordinate nesting was wrong (missing L component, wrong depth)
  • Flat make_coord(mn, k*16) didn't auto-decompose (contrary to expectation)
  • if/else if fallthrough wrote dst[0] for unmatched ranks, breaking uniform SF
  • Source strides: col_major_src bool was ambiguous — replaced with explicit src_stride_mn, src_stride_ksf
  • Allocation: cosize replaced with size(filter_zeros()) to match CUTLASS examples

Phase 6: Current approach (commit aa209dd)

Branchless forward mapping with flat logical coordinates, explicit strides, filter_zeros allocation, roundtrip verifier. This should be the correct implementation — the verifier will confirm.

Key Commits (recent)

Commit Description
c384198 Fix: SF remap uses cute::cosize() instead of cute::size()
deb6b32 Swap M/K in SF remap + add printf diagnostics
a09b9b5 Remove printf and diag function (build fix)
30b6c89 Correct k_sf = f4 + 4*f5 (Mike's patch)
63e67e1 Rewrite as forward mapping (source→dst)
f6fd549 Restore col_major_src handling
7285331 Replace col_major_src with explicit source strides
6fc8fa6 Use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0))
6626b75 Use filter_zeros for allocation + no-branch forward mapping
aa209dd Add SF remap roundtrip verifier

Architecture Notes

DeepSeek-V4 MoE Layer Forward Pass

residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x)                          ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)

residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x)                           ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)

NVFP4 MoE Pipeline

stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y

CUTLASS SF Layout (from blog + code review)

The SM100 NVFP4 SF layout is constructed from a K-major atom:

SfKMajorAtom = Layout<
    Shape<Shape<_32, _4>, Shape<SFVecSize, _4>>,
    Stride<Stride<_16, _4>, Stride<_0, _1>>>
  • First mode (32, 4) with stride (16, 4) → covers M/N dimension
  • Second mode (SFVecSize, 4) with stride (0, 1) → covers K dimension
  • stride=0 means all 16 values in an SF group map to the same byte
  • SFA tiled as make_shape(M, K, L) with Step<_2, _1, _3>
  • SFB tiled as make_shape(N, K, L) with Step<_2, _1, _3>
  • Public logical access: tensor_sf(make_coord(m_or_n, k_element, l))

Checkpoint Layers (layer 0)

  • MoE experts 0-210, 212-255: gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
  • Expert 211: shared expert, gate_proj + up_proj only (no down_proj)
  • o_a_proj.weight: BF16 (16384, 4096) — NOT quantized by ModelOpt
  • o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor: NVFP4
  • Gate weight, norms, sinks, position_bias: BF16

Next Steps

  1. Check SF-VERIFY output — if sfa_errors=0 and sfb_errors=0, remap is correct
  2. If remap correct but cosine still low: issue is A/B packing, FP8 type interpretation, or stale code
  3. If remap has errors: debug the forward mapping further
  4. Run deterministic prompt — "The capital of France is" should produce "Paris"
  5. Once working: clean up debug prints, remove verifier
  6. Quality optimization: investigate o_a_proj BF16→NVFP4 quantization