Files

biondizzle 4f857d5f99 docs: major DEBUG_LOG update — forward mapping, verifier, full debug timeline

2026-05-15 23:02:30 +00:00

7.4 KiB

Raw Blame History

NVFP4 MegaMoE Debug Log

Current State (May 15, 2026 — 23:00 UTC)

Status: SF remap rewritten as forward mapping with flat logical coordinates. Roundtrip verifier added. Awaiting rebuild and verification.

Key evolution of the SF remap fix:

Original bug: idx2crd reverse mapping with wrong coordinate extraction (M/K swapped, stride-0 dimension mishandled)
First fix attempt: Swap M/K groups — still cosine ≈ 0.2
Second fix attempt: Forward mapping with layout_sf(make_coord(mn, k_sf * 16, 0)) — flat logical coordinates
Critical bug found: if constexpr (LayoutRank == 2) { ... } else if (LayoutRank == 3) { ... } with int dst_idx = 0; fallthrough — when LayoutRank didn't match 2 or 3, all threads wrote to dst[0], leaving the rest zero. This broke even uniform SF.
Current approach: Branchless layout_sf(make_coord(mn, k_sf * 16, 0)), explicit source strides, filter_zeros allocation, roundtrip verifier

What the forward mapping does

int mn = tid / K_sf;       // logical M/N index
int k_sf = tid % K_sf;     // compact SF group index
int k_elem = k_sf * 16;    // logical K element coordinate
int dst_idx = layout_sf(cute::make_coord(mn, k_elem, 0));
int src_idx = mn * src_stride_mn + k_sf * src_stride_ksf;
dst[dst_idx] = src[src_idx];

CuTe's layout_sf() accepts flat logical coordinates and internally maps them to the hierarchical natural coordinate before applying strides
k_sf * 16 converts from compact SF group index to logical K element index (the layout expects K element coords, not SF group coords)
make_coord(mn, k_elem, 0) is a 3-tuple matching the SFA/SFB layout shape (M/N, K, L)

Source strides

Tensor	Physical layout	src_stride_mn	src_stride_ksf
SFA	row-major (M, K_sf)	K_sf	1
SFB	row-major (K_sf, N) after `.T.contiguous()`	1	N

Allocation

Uses cute::size(cute::filter_zeros(layout)) matching CUTLASS example 72a, with cudaMemsetAsync zero-init before remap.

Roundtrip verifier (commit `aa209dd`)

Added check_sf_forward_kernel that compares src[src_idx] against dst[dst_idx] byte-by-byte. If this passes, the remap is correct. If it fails, the forward mapping is wrong. This is the key diagnostic — it separates remap correctness from other issues (A/B packing, FP8 type interpretation, stale code).

How We Got Here

Phase 1: Pipeline debugging

Added prints at every MoE stage. All magnitudes reasonable, no NaN. Signal present but buried. BF16 reference showed cosine ≈ 0 — the GEMM was wrong.

Phase 2: Isolating the GEMM

Standalone tests proved: all-ones → cosine 1.0, random → cosine ≈ 0. The SF remap was the culprit.

Phase 3: Blog post analysis

Veitner's blog (veitner.bearblog.dev) explained the CUTLASS SF layout construction in detail:

K-major atom: Shape<Shape<_32,_4>, Shape<SFVecSize,_4>>, Stride<Stride<_16,_4>, Stride<_0,_1>>
SFA tiled as make_shape(M, K, L), SFB as make_shape(N, K, L) with Step<_2,_1,_3>
First atom mode is M/N, second is K
f3 (stride-0 inner K) is within one SF group — always maps to offset 0
k_sf = logical_k / 16 (the /16 concept)
Public logical access: tensor_sfa(make_coord(m, k, 0))

Phase 4: Multiple failed coordinate extraction attempts

Original: get<0..2> = M, get<4..5> = K (wrong)
Swapped: get<0..2> = K, get<3..5> = M (still wrong — stride-0 ambiguity)
Mike's correction: mn = f0 + 32*f1 + 128*f2, k_sf = f4 + 4*f5 (correct inverse, but still ≈0.2 cosine — source stride issue)

Phase 5: Forward mapping

Switched from reverse (idx2crd over dst) to forward (iterate over src, compute dst via layout_sf()). Multiple bugs:

Hierarchical coordinate nesting was wrong (missing L component, wrong depth)
Flat make_coord(mn, k*16) didn't auto-decompose (contrary to expectation)
if/else if fallthrough wrote dst[0] for unmatched ranks, breaking uniform SF
Source strides: col_major_src bool was ambiguous — replaced with explicit src_stride_mn, src_stride_ksf
Allocation: cosize replaced with size(filter_zeros()) to match CUTLASS examples

Phase 6: Current approach (commit `aa209dd`)

Branchless forward mapping with flat logical coordinates, explicit strides, filter_zeros allocation, roundtrip verifier. This should be the correct implementation — the verifier will confirm.

Key Commits (recent)

Commit	Description
`c384198`	Fix: SF remap uses cute::cosize() instead of cute::size()
`deb6b32`	Swap M/K in SF remap + add printf diagnostics
`a09b9b5`	Remove printf and diag function (build fix)
`30b6c89`	Correct k_sf = f4 + 4*f5 (Mike's patch)
`63e67e1`	Rewrite as forward mapping (source→dst)
`f6fd549`	Restore col_major_src handling
`7285331`	Replace col_major_src with explicit source strides
`6fc8fa6`	Use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0))
`6626b75`	Use filter_zeros for allocation + no-branch forward mapping
`aa209dd`	Add SF remap roundtrip verifier

Architecture Notes

DeepSeek-V4 MoE Layer Forward Pass

residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x)                          ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)

residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x)                           ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)

NVFP4 MoE Pipeline

stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y

CUTLASS SF Layout (from blog + code review)

The SM100 NVFP4 SF layout is constructed from a K-major atom:

SfKMajorAtom = Layout<
    Shape<Shape<_32, _4>, Shape<SFVecSize, _4>>,
    Stride<Stride<_16, _4>, Stride<_0, _1>>>

First mode (32, 4) with stride (16, 4) → covers M/N dimension
Second mode (SFVecSize, 4) with stride (0, 1) → covers K dimension
stride=0 means all 16 values in an SF group map to the same byte
SFA tiled as make_shape(M, K, L) with Step<_2, _1, _3>
SFB tiled as make_shape(N, K, L) with Step<_2, _1, _3>
Public logical access: tensor_sf(make_coord(m_or_n, k_element, l))

Checkpoint Layers (layer 0)

MoE experts 0-210, 212-255: gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
Expert 211: shared expert, gate_proj + up_proj only (no down_proj)
o_a_proj.weight: BF16 (16384, 4096) — NOT quantized by ModelOpt
o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor: NVFP4
Gate weight, norms, sinks, position_bias: BF16

Next Steps

Check SF-VERIFY output — if sfa_errors=0 and sfb_errors=0, remap is correct
If remap correct but cosine still low: issue is A/B packing, FP8 type interpretation, or stale code
If remap has errors: debug the forward mapping further
Run deterministic prompt — "The capital of France is" should produce "Paris"
Once working: clean up debug prints, remove verifier
Quality optimization: investigate o_a_proj BF16→NVFP4 quantization

7.4 KiB Raw Blame History