7.4 KiB
NVFP4 MegaMoE Debug Log
Current State (May 15, 2026 — 23:00 UTC)
Status: SF remap rewritten as forward mapping with flat logical coordinates. Roundtrip verifier added. Awaiting rebuild and verification.
Key evolution of the SF remap fix:
- Original bug:
idx2crdreverse mapping with wrong coordinate extraction (M/K swapped, stride-0 dimension mishandled) - First fix attempt: Swap M/K groups — still cosine ≈ 0.2
- Second fix attempt: Forward mapping with
layout_sf(make_coord(mn, k_sf * 16, 0))— flat logical coordinates - Critical bug found:
if constexpr (LayoutRank == 2) { ... } else if (LayoutRank == 3) { ... }withint dst_idx = 0;fallthrough — when LayoutRank didn't match 2 or 3, all threads wrote to dst[0], leaving the rest zero. This broke even uniform SF. - Current approach: Branchless
layout_sf(make_coord(mn, k_sf * 16, 0)), explicit source strides,filter_zerosallocation, roundtrip verifier
What the forward mapping does
int mn = tid / K_sf; // logical M/N index
int k_sf = tid % K_sf; // compact SF group index
int k_elem = k_sf * 16; // logical K element coordinate
int dst_idx = layout_sf(cute::make_coord(mn, k_elem, 0));
int src_idx = mn * src_stride_mn + k_sf * src_stride_ksf;
dst[dst_idx] = src[src_idx];
- CuTe's
layout_sf()accepts flat logical coordinates and internally maps them to the hierarchical natural coordinate before applying strides k_sf * 16converts from compact SF group index to logical K element index (the layout expects K element coords, not SF group coords)make_coord(mn, k_elem, 0)is a 3-tuple matching the SFA/SFB layout shape(M/N, K, L)
Source strides
| Tensor | Physical layout | src_stride_mn | src_stride_ksf |
|---|---|---|---|
| SFA | row-major (M, K_sf) | K_sf | 1 |
| SFB | row-major (K_sf, N) after .T.contiguous() |
1 | N |
Allocation
Uses cute::size(cute::filter_zeros(layout)) matching CUTLASS example 72a, with cudaMemsetAsync zero-init before remap.
Roundtrip verifier (commit aa209dd)
Added check_sf_forward_kernel that compares src[src_idx] against dst[dst_idx] byte-by-byte. If this passes, the remap is correct. If it fails, the forward mapping is wrong. This is the key diagnostic — it separates remap correctness from other issues (A/B packing, FP8 type interpretation, stale code).
How We Got Here
Phase 1: Pipeline debugging
Added prints at every MoE stage. All magnitudes reasonable, no NaN. Signal present but buried. BF16 reference showed cosine ≈ 0 — the GEMM was wrong.
Phase 2: Isolating the GEMM
Standalone tests proved: all-ones → cosine 1.0, random → cosine ≈ 0. The SF remap was the culprit.
Phase 3: Blog post analysis
Veitner's blog (veitner.bearblog.dev) explained the CUTLASS SF layout construction in detail:
- K-major atom:
Shape<Shape<_32,_4>, Shape<SFVecSize,_4>>,Stride<Stride<_16,_4>, Stride<_0,_1>> - SFA tiled as
make_shape(M, K, L), SFB asmake_shape(N, K, L)withStep<_2,_1,_3> - First atom mode is M/N, second is K
f3(stride-0 inner K) is within one SF group — always maps to offset 0k_sf = logical_k / 16(the /16 concept)- Public logical access:
tensor_sfa(make_coord(m, k, 0))
Phase 4: Multiple failed coordinate extraction attempts
- Original:
get<0..2>= M,get<4..5>= K (wrong) - Swapped:
get<0..2>= K,get<3..5>= M (still wrong — stride-0 ambiguity) - Mike's correction:
mn = f0 + 32*f1 + 128*f2,k_sf = f4 + 4*f5(correct inverse, but still ≈0.2 cosine — source stride issue)
Phase 5: Forward mapping
Switched from reverse (idx2crd over dst) to forward (iterate over src, compute dst via layout_sf()). Multiple bugs:
- Hierarchical coordinate nesting was wrong (missing L component, wrong depth)
- Flat
make_coord(mn, k*16)didn't auto-decompose (contrary to expectation) if/else iffallthrough wrote dst[0] for unmatched ranks, breaking uniform SF- Source strides:
col_major_srcbool was ambiguous — replaced with explicitsrc_stride_mn, src_stride_ksf - Allocation:
cosizereplaced withsize(filter_zeros())to match CUTLASS examples
Phase 6: Current approach (commit aa209dd)
Branchless forward mapping with flat logical coordinates, explicit strides, filter_zeros allocation, roundtrip verifier. This should be the correct implementation — the verifier will confirm.
Key Commits (recent)
| Commit | Description |
|---|---|
c384198 |
Fix: SF remap uses cute::cosize() instead of cute::size() |
deb6b32 |
Swap M/K in SF remap + add printf diagnostics |
a09b9b5 |
Remove printf and diag function (build fix) |
30b6c89 |
Correct k_sf = f4 + 4*f5 (Mike's patch) |
63e67e1 |
Rewrite as forward mapping (source→dst) |
f6fd549 |
Restore col_major_src handling |
7285331 |
Replace col_major_src with explicit source strides |
6fc8fa6 |
Use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0)) |
6626b75 |
Use filter_zeros for allocation + no-branch forward mapping |
aa209dd |
Add SF remap roundtrip verifier |
Architecture Notes
DeepSeek-V4 MoE Layer Forward Pass
residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x) ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)
residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x) ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)
NVFP4 MoE Pipeline
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
CUTLASS SF Layout (from blog + code review)
The SM100 NVFP4 SF layout is constructed from a K-major atom:
SfKMajorAtom = Layout<
Shape<Shape<_32, _4>, Shape<SFVecSize, _4>>,
Stride<Stride<_16, _4>, Stride<_0, _1>>>
- First mode
(32, 4)with stride(16, 4)→ covers M/N dimension - Second mode
(SFVecSize, 4)with stride(0, 1)→ covers K dimension stride=0means all 16 values in an SF group map to the same byte- SFA tiled as
make_shape(M, K, L)withStep<_2, _1, _3> - SFB tiled as
make_shape(N, K, L)withStep<_2, _1, _3> - Public logical access:
tensor_sf(make_coord(m_or_n, k_element, l))
Checkpoint Layers (layer 0)
- MoE experts 0-210, 212-255: gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
- Expert 211: shared expert, gate_proj + up_proj only (no down_proj)
- o_a_proj.weight: BF16 (16384, 4096) — NOT quantized by ModelOpt
- o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor: NVFP4
- Gate weight, norms, sinks, position_bias: BF16
Next Steps
- Check SF-VERIFY output — if sfa_errors=0 and sfb_errors=0, remap is correct
- If remap correct but cosine still low: issue is A/B packing, FP8 type interpretation, or stale code
- If remap has errors: debug the forward mapping further
- Run deterministic prompt — "The capital of France is" should produce "Paris"
- Once working: clean up debug prints, remove verifier
- Quality optimization: investigate o_a_proj BF16→NVFP4 quantization