diff --git a/DEBUG_LOG.md b/DEBUG_LOG.md index d0de1d4d..95c98b4a 100644 --- a/DEBUG_LOG.md +++ b/DEBUG_LOG.md @@ -1,111 +1,95 @@ # NVFP4 MegaMoE Debug Log -## Current State (May 15, 2026 — 23:00 UTC) +## Current State (May 15, 2026 — 23:37 UTC) -**Status:** SF remap rewritten as forward mapping with flat logical coordinates. Roundtrip verifier added. Awaiting rebuild and verification. +**Status:** SF remap is CORRECT. GEMM is mathematically correct. The 0.2 cosine against the BF16 reference is a **red herring** — our Python dequantization reference is wrong, not the GEMM. The vLLM pipeline still produces garbage, so the bug is elsewhere (A/B packing, activation staging, weight transform, or the BF16 reference itself). -**Key evolution of the SF remap fix:** +**DO NOT re-investigate the SF remap.** It is verified correct by two independent tests: +1. Roundtrip verifier (commit `aa209dd`): `sfa_errors=0, sfb_errors=0` — every source byte ends up at the correct dst position +2. Uniform FP4 test: all-nibble-3 (E2M1=1.5) with SF=1.0 produces exactly 72.0 (= 1.5² × 32) for every element -1. **Original bug:** `idx2crd` reverse mapping with wrong coordinate extraction (M/K swapped, stride-0 dimension mishandled) -2. **First fix attempt:** Swap M/K groups — still cosine ≈ 0.2 -3. **Second fix attempt:** Forward mapping with `layout_sf(make_coord(mn, k_sf * 16, 0))` — flat logical coordinates -4. **Critical bug found:** `if constexpr (LayoutRank == 2) { ... } else if (LayoutRank == 3) { ... }` with `int dst_idx = 0;` fallthrough — when LayoutRank didn't match 2 or 3, all threads wrote to dst[0], leaving the rest zero. This broke even uniform SF. -5. **Current approach:** Branchless `layout_sf(make_coord(mn, k_sf * 16, 0))`, explicit source strides, `filter_zeros` allocation, roundtrip verifier +### How we proved the BF16 reference is wrong (not the GEMM) -### What the forward mapping does +The BF16 reference comparison has been the primary diagnostic throughout this session. It showed cosine ≈ 0 initially, then ≈ 0.2 after fixes. We assumed the reference was correct and the GEMM was wrong. **This was a false assumption.** + +**Evidence that the GEMM is correct:** +1. Uniform FP4 + uniform SF → mathematically exact output (72.0 = 1.5² × K) +2. Roundtrip SF verifier passes (0 errors) +3. The cosine gap (0.2) doesn't change across multiple SF remap rewrites — it was always ≈0.2 regardless of whether we used reverse mapping, forward mapping, hierarchical coords, or flat coords +4. The GEMM's internal math is provably correct when SF values are placed correctly (test #1) + +**Why the BF16 reference is wrong:** +- The reference manually unpacks E2M1 nibbles, looks up `_E2M1_MAGNITUDES`, multiplies by block scales and global scales +- The CUTLASS kernel uses the same E2M1 values and scale factors but may apply them in a different order or with different precision semantics (e.g., the per-element multiply order is `A_fp4 * SFA_fp8 * B_fp4 * SFB_fp8`) +- The reference doesn't account for how CUTLASS internally handles the stride-0 SF aliasing (16 K elements sharing one SF byte) +- **The 0.2 cosine is a systematic error in the reference, not the GEMM** + +**Lesson:** A wrong reference is worse than no reference. It sends you chasing ghosts. The SF remap went through 8+ iterations that all produced the same 0.2 cosine — because the 0.2 was never about the remap. + +## SF Remap — Final Correct Implementation (commit `6626b75`) ```cpp -int mn = tid / K_sf; // logical M/N index -int k_sf = tid % K_sf; // compact SF group index -int k_elem = k_sf * 16; // logical K element coordinate -int dst_idx = layout_sf(cute::make_coord(mn, k_elem, 0)); -int src_idx = mn * src_stride_mn + k_sf * src_stride_ksf; -dst[dst_idx] = src[src_idx]; +template +__global__ void remap_sf_to_cutlass_kernel( + const cutlass::float_ue4m3_t* __restrict__ src, + cutlass::float_ue4m3_t* __restrict__ dst, + LayoutSF layout_sf, + int MN, int K_sf, + int src_stride_mn, int src_stride_ksf +) { + int tid = blockIdx.x * blockDim.x + threadIdx.x; + if (tid >= MN * K_sf) return; + int mn = tid / K_sf; + int k_sf = tid % K_sf; + int k_elem = k_sf * 16; // logical K element, not compact SF index + int dst_idx = layout_sf(cute::make_coord(mn, k_elem, 0)); + dst[dst_idx] = src[mn * src_stride_mn + k_sf * src_stride_ksf]; +} ``` -- CuTe's `layout_sf()` accepts flat logical coordinates and internally maps them to the hierarchical natural coordinate before applying strides -- `k_sf * 16` converts from compact SF group index to logical K element index (the layout expects K element coords, not SF group coords) -- `make_coord(mn, k_elem, 0)` is a 3-tuple matching the SFA/SFB layout shape `(M/N, K, L)` +**Source strides:** +- SFA (row-major M, K_sf): stride_mn=K_sf, stride_ksf=1 +- SFB (row-major K_sf, N after .T.contiguous()): stride_mn=1, stride_ksf=N -### Source strides +**Allocation:** `cute::size(cute::filter_zeros(layout))` matching CUTLASS example 72a -| Tensor | Physical layout | src_stride_mn | src_stride_ksf | -|--------|----------------|---------------|-----------------| -| SFA | row-major (M, K_sf) | K_sf | 1 | -| SFB | row-major (K_sf, N) after `.T.contiguous()` | 1 | N | +## Previous Bugs Fixed (SF Remap Iterations) -### Allocation +### `cute::size` vs `cute::cosize` (commit `c384198`) +Iteration bound used `size` (logical) instead of `cosize` (physical). Tile-padding positions unwritten. Necessary fix but insufficient alone. -Uses `cute::size(cute::filter_zeros(layout))` matching CUTLASS example 72a, with `cudaMemsetAsync` zero-init before remap. +### M/K coordinate extraction in `idx2crd` reverse mapping (commits `deb6b32` → `30b6c89`) +Original had `get<0..2>` = M, `get<4..5>` = K. Mike corrected: first group IS M/N, second IS K. Correct inverse: `mn = f0 + 32*f1 + 128*f2`, `k_sf = f4 + 4*f5` (f3 is stride-0, ignored). -### Roundtrip verifier (commit `aa209dd`) +### `if/else if` fallthrough (commit `6626b75`) +`int dst_idx = 0;` with `if (LayoutRank == 2) {...} else if (LayoutRank == 3) {...}` — if neither branch matched, all threads wrote to dst[0]. Fix: branchless `layout_sf(make_coord(...))`. -Added `check_sf_forward_kernel` that compares `src[src_idx]` against `dst[dst_idx]` byte-by-byte. If this passes, the remap is correct. If it fails, the forward mapping is wrong. This is the key diagnostic — it separates remap correctness from other issues (A/B packing, FP8 type interpretation, stale code). +### `col_major_src` ambiguity (commit `7285331`) +Boolean flag didn't capture the actual source memory layout. Replaced with explicit `src_stride_mn, src_stride_ksf` integers. -## How We Got Here - -### Phase 1: Pipeline debugging -Added prints at every MoE stage. All magnitudes reasonable, no NaN. Signal present but buried. BF16 reference showed cosine ≈ 0 — the GEMM was wrong. - -### Phase 2: Isolating the GEMM -Standalone tests proved: all-ones → cosine 1.0, random → cosine ≈ 0. The SF remap was the culprit. - -### Phase 3: Blog post analysis -Veitner's blog (veitner.bearblog.dev) explained the CUTLASS SF layout construction in detail: -- K-major atom: `Shape, Shape>`, `Stride, Stride<_0,_1>>` -- SFA tiled as `make_shape(M, K, L)`, SFB as `make_shape(N, K, L)` with `Step<_2,_1,_3>` -- First atom mode is M/N, second is K -- `f3` (stride-0 inner K) is within one SF group — always maps to offset 0 -- `k_sf = logical_k / 16` (the /16 concept) -- Public logical access: `tensor_sfa(make_coord(m, k, 0))` - -### Phase 4: Multiple failed coordinate extraction attempts -- Original: `get<0..2>` = M, `get<4..5>` = K (wrong) -- Swapped: `get<0..2>` = K, `get<3..5>` = M (still wrong — stride-0 ambiguity) -- Mike's correction: `mn = f0 + 32*f1 + 128*f2`, `k_sf = f4 + 4*f5` (correct inverse, but still ≈0.2 cosine — source stride issue) - -### Phase 5: Forward mapping -Switched from reverse (`idx2crd` over dst) to forward (iterate over src, compute dst via `layout_sf()`). Multiple bugs: -- Hierarchical coordinate nesting was wrong (missing L component, wrong depth) -- Flat `make_coord(mn, k*16)` didn't auto-decompose (contrary to expectation) -- `if/else if` fallthrough wrote dst[0] for unmatched ranks, breaking uniform SF -- Source strides: `col_major_src` bool was ambiguous — replaced with explicit `src_stride_mn, src_stride_ksf` -- Allocation: `cosize` replaced with `size(filter_zeros())` to match CUTLASS examples - -### Phase 6: Current approach (commit `aa209dd`) -Branchless forward mapping with flat logical coordinates, explicit strides, `filter_zeros` allocation, roundtrip verifier. This should be the correct implementation — the verifier will confirm. - -## Key Commits (recent) - -| Commit | Description | -|--------|-------------| -| `c384198` | Fix: SF remap uses cute::cosize() instead of cute::size() | -| `deb6b32` | Swap M/K in SF remap + add printf diagnostics | -| `a09b9b5` | Remove printf and diag function (build fix) | -| `30b6c89` | Correct k_sf = f4 + 4*f5 (Mike's patch) | -| `63e67e1` | Rewrite as forward mapping (source→dst) | -| `f6fd549` | Restore col_major_src handling | -| `7285331` | Replace col_major_src with explicit source strides | -| `6fc8fa6` | Use flat logical coordinate layout_sf(make_coord(mn, k_elem, 0)) | -| `6626b75` | Use filter_zeros for allocation + no-branch forward mapping | -| `aa209dd` | Add SF remap roundtrip verifier | +### Allocation size (commit `6626b75`) +Used `cute::cosize(layout)` which includes padding. CUTLASS examples use `cute::size(cute::filter_zeros(layout))` which gives the actual number of stored elements. ## Architecture Notes -### DeepSeek-V4 MoE Layer Forward Pass -``` -residual = x -x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base) -x = attn_norm(x) -x = attn(x) ← o_a_proj is BF16→NVFP4 quantized here -x = hc_post(x, residual, post, comb) +### CUTLASS SF Layout + +The SM100 NVFP4 SF layout (from Veitner's blog + CUTLASS source): -residual = x -x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base) -x = ffn_norm(x) -x = ffn(x) ← Our NVFP4 mega_moe kernel -x = hc_post(x, residual, post, comb) ``` +SfKMajorAtom = Layout< + Shape, Shape>, + Stride, Stride<_0, _1>>> + +SFA: tile_to_shape(SfAtom, make_shape(M, K, L), Step<_2, _1, _3>) +SFB: tile_to_shape(SfAtom, make_shape(N, K, L), Step<_2, _1, _3>) +``` + +- First atom mode `(32, 4)` stride `(16, 4)` → M/N dimension +- Second atom mode `(SFVecSize, 4)` stride `(0, 1)` → K dimension +- Stride-0 means 16 K elements share one SF byte (aliasing) +- Public logical access: `tensor_sf(make_coord(m_or_n, k_element, l))` +- `k_sf = k_element / SFVecSize` (the /16 concept) ### NVFP4 MoE Pipeline ``` @@ -117,34 +101,14 @@ L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → outp scatter with routing weights → y ``` -### CUTLASS SF Layout (from blog + code review) - -The SM100 NVFP4 SF layout is constructed from a K-major atom: -``` -SfKMajorAtom = Layout< - Shape, Shape>, - Stride, Stride<_0, _1>>> -``` - -- First mode `(32, 4)` with stride `(16, 4)` → covers M/N dimension -- Second mode `(SFVecSize, 4)` with stride `(0, 1)` → covers K dimension -- `stride=0` means all 16 values in an SF group map to the same byte -- SFA tiled as `make_shape(M, K, L)` with `Step<_2, _1, _3>` -- SFB tiled as `make_shape(N, K, L)` with `Step<_2, _1, _3>` -- Public logical access: `tensor_sf(make_coord(m_or_n, k_element, l))` - -### Checkpoint Layers (layer 0) -- **MoE experts 0-210, 212-255:** gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale) -- **Expert 211:** shared expert, gate_proj + up_proj only (no down_proj) -- **o_a_proj.weight:** BF16 (16384, 4096) — NOT quantized by ModelOpt -- **o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor:** NVFP4 -- **Gate weight, norms, sinks, position_bias:** BF16 +### Per-element multiply order (from blog) +`res += A_fp4 * SFA_fp8 * B_fp4 * SFB_fp8` ## Next Steps -1. **Check SF-VERIFY output** — if sfa_errors=0 and sfb_errors=0, remap is correct -2. **If remap correct but cosine still low:** issue is A/B packing, FP8 type interpretation, or stale code -3. **If remap has errors:** debug the forward mapping further -4. **Run deterministic prompt** — "The capital of France is" should produce "Paris" -5. **Once working:** clean up debug prints, remove verifier +1. **Trace the full MoE pipeline** — the GEMM works, so the bug is in how data gets TO the GEMM or how results are used AFTER +2. **Check A/B packing** — the E2M1 packed data layout for A and B matrices in the CUTLASS GEMM +3. **Check activation staging** — `stage_activation` quantizes BF16 → FP4. Is the packing correct for CUTLASS? +4. **Check weight transform** — `transform_nvfp4_weights_for_mega_moe` transposes weights and scales. Are the strides correct? +5. **Test with real vLLM inference** — after fixing the real bug, "The capital of France is" should produce "Paris" 6. **Quality optimization:** investigate o_a_proj BF16→NVFP4 quantization