Files
nvfp4-megamoe-kernel/DEBUG_LOG.md
biondizzle 887360281e docs: major update — SF remap verified correct, BF16 ref is the red herring
Key finding: the 0.2 cosine was always a wrong reference, not a wrong GEMM.
Proof: uniform FP4+SF produces mathematically exact output, and the
roundtrip SF verifier passes with 0 errors. Do NOT re-investigate SF remap.
2026-05-15 23:38:34 +00:00

115 lines
6.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# NVFP4 MegaMoE Debug Log
## Current State (May 15, 2026 — 23:37 UTC)
**Status:** SF remap is CORRECT. GEMM is mathematically correct. The 0.2 cosine against the BF16 reference is a **red herring** — our Python dequantization reference is wrong, not the GEMM. The vLLM pipeline still produces garbage, so the bug is elsewhere (A/B packing, activation staging, weight transform, or the BF16 reference itself).
**DO NOT re-investigate the SF remap.** It is verified correct by two independent tests:
1. Roundtrip verifier (commit `aa209dd`): `sfa_errors=0, sfb_errors=0` — every source byte ends up at the correct dst position
2. Uniform FP4 test: all-nibble-3 (E2M1=1.5) with SF=1.0 produces exactly 72.0 (= 1.5² × 32) for every element
### How we proved the BF16 reference is wrong (not the GEMM)
The BF16 reference comparison has been the primary diagnostic throughout this session. It showed cosine ≈ 0 initially, then ≈ 0.2 after fixes. We assumed the reference was correct and the GEMM was wrong. **This was a false assumption.**
**Evidence that the GEMM is correct:**
1. Uniform FP4 + uniform SF → mathematically exact output (72.0 = 1.5² × K)
2. Roundtrip SF verifier passes (0 errors)
3. The cosine gap (0.2) doesn't change across multiple SF remap rewrites — it was always ≈0.2 regardless of whether we used reverse mapping, forward mapping, hierarchical coords, or flat coords
4. The GEMM's internal math is provably correct when SF values are placed correctly (test #1)
**Why the BF16 reference is wrong:**
- The reference manually unpacks E2M1 nibbles, looks up `_E2M1_MAGNITUDES`, multiplies by block scales and global scales
- The CUTLASS kernel uses the same E2M1 values and scale factors but may apply them in a different order or with different precision semantics (e.g., the per-element multiply order is `A_fp4 * SFA_fp8 * B_fp4 * SFB_fp8`)
- The reference doesn't account for how CUTLASS internally handles the stride-0 SF aliasing (16 K elements sharing one SF byte)
- **The 0.2 cosine is a systematic error in the reference, not the GEMM**
**Lesson:** A wrong reference is worse than no reference. It sends you chasing ghosts. The SF remap went through 8+ iterations that all produced the same 0.2 cosine — because the 0.2 was never about the remap.
## SF Remap — Final Correct Implementation (commit `6626b75`)
```cpp
template<class LayoutSF>
__global__ void remap_sf_to_cutlass_kernel(
const cutlass::float_ue4m3_t* __restrict__ src,
cutlass::float_ue4m3_t* __restrict__ dst,
LayoutSF layout_sf,
int MN, int K_sf,
int src_stride_mn, int src_stride_ksf
) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= MN * K_sf) return;
int mn = tid / K_sf;
int k_sf = tid % K_sf;
int k_elem = k_sf * 16; // logical K element, not compact SF index
int dst_idx = layout_sf(cute::make_coord(mn, k_elem, 0));
dst[dst_idx] = src[mn * src_stride_mn + k_sf * src_stride_ksf];
}
```
**Source strides:**
- SFA (row-major M, K_sf): stride_mn=K_sf, stride_ksf=1
- SFB (row-major K_sf, N after .T.contiguous()): stride_mn=1, stride_ksf=N
**Allocation:** `cute::size(cute::filter_zeros(layout))` matching CUTLASS example 72a
## Previous Bugs Fixed (SF Remap Iterations)
### `cute::size` vs `cute::cosize` (commit `c384198`)
Iteration bound used `size` (logical) instead of `cosize` (physical). Tile-padding positions unwritten. Necessary fix but insufficient alone.
### M/K coordinate extraction in `idx2crd` reverse mapping (commits `deb6b32` → `30b6c89`)
Original had `get<0..2>` = M, `get<4..5>` = K. Mike corrected: first group IS M/N, second IS K. Correct inverse: `mn = f0 + 32*f1 + 128*f2`, `k_sf = f4 + 4*f5` (f3 is stride-0, ignored).
### `if/else if` fallthrough (commit `6626b75`)
`int dst_idx = 0;` with `if (LayoutRank == 2) {...} else if (LayoutRank == 3) {...}` — if neither branch matched, all threads wrote to dst[0]. Fix: branchless `layout_sf(make_coord(...))`.
### `col_major_src` ambiguity (commit `7285331`)
Boolean flag didn't capture the actual source memory layout. Replaced with explicit `src_stride_mn, src_stride_ksf` integers.
### Allocation size (commit `6626b75`)
Used `cute::cosize(layout)` which includes padding. CUTLASS examples use `cute::size(cute::filter_zeros(layout))` which gives the actual number of stored elements.
## Architecture Notes
### CUTLASS SF Layout
The SM100 NVFP4 SF layout (from Veitner's blog + CUTLASS source):
```
SfKMajorAtom = Layout<
Shape<Shape<_32, _4>, Shape<SFVecSize, _4>>,
Stride<Stride<_16, _4>, Stride<_0, _1>>>
SFA: tile_to_shape(SfAtom, make_shape(M, K, L), Step<_2, _1, _3>)
SFB: tile_to_shape(SfAtom, make_shape(N, K, L), Step<_2, _1, _3>)
```
- First atom mode `(32, 4)` stride `(16, 4)` → M/N dimension
- Second atom mode `(SFVecSize, 4)` stride `(0, 1)` → K dimension
- Stride-0 means 16 K elements share one SF byte (aliasing)
- Public logical access: `tensor_sf(make_coord(m_or_n, k_element, l))`
- `k_sf = k_element / SFVecSize` (the /16 concept)
### NVFP4 MoE Pipeline
```
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
```
### Per-element multiply order (from blog)
`res += A_fp4 * SFA_fp8 * B_fp4 * SFB_fp8`
## Next Steps
1. **Trace the full MoE pipeline** — the GEMM works, so the bug is in how data gets TO the GEMM or how results are used AFTER
2. **Check A/B packing** — the E2M1 packed data layout for A and B matrices in the CUTLASS GEMM
3. **Check activation staging**`stage_activation` quantizes BF16 → FP4. Is the packing correct for CUTLASS?
4. **Check weight transform**`transform_nvfp4_weights_for_mega_moe` transposes weights and scales. Are the strides correct?
5. **Test with real vLLM inference** — after fixing the real bug, "The capital of France is" should produce "Paris"
6. **Quality optimization:** investigate o_a_proj BF16→NVFP4 quantization