biondizzle/nvfp4-megamoe-kernel

Fork 0

Files

biondizzle e7c3341317 docs: update DEBUG_LOG with M/K swap root cause

2026-05-15 20:03:20 +00:00

12 KiB

Raw Blame History

NVFP4 MegaMoE Debug Log

Current State (May 15, 2026)

Status: Second root cause identified — SF remap coordinate extraction has M/K swapped. Awaiting rebuild and test.

Root Cause #1 (partially fixed): `cute::size` vs `cute::cosize` (commit `c384198`)

The SF remap kernel used cute::size(layout_sf) as the iteration bound instead of cute::cosize(layout_sf). This left tile-padding positions unwritten (zero). Fix: one-line change size → cosize. However, this fix alone did NOT resolve the cosine ≈ 0 problem — random data still produced garbage.

Root Cause #2 (current): M/K coordinates swapped in SF remap (commit `deb6b32`)

After the cosize fix failed to resolve the issue, we ran deeper diagnostics:

All-ones test (M=1, N=32, K=32): cosine = 1.0 ✅ (uniform SF masks any coordinate bug)
Random data (same dimensions): cosine ≈ 0.2 ❌
Isolated SFA and SFB remap: both broken (cosine 0.16 and 0.21 respectively)

The remap kernel's coordinate extraction assumed get<0..2> = M group and get<4..5> = K group. But analysis of the CUTLASS Sm1xxBlockScaledConfig layout reveals the opposite: the SfAtom is K-major with Step<_2,_1>, meaning the first atom dimension tiles along K (problem dim 1) and the second tiles along M (problem dim 0). So get<0..2> = K group, get<3..5> = M group.

Previous (wrong):

m = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
k_sf = get<4>(flat) + get<5>(flat) * 4;

Fixed (commit deb6b32):

k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
m = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);

Also added printf diagnostics in the remap kernel to print the first 10 coordinate mappings, so we can verify the extraction at runtime.

Why the M/K swap produces cosine ≈ 0 instead of just a permuted output: The source SF data is row-major (M, K_sf) for SFA. If we read src[wrong_m * K_sf + wrong_k_sf] instead of src[m * K_sf + k_sf], and the wrong indices don't correspond to valid source positions, we get completely unrelated SF values. This corrupts the per-block scaling, making the GEMM output essentially random relative to the correct answer.

How We Found It

Step 1: Pipeline trace

Added debug prints at every stage (L1 GEMM, SiLU, L2 GEMM, scatter). All magnitudes reasonable, no NaN. The signal was present but buried.

Step 2: BF16 reference comparison

Built a reference path that dequantizes FP4→BF16 and runs a standard matmul. Compared to the CUTLASS GEMM output. Result: cosine ≈ 0 across all 8 TP ranks — the GEMM output was essentially uncorrelated with the correct answer.

Step 3: Standalone GEMM tests

All-ones data (M=1, N=32, K=32): cosine = 1.0 ✅
Random data (M=1, N=32, K=32): cosine ≈ 0.2 ❌
Random data (M=128, N=6144, K=7168): cosine ≈ 0 ❌

The all-ones test passing proved the GEMM math and data layout were correct. Random data failing proved the SF handling was broken for non-uniform values.

Step 4: Found the bug

The CU file had a comment on lines 114-115 explicitly warning: "Allocation must use cute::cosize() (physical size including tile padding), not cute::size() (logical size)." All allocation sites used cosize correctly. But the iteration bound in the remap kernel (line 128) used size. One line we missed when we previously audited size→cosize.

Hypotheses Investigated

1. ❌ NaN/Inf in GEMM

Ruled out. All outputs finite, no NaN detected at any stage.

2. ❌ Weight shape mismatch

Ruled out. All shapes consistent: L1 w=(48,3584,6144) sf=(48,448,6144), L2 w=(48,1536,7168) sf=(48,192,7168).

3. ❌ Global scale folding precision loss

Previously identified (commit da5572f). Folding float8 block_sf × float32 global_sf → float8 loses ~25% precision. Fixed by passing global scales as per-expert alpha. Did not fix the garbage output (wrong root cause).

4. ❌ Broken kernel (CUDA_ERROR_LAUNCH_FAILED)

Previously identified (May 13). The original DeepGEMM kernel crashed. Replaced with CUTLASS-based implementation. Standalone test showed cosine=1.0 but only with uniform SF data.

5. ❌ E2M1 packing convention mismatch

Investigated but ruled out. Both stage_activation and checkpoint weights use the same packing (even→low nibble, odd→high nibble). The all-ones test proved packing is correct.

6. 🔍 Attention output corruption from o_a_proj quantization

Status: Deferred. The checkpoint has o_a_proj.weight as BF16 (16384 × 4096). The weight loader quantizes it to NVFP4 at load time. This is a potential source of quality loss but is NOT the cause of the garbage output (the GEMM bug was). May revisit for quality optimization after the kernel fix is confirmed.

7. ✅ BF16 reference comparison — COSINE ≈ 0

Status: CONFIRMED. Cosine similarity ≈ 0 between NVFP4 GEMM and BF16 dequantized reference across all 8 TP ranks. This proved the problem was in the CUTLASS GEMM itself, not upstream.

[TP0] cosine=-0.001789  mse=1.0201e+01  nvfp4_amax=8.5625  ref_amax=8.0000
[TP1] cosine= 0.030470  mse=1.0157e+01  nvfp4_amax=8.0625  ref_amax=8.3125
[TP2] cosine=-0.009217  mse=9.5978e+00  nvfp4_amax=9.1875  ref_amax=7.5312
[TP3] cosine= 0.001786  mse=9.4161e+00  nvfp4_amax=8.6875  ref_amax=8.8750
[TP4] cosine= 0.007108  mse=7.5709e+00  nvfp4_amax=7.3125  ref_amax=8.8750
[TP5] cosine=-0.000572  mse=7.8290e+00  nvfp4_amax=7.5938  ref_amax=10.562
[TP6] cosine= 0.012143  mse=9.2720e+00  nvfp4_amax=8.0000  ref_amax=8.1250
[TP7] cosine=-0.010009  mse=9.0296e+00  nvfp4_amax=6.6250  ref_amax=36.500

8. ✅ CUTLASS SF remap `size` vs `cosize` bug (commit `c384198`) — partial fix

Status: Fixed but insufficient. Changing size to cosize was necessary (tile-padding positions were unwritten) but did NOT resolve the cosine ≈ 0 problem. The real issue was the M/K swap in coordinate extraction (hypothesis #9).

9. ✅ SF remap M/K coordinate swap — ROOT CAUSE (commit `deb6b32`)

Status: FIXED, awaiting rebuild verification. The SF remap kernel had M and K coordinates swapped in the flattened coordinate extraction. The CUTLASS Sm1xxBlockScaledConfig uses a K-major SfAtom with Step<_2,_1>, meaning get<0..2> maps to the K dimension and get<3..5> maps to the M dimension. Our code had it backwards.

How we proved it:

cosize fix alone didn't resolve cosine ≈ 0
All-ones test (uniform SF) still passed — coordinate bugs are invisible with uniform data
Isolated SFA vs SFB: both broken (cosine 0.16, 0.21)
Analyzed CUTLASS source: Sm1xxBlockScaledBasicChunk uses SfKMajorAtom where first group = K, second = M
Added printf diagnostics to verify at runtime

Key Commits

Commit	Description
`da5572f`	Stop folding global scale into float8 block scales (precision loss fix)
`d0ed3d8`	Add L2, SiLU, and scatter pipeline prints
`995589a`	Add FP4 quantization round-trip diagnostic
`c421a66`	Add BF16 reference GEMM + cosine comparison for L1
`2fd55a9`	Fix weight reshape bug (K_half,N×2 → K,N) + igs double-count
`9159cb6`	Add DEBUG_LOG.md documentation
`de8acc7`	Dump raw GEMM inputs + first 8 output values
`755f9ad`	Fix per_expert_alpha ref + clean up BF16 reference scaling
`df916b8`	Fix gs.item() for multi-element tensor
`7739674`	Fix gs scalar conversion with .cpu().tolist() + add traceback
`1b63a46`	Update DEBUG_LOG with cosine≈0 finding
`fee5a97`	Fix cosine_similarity dim for M>0
`f9330a1`	Standalone M=1 GEMM test with deterministic data
`363dd89`	Dimension sweep to isolate GEMM bug
`60f7f60`	Ultra-minimal GEMM with all-ones (cosine=1.0!)
`67dcfa8`	Random data at small dims + alpha sweep
`c384198`	Fix: SF remap uses cute::cosize() instead of cute::size()
`deb6b32`	FIX: Swap M/K in SF remap coordinate extraction + add printf diagnostics

Bugs Fixed During This Debug Session

🔴 ROOT CAUSE: SF remap M/K coordinate swap (commit `deb6b32`)

Bug: The SF remap kernel in cutlass_nvfp4_gemm.cu had M and K coordinates swapped in the flattened coordinate extraction. The code assumed get<0..2> = M group and get<4..5> = K group, but the CUTLASS SfKMajorAtom layout has K first and M second (K-major, with Step<_2,_1> tiling).

Previous (wrong):

m = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
k_sf = get<4>(flat) + get<5>(flat) * 4;

Fixed:

k_sf = get<0>(flat) + get<1>(flat) * 32 + get<2>(flat) * 128;
m = get<3>(flat) + get<4>(flat) * InputSFVectorSize + get<5>(flat) * (InputSFVectorSize * 4);

Why the original code looked correct: The comment said ((32, 4, n_m_tiles), (16, 4, n_k_tiles)) — M first, K second. But this is the logical M/K assignment, not the physical flattened order. The actual CUTE layout for K-major SF puts the K group first in the flattened coordinate.

Impact: Every SF value was read from src[wrong_m * K_sf + wrong_k_sf] instead of src[m * K_sf + k_sf], producing completely unrelated scale factors. The GEMM output was essentially random (cosine ≈ 0).

SF remap `size` vs `cosize` (commit `c384198`) — necessary but insufficient

Bug: Iteration bound used cute::size (logical) instead of cute::cosize (physical). Tile-padding positions were never written.

Impact: With uniform SF, invisible. With non-uniform SF, additional corruption on top of the M/K swap bug. Both fixes are needed.

Weight nibble unpack reshape bug (commit `2fd55a9`)

Bug: In the BF16 reference diagnostic, reshape(K_half, -1) on 2D weight flattened N dimension.

Fix: reshape(K_half*2, N).

Impact: Only diagnostic code.

BF16 reference diagnostic: multiple bugs (commits `c421a66`→`7739674`)

Weight reshape: reshape(K_half, -1) → reshape(K_half*2, N)
per_expert_alpha not defined: reference code ran before alpha was computed
gs.item() on multi-element tensor: gs is shape (2,); fixed with .cpu().tolist()
igs double-count: multiplying by igs in both x_bf16 and final output

Impact: All bugs only in diagnostic code.

Architecture Notes

DeepSeek-V4 MoE Layer Forward Pass

residual = x
x, post, comb = hc_pre(x, hc_attn_fn, hc_attn_scale, hc_attn_base)
x = attn_norm(x)
x = attn(x)                          ← o_a_proj is BF16→NVFP4 quantized here
x = hc_post(x, residual, post, comb)

residual = x
x, post, comb = hc_pre(x, hc_ffn_fn, hc_ffn_scale, hc_ffn_base)
x = ffn_norm(x)
x = ffn(x)                           ← Our NVFP4 mega_moe kernel
x = hc_post(x, residual, post, comb)

NVFP4 MoE Pipeline

stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y

Checkpoint Layers (layer 0)

MoE experts 0-210, 212-255: gate_proj, up_proj, down_proj — all NVFP4 (uint8 + float8 scales + float32 global scale)
Expert 211: shared expert, gate_proj + up_proj only (no down_proj)
o_a_proj.weight: BF16 (16384, 4096) — NOT quantized by ModelOpt
o_b_proj, q_a_proj, q_b_proj, kv_proj, compressor: NVFP4
Gate weight, norms, sinks, position_bias: BF16

Next Steps

Rebuild container with M/K swap fix — Mike rebuilds with commit deb6b32
Run standalone random GEMM test — should now show cosine ≈ 1.0 with random data
Check printf diagnostics — verify the coordinate mapping is correct
Run deterministic prompt — "The capital of France is" should produce "Paris"
If output is still off: the M/K swap fix may need refinement — the m stride calculation (InputSFVectorSize * 4) may not be correct for all cases
Once working: remove printf diagnostics from production code, clean up debug prints
Quality optimization: investigate o_a_proj BF16→NVFP4 quantization (hypothesis #6)

12 KiB Raw Blame History Unescape Escape