6.1 KiB
NVFP4 MegaMoE Debug Log
Current State (May 16, 2026 — 01:15 UTC)
Status: GEMM verified correct, SF remap verified correct, B layout verified correct, L2 slot_token cleaned up. vLLM still produces garbage. The checkpoint input_scale red herring is documented below. The bug remains unidentified.
What's verified correct (DO NOT re-investigate):
- SF remap: roundtrip verifier = 0 errors, forward mapping with
layout_sf(make_coord(mn, k_elem, 0)) - GEMM math: uniform FP4 + uniform SF → exact output (72.0 = 1.5² × K)
- B matrix layout: byte transpose correct, column-dependent weight test passes
- L2 GEMM input: slot-major, no gather needed (cleaned up dead
slot_tokenparam)
What's still broken:
- "The capital of France is" → garbage tokens (varies, e.g.
-W'MSG173,( z tractor) - All magnitudes reasonable, no NaN anywhere
Red Herring: Checkpoint input_scale (May 16, 00:10 UTC)
Mike's code review suggested using the checkpoint's input_scale as the activation global scale instead of the dynamic amax/(6*448). This was wrong and has been reverted (commit 79b9bec).
What happened:
- Applied
stage_activation(hidden_states, input_global_scale=w13_input_scale) w13_input_scale = 2.86e-4(from checkpoint, same for all experts in a layer)- Dynamic
amax/(6*448)= 3.70e-3 (13x larger) - Using 2.86e-4 for normalization:
x / 2.86e-4produces values in the thousands for typical hidden states (amax ~5) - ALL block scales saturated to 448.0 (max float8_e4m3 value)
- Output: still garbage, but now with quantized-to-death activations
Why it's wrong:
The checkpoint input_scale is NOT the input_global_scale = amax/(6*448) normalization constant. They are different quantities:
input_global_scalenormalizes data to [0,1] before FP4 quantizationinput_scaleis a calibration constant from the Quark quantization tool
The calibration amax for input_scale = 2.86e-4 would be 2.86e-4 * 6 * 448 = 0.77. Runtime hidden states have amax ~5-10. The input_scale was computed on a different data distribution (probably calibration data, not actual inference data).
The correct use of input_scale is still unknown. The Quark path computes alpha = input_scale * weight_scale_2, but this may assume BF16 activations (not FP4-quantized). Our CUTLASS kernel requires FP4 input, so we must quantize with the dynamic scale.
Preserved for future use: _w13_input_scale and _w2_input_scale are now saved in finalize_weights (not dropped) in case we need them for alpha computation later.
Checkpoint input_scale values (layer 0, all experts identical):
gate_proj.input_scale=up_proj.input_scale= 2.862840e-04down_proj.input_scale= 3.069196e-02weight_scale_2= 4.650298e-05 (all projections)- Scales vary by layer: layer 0 = 2.86e-4, layer 60 = 1.07e-2
SF Remap — Final Correct Implementation (commit 6626b75)
int mn = tid / K_sf;
int k_sf = tid % K_sf;
int k_elem = k_sf * 16;
int dst_idx = layout_sf(cute::make_coord(mn, k_elem, 0));
dst[dst_idx] = src[mn * src_stride_mn + k_sf * src_stride_ksf];
Source strides: SFA=(K_sf, 1), SFB=(1, N)
Allocation: cute::size(cute::filter_zeros(layout))
Previous Bugs Fixed
BF16 reference comparison (May 15, 23:37 UTC)
The 0.2 cosine against the Python BF16 dequantization reference was a RED HERRING. The reference is wrong, not the GEMM. 8+ iterations of SF remap changes all produced the same 0.2 cosine because it was never about the remap. A wrong reference is worse than no reference.
cute::size vs cute::cosize (commit c384198)
Iteration bound used size (logical) instead of cosize (physical). Fixed but insufficient alone.
M/K coordinate extraction in idx2crd (commits deb6b32 → 30b6c89)
Original had M/K swapped. Mike's correction: mn = f0 + 32*f1 + 128*f2, k_sf = f4 + 4*f5.
if/else if fallthrough (commit 6626b75)
Dead dst_idx=0 when no branch matched. Fix: branchless layout_sf(make_coord(...)).
col_major_src ambiguity (commit 7285331)
Boolean flag → explicit src_stride_mn, src_stride_ksf.
Allocation size (commit 6626b75)
cosize → size(filter_zeros(layout)).
L2 slot_token cleanup (commit bb5a1ba)
nvfp4_moe_l2 accepted slot_token but never passed it to GEMM. Removed dead parameter.
Mike's Code Review Answers
- E2M1 packing: Confirmed correct — element 2j in low nibble, 2j+1 in high nibble. Suggested hardware oracle with
__nv_cvt_bfloat16raw2_to_fp4x2. - A RowMajor: Confirmed correct — no micro-tiling for A.
- B ColumnMajor: Byte transpose confirmed correct by test. Mike flagged theoretical concern about nibble-level transpose but our test passed.
- Alpha/global scale: Mike suggested
alpha = input_scale * weight_scale_2(from checkpoint). We tried it — wrong for activation normalization. The correct use ofinput_scalein our pipeline is still TBD. - Gate/up correction: Mathematically valid.
up_half *= up_weight_gs / gate_weight_gsis equivalent to per-column alpha.
Architecture Notes
NVFP4 MoE Pipeline
stage_activation(hidden_states) → x_fp4, x_sf, input_global_scale
L1 GEMM: (x_fp4, x_sf) @ (l1_w, l1_sf) with alpha=igs*l1_global_sf → gate_up
SiLU(gate) * up → activated
stage_activation(activated) → l1_fp4, l1_sf, l1_igs
L2 GEMM: (l1_fp4, l1_sf) @ (l2_w, l2_sf) with alpha=l1_igs*l2_global_sf → output
scatter with routing weights → y
Per-element multiply order
res += A_fp4 * SFA_fp8 * B_fp4 * SFB_fp8
Next Steps
- Compare against BF16 model — run the same prompt on a known-good implementation to see if the attention layers are working and only MoE is broken
- Check the vLLM model integration — how does the MoE output get mixed with the residual? Is
hc_postcorrect? - Understand the Quark input_scale contract — maybe we need to NOT quantize activations to FP4 and instead use BF16 input
- Add per-layer token output logging — see which layer the tokens go off the rails
- Check o_a_proj BF16 handling — it's kept in BF16 in the checkpoint, is it being processed correctly?