fn rows are ordered [W_pre(4), W_res(16), W_post(4)], matching [A_raw, B_raw, C_raw]
in _dynamic_params. The loader was reading them as [W_pre(4), W_post(4), W_res(16)], which
shifted the W_res rows by 4 and loaded the wrong rows as W_post. As a result the
Sinkhorn-Knopp B_l matrix was computed from the wrong weights, allowing the
residual to explode (|X| 0.8 → 160K across 61 layers).
Correct: fn[0:4]=W_pre, fn[4:20]=W_res, fn[20:24]=W_post
Wrong: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
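
The row shift described in this note can be illustrated with a small sketch. The helper `split_rows` and the layout constants are hypothetical names introduced here for illustration, not part of the loader; row indices stand in for the real weight rows.

```python
# Hypothetical layouts: (block name, number of rows), in storage order.
LAYOUT_PRE_RES_POST = (("W_pre", 4), ("W_res", 16), ("W_post", 4))
LAYOUT_PRE_POST_RES = (("W_pre", 4), ("W_post", 4), ("W_res", 16))

def split_rows(rows, layout):
    """Slice a stacked row list into named blocks following `layout`."""
    total = sum(count for _, count in layout)
    if len(rows) != total:
        raise ValueError(f"expected {total} rows, got {len(rows)}")
    blocks, start = {}, 0
    for name, count in layout:
        blocks[name] = rows[start:start + count]
        start += count
    return blocks

# Use row indices as stand-ins for the 24 rows of fn.
fn_rows = list(range(24))
correct = split_rows(fn_rows, LAYOUT_PRE_RES_POST)
wrong = split_rows(fn_rows, LAYOUT_PRE_POST_RES)

# Under the wrong layout, W_post receives rows 4..7, which this note
# identifies as the first four W_res rows.
print(correct["W_post"], wrong["W_post"])
```

Keeping the layout in a single table like this means a reordering is a one-line change rather than several hand-written slices.
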
- System prompt added via chat template (reasoning model needs instructions)
- MAX_NEW_TOKENS=512 (reasoning chain-of-thought needs more tokens)
- --no-inverse-rope flag to test without inverse RoPE on attn output
- --skip-moe flag to debug with shared expert only
- --max-tokens and --prompt CLI overrides
- minimal_e2e_test(): processes 'The' through full model, checks logits,
tracks per-layer residual stream evolution, reports NaN/Inf/spread
- INVERSE_ROPE doc: explains partial RoPE only affects last 64/512 dims,
first 448 always un-RoPE'd, relative encoding may be intentional
FMHA pads N to the next multiple of 128. For N << 128 (e.g. 5 tokens),
the 123 padded zero-K entries each contribute exp(0)=1 to the softmax
denominator, diluting real attention weights by ~128/5 = 25.6x when the
real scores are near 0.
This caused the model to produce incoherent output for short prompts.
Fix: use SDPA for seq_len < 120 (no padding), FMHA for longer
sequences where the padding effect is negligible.
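
A minimal sketch of the dilution arithmetic and the length-based dispatch, assuming padded keys enter the softmax unmasked with score 0 as described above. The names `real_key_weights`, `padded_len`, and `pick_backend` are hypothetical and only illustrate the idea; they are not the actual kernel or dispatch code.

```python
import math

FMHA_TILE = 128        # padding granularity stated in this note
SDPA_THRESHOLD = 120   # cutoff stated in this note

def padded_len(n, tile=FMHA_TILE):
    """Round n up to a multiple of `tile`."""
    return -(-n // tile) * tile

def real_key_weights(scores, n_pad=0):
    """Softmax weights of the real keys when n_pad extra keys score 0."""
    all_scores = list(scores) + [0.0] * n_pad
    m = max(all_scores)
    exps = [math.exp(s - m) for s in all_scores]
    denom = sum(exps)
    return [e / denom for e in exps[:len(scores)]]

def pick_backend(seq_len, threshold=SDPA_THRESHOLD):
    """Short sequences avoid the padded kernel."""
    return "sdpa" if seq_len < threshold else "fmha"

n = 5
scores = [0.0] * n                       # real scores near zero
clean = real_key_weights(scores)         # each weight is 1/5
padded = real_key_weights(scores, padded_len(n) - n)  # each is 1/128
dilution = clean[0] / padded[0]
print(f"dilution = {dilution:.1f}x, backend = {pick_backend(n)}")
```

With larger real scores the dilution is smaller, since the padded keys then hold a smaller share of the denominator.
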
Also: SDPA path includes attention sinks (paper D5c), FMHA path
uses analytic sink correction via LSE.
Instead of SDPA with virtual sink position, use the production FMHA
kernel and apply the sink bias as a post-hoc correction on the output.
The correction is: O_sink = O_raw * exp(lse) / (exp(lse) + exp(sink))
This simulates the attention sink (paper D5c) without modifying the
FMHA kernel. The sink absorbs some attention mass: it enlarges the
softmax denominator, which scales down the output.
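
The correction can be checked against a direct softmax with a virtual sink position. The sketch below assumes a single query, scalar values, a scalar sink logit, and an lse in natural-log units that already includes any softmax scaling; `sink_scale` and `attend` are hypothetical helpers, not the production kernel API.

```python
import math

def sink_scale(lse, sink):
    """exp(lse) / (exp(lse) + exp(sink)), computed as sigmoid(lse - sink)."""
    d = lse - sink
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)

def attend(scores, values, sink=None):
    """Reference attention; an optional sink logit joins the denominator only."""
    extra = [sink] if sink is not None else []
    m = max(scores + extra)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + sum(math.exp(s - m) for s in extra)
    return sum(e * v for e, v in zip(exps, values)) / denom

scores = [0.3, -1.2, 2.0]
values = [1.0, -2.0, 0.5]
sink = 0.7

lse = math.log(sum(math.exp(s) for s in scores))
o_raw = attend(scores, values)                   # what the kernel returns
o_corrected = o_raw * sink_scale(lse, sink)      # post-hoc correction
o_reference = attend(scores, values, sink=sink)  # virtual sink position
print(o_corrected, o_reference)
```

Writing the factor as a sigmoid of (lse - sink) avoids overflow when either exponent is large. If a kernel reports its LSE in a different base or without the softmax scale, it would need converting first.
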
E2M1 magnitudes are [0, 0.5, 1, 1.5, 2, 3, 4, 6] NOT [0, 2, 3, 4, 6, 8, 12, 24].
The old LUT overstated every nonzero magnitude by roughly 2.7x to 4x, so
every NVFP4 dequantized weight was too large. This compounded across 61
layers, causing the residual stream to explode and producing gibberish output.
This is the root cause of the residual growth and incoherent generation.
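
The magnitude table can be cross-checked against the E2M1 bit layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit). This is a sketch for a single 4-bit code only; it assumes the sign is the high bit of the nibble and does not cover byte packing or the NVFP4 block scales. The function names are hypothetical.

```python
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def e2m1_magnitude_from_bits(code3):
    """Derive the magnitude of a 3-bit (exponent, mantissa) code."""
    exp = (code3 >> 1) & 0b11
    man = code3 & 0b1
    if exp == 0:
        # Subnormal: no implicit leading one.
        return 0.5 * man
    return (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

def decode_e2m1(nibble):
    """Decode one 4-bit code via the LUT (sign assumed in bit 3)."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0b0111]

# The LUT should agree with the bit-level derivation for all 8 codes.
derived = [e2m1_magnitude_from_bits(c) for c in range(8)]
print(derived)
```

A check of this kind, run once at import time, would flag a mistyped table before any weights are dequantized.
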
Replace custom mHCBlock with wrapper around the tested production
mHCLayer class. This eliminates any bugs in my custom implementation
and uses the same code path that the model was designed for.
Weight mapping: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
base[0:4]=S_pre, base[4:8]=S_post, base[8:24]=S_res
scale[0]=alpha_pre, scale[1]=alpha_post, scale[2]=alpha_res
Checkpoint order is [pre, post, res] not [pre, res, post]:
- base[0:4] = S_pre, base[4:8] = S_post, base[8:24] = S_res
- scale[0] = alpha_pre, scale[1] = alpha_post, scale[2] = alpha_res
- W_stacked rows: [W_pre(4), W_post(4), W_res(16)]
- Projection split: A_raw=proj[:,0:4], C_raw=proj[:,4:8], B_raw=proj[:,8:24]
Assuming [pre, res, post] instead was causing B_l to be near-identity and
C_l to be near 2.0, leading to exponential residual stream growth.
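
A short sketch of the projection split listed in this note, using plain nested lists in place of tensors. The function name `split_projection` is hypothetical; only the column ranges come from the note.

```python
def split_projection(proj):
    """Split each 24-wide row into (A_raw, B_raw, C_raw).

    Columns follow the checkpoint order [pre, post, res] from this note:
    A_raw = 0:4, C_raw = 4:8, B_raw = 8:24.
    """
    for row in proj:
        if len(row) != 24:
            raise ValueError(f"expected 24 columns, got {len(row)}")
    a_raw = [row[0:4] for row in proj]
    c_raw = [row[4:8] for row in proj]
    b_raw = [row[8:24] for row in proj]
    return a_raw, b_raw, c_raw

# Column indices stand in for real projection values.
proj = [list(range(24)), list(range(100, 124))]
a_raw, b_raw, c_raw = split_projection(proj)
print(len(a_raw[0]), len(b_raw[0]), len(c_raw[0]))
```

Returning the pieces in (A, B, C) order while slicing them in (pre, post, res) order keeps the storage layout separate from the naming used downstream.
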