E2M1 magnitudes are [0, 0.5, 1, 1.5, 2, 3, 4, 6] NOT [0, 2, 3, 4, 6, 8, 12, 24].
The old LUT was 4x the correct values, causing every NVFP4 dequantized
weight to be 4x too large. This compounded across 61 layers, causing
the residual stream to explode and producing gibberish output.
This is the root cause of the residual growth and incoherent generation.
Replace custom mHCBlock with wrapper around the tested production
mHCLayer class. This eliminates any bugs in my custom implementation
and uses the same code path that the model was designed for.
Weight mapping: fn[0:4]=W_pre, fn[4:8]=W_post, fn[8:24]=W_res
base[0:4]=S_pre, base[4:8]=S_post, base[8:24]=S_res
scale[0]=alpha_pre, scale[1]=alpha_post, scale[2]=alpha_res
Checkpoint order is [pre, post, res] not [pre, res, post]:
- base[0:4] = S_pre, base[4:8] = S_post, base[8:24] = S_res
- scale[0] = alpha_pre, scale[1] = alpha_post, scale[2] = alpha_res
- W_stacked rows: [W_pre(4), W_post(4), W_res(16)]
- Projection split: A_raw=proj[:,0:4], C_raw=proj[:,4:8], B_raw=proj[:,8:24]
This was causing B_l to be near-identity and C_l to be near-2.0,
leading to exponential residual stream growth.
MoE:
- Hash routing (first 3 layers): tid2eid lookup → 6 experts, uniform weights
- Dense routing (remaining): sqrt(softplus(gate)) → top-6 → renormalize
- 384 NVFP4 experts, each gate+up+down with SiGLU clamping
- Weighted combine × routed_scaling_factor + shared expert
KV cache:
- SimpleKVCache: BF16 flat (1, max_seq, hd) per layer
- Appends new K,V each decode step
- FMHA now attends over full cached sequence (not just current token)
- RoPE applied per-position on K cache
This should produce meaningful output — the model now has all
architectural components except proper mHC normalization.
Without routed experts, F_out is always positive, causing unbounded
growth. Emergency RMSNorm on the residual keeps values bounded.
Remove once MoE is wired.
- Loads all 95 shards, assigns layers round-robin across 8 B200s
- ~8 layers per GPU, ~118GB weights per GPU (fits in 183GB)
- 3-phase pipeline: load weights → JIT compile → inference
- Activations move between GPUs at layer boundaries (NVLink)
- No streaming, no shard caching, no per-layer CPU loads
- Includes timing for each phase
Without mHC, values explode to 761K after first layer.
Added per-residual RMSNorm + BF16 clamp to keep values bounded.
This won't produce correct model output (mHC is load-bearing),
but keeps the pipeline running so we can verify the kernel.
Streams weights one layer at a time from 95 safetensors shards.
NVFP4 dequant → BF16 matmul for baseline (production uses tcgen05 MMA).
Runs token-by-token decode loop with production FMHA kernel.
Known gaps for first run:
- FFN (MoE) skipped — not the kernel under test
- mHC simplified — not the kernel under test
- RoPE skipped in baseline
- compressor/indexer bypassed (raw KV for now)
FMHA kernel is the component under test (cos ≥ 0.999993).
The 6-warp multi-tile kernel already supports batch natively via
dim3 grid(1, n_h, batch). Removed Python for-loop for 4D input.
Single kernel launch per layer for batched decode instead of
batch_size launches.
T>1 prefill still uses per-batch dispatch (E8 future work).