Nvfp4Linear is causing CUDA context corruption (likely the CuTeDSL JIT
triggered by _ensure_initialized). Disable it for now to validate
the critical paths first:
- Production FMHA with sink bias
- Production MoE (Nvfp4MoE + Nvfp4SharedExpert)
- Production Router (dense/hash)
- Production mHC
Attention projections use reference dequant+matmul for now.
Will re-enable Nvfp4Linear after validating MoE path.
The sink bias from the checkpoint is already in the scaled domain
(added to QK*scale in the reference softmax). The kernel's
running_max is max(QK*scale), so the sink should be compared
directly without multiplying by scale again.
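
A minimal sketch of the intended behavior, in plain Python rather than the
kernel. It assumes the sink acts as one extra logit that only contributes to
the softmax denominator. The function name and the example values are
hypothetical, not taken from the kernel.

```python
import math

def softmax_with_sink(qk, scale, sink):
    """One attention row; `sink` is assumed to already be in the scaled domain."""
    scaled = [s * scale for s in qk]
    # The running max is taken over QK*scale, so the sink is compared as-is
    # (no second multiplication by `scale`).
    m = max(max(scaled), sink)
    exps = [math.exp(s - m) for s in scaled]
    # The sink adds mass to the denominator only; it has no V row.
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]

weights = softmax_with_sink([8.0, 0.0], scale=0.125, sink=1.0)
```

Here the first scaled logit equals the sink (both 1.0), so the two take equal
shares of the denominator and the returned weights sum to less than 1.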
When N<128, padded KV positions have my_p_vals[col] uninitialized
for col >= kv_len. The PV GEMM then computes garbage_P × zero_V,
which can produce NaN on tensor cores when the garbage decodes to NaN or
Inf (NaN × 0 = NaN, Inf × 0 = NaN).
Fix: zero-initialize my_p_vals so padded positions contribute 0.
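
The failure mode can be reproduced on the host with ordinary IEEE-754
floats. This is only an illustration: NaN stands in for whatever bits the
uninitialized registers happened to hold, and the tile size and values are
made up.

```python
import math

kv_len, tile = 3, 8
v = [1.0, 2.0, 3.0] + [0.0] * (tile - kv_len)  # padded V rows are zero

# NaN stands in for uninitialized my_p_vals in the padded columns.
p_garbage = [0.2, 0.3, 0.5] + [math.nan] * (tile - kv_len)
# The fix: padded columns are zero-initialized.
p_zeroed = [0.2, 0.3, 0.5] + [0.0] * (tile - kv_len)

def dot(p, v):
    return sum(pi * vi for pi, vi in zip(p, v))

bad = dot(p_garbage, v)   # NaN * 0.0 is NaN, which poisons the whole sum
good = dot(p_zeroed, v)   # padded positions contribute exactly 0
```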
Build stacked (E, N, K) tensors incrementally on CPU, then move to GPU
in one shot. Avoids holding 384 individual expert weight+scale tensors
on GPU simultaneously (~3x memory savings per layer).
ROOT CAUSE: fmha_multitile_op.py padded N to 128 for TMA alignment
but then passed the PADDED N to the kernel as s_k (logical KV length).
This told the kernel all 128 entries were valid, so softmax ran over the
zero-padded scores, diluting the result (e.g. 1 valid entry with a score
near 0 → softmax weight ≈1/128 instead of 1).
FIX: Pass N_orig (true sequence length) as s_k for softmax masking,
and N_padded (physical size) only for TMA descriptor creation.
The kernel's existing col < kv_len guard correctly excludes padded
entries from row_max and exp_sum calculations.
Files changed:
- fmha_multitile_capi.cu: accept N_orig + N_padded, use N_orig for
params.s_k and N_padded for TMA descriptors
- fmha_multitile_op.py: pass N_orig and N_padded separately
- single_shot_inference.py: removed SDPA fallback (kernel now correct)
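
The effect of passing the padded length as s_k can be sketched in plain
Python. masked_softmax is a hypothetical stand-in for the kernel's
col < kv_len guard, and the single valid score is assumed to be 0.0 so the
dilution is exactly 1/128.

```python
import math

def masked_softmax(scores, kv_len):
    """Softmax over the first kv_len entries; the rest are masked to 0."""
    valid = scores[:kv_len]
    m = max(valid)
    exps = [math.exp(s - m) for s in valid]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(scores) - kv_len)

n_orig, n_padded = 1, 128
scores = [0.0] * n_padded  # one real score (0.0) followed by 127 zero pads

wrong = masked_softmax(scores, n_padded)  # padded length passed as s_k
right = masked_softmax(scores, n_orig)    # true length passed as s_k
```

With n_padded as s_k the valid entry gets weight 1/128. With n_orig it gets
weight 1 and the padded positions get 0.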
input_scale is the activation quantization scale (for FP8 inputs).
Since we use BF16 activations, the weight dequant is simply:
lut[weight] * weight_scale * weight_scale_2
Folding input_scale into the dequant produced weights ~4000x too small,
causing all attention and FFN outputs to be effectively zero.
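
A rough sketch of the dequant formula above in plain Python. The table
assumes the standard FP4 E2M1 value set with the sign in the top bit of the
4-bit code. For brevity weight_scale is a single scalar here, whereas a real
checkpoint would carry one scale per block of weights. The function name and
the example values are hypothetical.

```python
# FP4 E2M1 magnitudes; codes 8..15 are the negated values (sign bit set).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
LUT = E2M1 + [-x for x in E2M1]

def dequant(codes, weight_scale, weight_scale_2):
    # BF16 activations: input_scale is deliberately NOT folded in.
    return [LUT[c] * weight_scale * weight_scale_2 for c in codes]

w = dequant([2, 7, 10], weight_scale=0.5, weight_scale_2=0.25)
```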
The model uses DeepseekV4HyperHead to project from the 4-stream mHC
residual to the final hidden state. Just taking stream 0 (X[:,0,:])
is WRONG: the hc_head learns how to combine the 4 streams.
Also:
- Remove --no-thinking mode (this is a reasoning model, it MUST think)
- Increase default max_tokens from 512 to 4096
- Load hc_head weights (fn, base, scale) from checkpoint
Compares forward_layer output with step-by-step PyTorch reference
to identify where residual blowup originates. Uses our own NVFP4
dequant — no HF dependency.
Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection):
1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb]
   - Was applying Sinkhorn to post values and 2*sigmoid to comb values
   - This caused residual to grow unbounded (no doubly-stochastic constraint)
2. comb (B_l) must be TRANSPOSED in post_block
   - HF: comb.transpose(-1,-2) @ hidden_streams
   - Was using B_l @ X_l without transpose
3. Sinkhorn must start from softmax(logits) + eps, not exp(logits)
   - HF: softmax → col norm → (iters-1) alternating
   - Was using exp → alternating (different convergence behavior)
4. Missing hc_eps on pre (A_l)
   - HF: sigmoid(...) + hc_eps
   - Was missing the eps guard
5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout
   - Matches checkpoint naming and HF model
6. Fixed fallback mHC initialization to use new API
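
The corrected Sinkhorn order from item 3 can be sketched in plain Python as
follows. This is one reading of "softmax → col norm → (iters-1)
alternating", not the HF code. The softmax axis (rows), the row-then-column
order inside each alternating step, and the iters and eps defaults are all
assumptions.

```python
import math

def sinkhorn(logits, iters=20, eps=1e-6):
    # Start from a row-wise softmax plus eps, not from exp(logits).
    m = []
    for row in logits:
        mx = max(row)
        e = [math.exp(x - mx) for x in row]
        s = sum(e)
        m.append([x / s + eps for x in e])

    def col_norm(a):
        sums = [sum(col) for col in zip(*a)]
        return [[x / s for x, s in zip(row, sums)] for row in a]

    def row_norm(a):
        return [[x / sum(row) for x in row] for row in a]

    m = col_norm(m)                # first column normalization
    for _ in range(iters - 1):     # (iters-1) alternating row/col steps
        m = col_norm(row_norm(m))
    return m

logits = [[0.5, -1.0, 0.2, 0.0],
          [1.0, 0.3, -0.7, 0.1],
          [-0.2, 0.8, 0.4, -0.5],
          [0.0, 0.0, 1.0, -1.0]]
comb = sinkhorn(logits)
```

The result is approximately doubly stochastic (row and column sums close to
1), which is the constraint that item 1 says was lost when Sinkhorn was
applied to the wrong slot.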