Critical bug: checkpoint weights are stored N-major as (N_packed, K_packed),
but make_b_k_major expects (E, K_packed, N_packed) input. Without the
permute, the K and N dimensions are swapped, producing garbage output
with the wrong width (e.g., the q_a output was 3584 wide instead of 1536).
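
The layout change can be sketched with nested lists standing in for tensors. This is only an illustration of the axis swap: the toy sizes and the helpers to_e_k_n and shape are hypothetical, and the real make_b_k_major call is not shown here.

```python
# Hypothetical toy sizes; real checkpoint shapes are much larger.
N, K = 3, 2

# Checkpoint layout: N rows, each holding K packed values (N-major).
w_nk = [[n * 10 + k for k in range(K)] for n in range(N)]


def to_e_k_n(w):
    """Add a leading expert axis of size 1 and swap the last two axes."""
    return [[list(col) for col in zip(*w)]]


def shape(t):
    """Return the shape of a rectangular nested list."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


w_ekn = to_e_k_n(w_nk)
print(shape(w_nk), "->", shape(w_ekn))
```

Skipping the swap leaves K where the consumer expects N, which matches the wrong-width symptom described above.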
Also fix scale assembly: checkpoint scales are (N, K_sf), so they should
go through assemble_raw_scales_2d3d_3d_side (no transpose), not
assemble_scales_3d_side (which incorrectly transposes K_sf↔N).
The CuTeDSL kernel expects float4_e2m1fn_x2 dtype for FP4 weight tensors,
but checkpoint weights from safetensors are loaded as uint8. Both dtypes
are one byte per element (each byte packs two FP4 values), so .view()
reinterprets the same bytes and is safe.
Fixed in:
- Nvfp4Linear.finalize_weights()
- Nvfp4SharedExpert.finalize_weights()
- Nvfp4MoE._ensure_stacked() (both stacked and legacy paths)
Critical bug fix: weight_scale_2 (the second-level NVFP4 scale) was
being dropped entirely in the production pipeline. The dequant formula
is lut[w] * weight_scale * weight_scale_2, so weight_scale_2 must be
folded into the GEMM's global_scale_b parameter.
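
A minimal sketch of why the fold is valid, assuming a single scalar weight_scale for brevity (real NVFP4 uses one scale per block) and hypothetical codes and activations. The E2M1 value table is the standard FP4 one. Because weight_scale_2 is a scalar, it can be pulled out of the dot product and applied once, as a global scale would be.

```python
# FP4 E2M1 magnitudes; codes 8..15 have the sign bit set.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
LUT = E2M1 + [-v for v in E2M1]


def dequant(codes, weight_scale, weight_scale_2):
    """lut[w] * weight_scale * weight_scale_2 for each 4-bit code."""
    return [LUT[c] * weight_scale * weight_scale_2 for c in codes]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


codes = [1, 7, 10, 4]          # hypothetical 4-bit weight codes
acts = [1.0, -2.0, 0.5, 4.0]   # hypothetical BF16 activations
weight_scale = 0.25
weight_scale_2 = 0.5

# Reference: apply both scales to every weight before the matmul.
reference = dot(acts, dequant(codes, weight_scale, weight_scale_2))

# Folded: the matmul sees only weight_scale; weight_scale_2 is applied
# once to the result, as a global scale would be.
folded = weight_scale_2 * dot(acts, dequant(codes, weight_scale, 1.0))

# The bug: weight_scale_2 is never applied at all.
dropped = dot(acts, dequant(codes, weight_scale, 1.0))

print(reference, folded, dropped)
```

The dropped result is off by exactly a factor of weight_scale_2, which is why every output of the affected layers was mis-scaled.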
Fixes in:
- Nvfp4Linear: ws2 field, folded in finalize_weights()
- Nvfp4MoE: l1_ws2/l2_ws2 lists, folded in _ensure_stacked()
- Nvfp4SharedExpert: l1_ws2/l2_ws2 lists, folded in finalize_weights()
- single_shot_inference.py: pass weight_scale_2 through all loading paths
- Also fix missing o_a_prod key fallback in attention output
Nvfp4Linear is causing CUDA context corruption (likely from the CuTeDSL
JIT triggered by _ensure_initialized). Disable it for now to validate
the critical paths first:
- Production FMHA with sink bias
- Production MoE (Nvfp4MoE + Nvfp4SharedExpert)
- Production Router (dense/hash)
- Production mHC
Attention projections use the reference dequant+matmul for now.
Nvfp4Linear will be re-enabled after the MoE path is validated.
The sink bias from the checkpoint is already in the scaled domain
(added to QK*scale in the reference softmax). The kernel's
running_max is max(QK*scale), so the sink should be compared against
running_max directly, without multiplying it by scale again.
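
A small sketch of the difference, assuming the sink acts as one extra logit in the softmax denominator. The function names and input values are hypothetical, and this is plain Python rather than the kernel's code.

```python
import math


def reference_probs(qk, scale, sink):
    """Softmax over [QK*scale..., sink]; the sink column is dropped."""
    logits = [s * scale for s in qk] + [sink]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps[:-1]]


def kernel_style_probs(qk, scale, sink, rescale_sink=False):
    """Running-max formulation; rescale_sink=True reproduces the bug."""
    scaled = [s * scale for s in qk]
    running_max = max(scaled)
    sink_logit = sink * scale if rescale_sink else sink
    m = max(running_max, sink_logit)
    exp_sum = sum(math.exp(x - m) for x in scaled) + math.exp(sink_logit - m)
    return [math.exp(x - m) / exp_sum for x in scaled]


qk = [8.0, 4.0, -2.0]   # hypothetical raw QK scores
scale, sink = 0.125, 2.0

ref = reference_probs(qk, scale, sink)
fixed = kernel_style_probs(qk, scale, sink)
buggy = kernel_style_probs(qk, scale, sink, rescale_sink=True)
print(ref[0], fixed[0], buggy[0])
```

With the sink rescaled, it shrinks relative to the real scores and absorbs too little probability mass, so the attention weights come out too large.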
When N<128, my_p_vals[col] is left uninitialized at the padded KV
positions (col >= kv_len). The PV GEMM then computes garbage_P × zero_V,
which yields NaN whenever the garbage is NaN or Inf (NaN × 0 = NaN).
Fix: zero-initialize my_p_vals so padded positions contribute 0.
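
The failure mode follows from IEEE 754 arithmetic and can be reproduced in plain Python. The tile size and values below are hypothetical stand-ins for one row of P and one column of V.

```python
import math


def pv_dot(p_row, v_col):
    """One output element of the PV GEMM: sum over KV positions."""
    return sum(p * v for p, v in zip(p_row, v_col))


kv_len, tile = 2, 4
v_col = [0.5, 0.25, 0.0, 0.0]  # padded V rows are zero

# Uninitialized P: the padded slots hold whatever was in registers.
garbage_p = [0.6, 0.4, float("nan"), float("inf")]

# Fixed P: zero-initialize first, then write only the valid columns.
zeroed_p = [0.0] * tile
zeroed_p[:kv_len] = [0.6, 0.4]

print(math.isnan(pv_dot(garbage_p, v_col)))
print(pv_dot(zeroed_p, v_col))
```

Multiplying by zero does not neutralize NaN or Inf, so the zero padding in V alone cannot protect the accumulator.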
Build stacked (E, N, K) tensors incrementally on CPU, then move to GPU
in one shot. Avoids holding 384 individual expert weight+scale tensors
on GPU simultaneously (~3x memory savings per layer).
ROOT CAUSE: fmha_multitile_op.py padded N to 128 for TMA alignment
but then passed the PADDED N to the kernel as s_k (logical KV length).
This told the kernel all 128 entries were valid, so softmax ran over
the zero padding as well, diluting the result (e.g. 1 valid entry with
a score near 0 → softmax weight ≈ 1/128 instead of 1).
FIX: Pass N_orig (true sequence length) as s_k for softmax masking,
and N_padded (physical size) only for TMA descriptor creation.
The kernel's existing col < kv_len guard correctly excludes padded
entries from row_max and exp_sum calculations.
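
The dilution can be reproduced with a masked softmax in plain Python. This is a sketch, assuming a single valid score of 0. masked_softmax is a hypothetical helper that mirrors the col < kv_len guard; it is not the kernel's code.

```python
import math


def masked_softmax(scores, kv_len):
    """Softmax over the first kv_len entries; the rest get weight 0."""
    valid = scores[:kv_len]
    row_max = max(valid)
    exps = [math.exp(s - row_max) for s in valid]
    exp_sum = sum(exps)
    return [e / exp_sum for e in exps] + [0.0] * (len(scores) - kv_len)


n_orig, n_padded = 1, 128
scores = [0.0] * n_padded  # one valid score of 0, then zero padding

buggy = masked_softmax(scores, n_padded)  # padded N passed as s_k
fixed = masked_softmax(scores, n_orig)    # true length passed as s_k

print(buggy[0], fixed[0])
```

The physical buffer is identical in both calls; only the logical length handed to the mask changes.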
Files changed:
- fmha_multitile_capi.cu: accept N_orig + N_padded, use N_orig for
params.s_k and N_padded for TMA descriptors
- fmha_multitile_op.py: pass N_orig and N_padded separately
- single_shot_inference.py: removed SDPA fallback (kernel now correct)
input_scale is the activation quantization scale (for FP8 inputs).
Since we use BF16 activations, the weight dequant is simply:
lut[weight] * weight_scale * weight_scale_2
Folding input_scale into the dequant produced weights ~4000x too small,
causing all attention and FFN outputs to be effectively zero.