Critical bug: checkpoint weights are (N_packed, K_packed) N-major format,
but make_b_k_major expects (E, K_packed, N_packed) input. Without the
permute, the K and N dimensions are swapped, producing garbage output
with wrong dimensions (e.g., q_a output was 3584 instead of 1536).
Also fix scale assembly: checkpoint scales are (N, K_sf) which should
use assemble_raw_scales_2d3d_3d_side (no transpose), not
assemble_scales_3d_side (which incorrectly transposes K_sf↔N).
The CuTeDSL kernel expects float4_e2m1fn_x2 dtype for FP4 weight tensors,
but checkpoint weights from safetensors are loaded as uint8. The uint8 and
float4_e2m1fn_x2 have the same byte representation, so .view() is safe.
Fixed in:
- Nvfp4Linear.finalize_weights()
- Nvfp4SharedExpert.finalize_weights()
- Nvfp4MoE._ensure_stacked() (both stacked and legacy paths)
Critical bug fix: weight_scale_2 (the second-level NVFP4 scale) was
being dropped entirely in the production pipeline. The dequant formula
is lut[w] * weight_scale * weight_scale_2, so weight_scale_2 must be
folded into the GEMM's global_scale_b parameter.
Fixes in:
- Nvfp4Linear: ws2 field, folded in finalize_weights()
- Nvfp4MoE: l1_ws2/l2_ws2 lists, folded in _ensure_stacked()
- Nvfp4SharedExpert: l1_ws2/l2_ws2 lists, folded in finalize_weights()
- single_shot_inference.py: pass weight_scale_2 through all loading paths
- Also fix missing o_a_prod key fallback in attention output
The sink bias from the checkpoint is already in the scaled domain
(added to QK*scale in the reference softmax). The kernel's
running_max is max(QK*scale), so the sink should be compared
directly without multiplying by scale again.
When N<128, padded KV positions have my_p_vals[col] uninitialized
for col >= kv_len. The PV GEMM then computes garbage_P × zero_V,
which can produce NaN on tensor cores (0 × NaN = NaN).
Fix: zero-initialize my_p_vals so padded positions contribute 0.
ROOT CAUSE: fmha_multitile_op.py padded N to 128 for TMA alignment
but then passed the PADDED N to the kernel as s_k (logical KV length).
This told the kernel all 128 entries were valid, so softmax ran over
zeros, diluting the result (e.g. 1 valid entry → softmax weight 1/128).
FIX: Pass N_orig (true sequence length) as s_k for softmax masking,
and N_padded (physical size) only for TMA descriptor creation.
The kernel's existing col < kv_len guard correctly excludes padded
entries from row_max and exp_sum calculations.
Files changed:
- fmha_multitile_capi.cu: accept N_orig + N_padded, use N_orig for
params.s_k and N_padded for TMA descriptors
- fmha_multitile_op.py: pass N_orig and N_padded separately
- single_shot_inference.py: removed SDPA fallback (kernel now correct)
Bugs fixed (verified against HuggingFace DeepseekV4HyperConnection):
1. fn/base/scale ordering was [pre,comb,post], should be [pre,post,comb]
- Was applying Sinkhorn to post values and 2*sigmoid to comb values
- This caused residual to grow unbounded (no doubly-stochastic constraint)
2. comb (B_l) must be TRANSPOSED in post_block
- HF: comb.transpose(-1,-2) @ hidden_streams
- Was using B_l @ X_l without transpose
3. Sinkhorn must start from softmax(logits) + eps, not exp(logits)
- HF: softmax → col norm → (iters-1) alternating
- Was using exp → alternating (different convergence behavior)
4. Missing hc_eps on pre (A_l)
- HF: sigmoid(...) + hc_eps
- Was missing the eps guard
5. Renamed W_res→W_comb, S_res→S_comb, alpha_res→alpha_comb throughout
- Matches checkpoint naming and HF model
6. Fixed fallback mHC initialization to use new API
The 6-warp multi-tile kernel already supports batch natively via
dim3 grid(1, n_h, batch). Removed Python for-loop for 4D input.
Single kernel launch per layer for batched decode instead of
batch_size launches.
T>1 prefill still uses per-batch dispatch (E8 future work).
- indexer/__init__.py: compute_index_scores_topk now calls
run_indexer_score_topk with proper tensor reshaping
- compressor/__init__.py: added torch import, fixed csa_compress_tail
and hca_compress_tail imports for flush.py
- Full flush pipeline now importable end-to-end
Both gather_kv.cu and gather_swa.cu are compiled into one .so.
Only gather_kv.cu defines the PYBIND11_MODULE; gather_swa.cu
just provides the function implementations.
cp.async.bulk.tensor store (SMEM→GMEM) is NOT available on SM100.
The CUTLASS SM100 epilogue uses st.global directly.
The one-way epilogue pipeline is now:
1. TMEM → regs (tcgen05.ld, warp-collective)
2. epilogue_op in regs (normalize, FP4 hook via ENABLE_FP4_EPILOGUE)
3. regs → SMEM (row-major, sO_epi)
4. SMEM → GMEM (direct write)
This is the same pattern as the MoE kernel but with st.global instead
of TMA store. Multi-CTA (D2) will use st.global with flat_divide coords.
Removed: tma_o from FmhaParams, fmha_multihead_decode_tma_launch,
sMbarStore from SMEM, broken TMA store PTX from fmha_tma.cuh.
TMA store: cp.async.bulk.tensor.2d.global.shared::cluster.mbarrier::complete_tx::bytes
Uses mbarrier for completion, not bulk_group. Restored sMbarStore to SMEM.
TMA store uses cp.async.bulk.tensor.2d.global.shared::cta.tile.bulk_group
NOT mbarrier::complete_tx::bytes. Completion tracked via:
- cp.async.bulk.commit_group (after issuing stores)
- cp.async.bulk.wait_group.read 0 (wait for all groups)
Removed sMbarStore from SMEM allocations (no longer needed).