MoE:
- Hash routing (first 3 layers): tid2eid lookup → 6 experts, uniform weights
- Dense routing (remaining): sqrt(softplus(gate)) → top-6 → renormalize
- 384 NVFP4 experts, each gate+up+down with SiGLU clamping
- Weighted combine × routed_scaling_factor + shared expert
KV cache:
- SimpleKVCache: BF16 flat (1, max_seq, hd) per layer
- Appends new K,V each decode step
- FMHA now attends over full cached sequence (not just current token)
- RoPE applied per-position on K cache
This should produce meaningful output — the model now has all
architectural components except proper mHC normalization.
Without routed experts, F_out is always positive, causing unbounded
growth. Emergency RMSNorm on the residual keeps values bounded.
Remove once MoE is wired.
- Loads all 95 shards, assigns layers round-robin across 8 B200s
- ~8 layers per GPU, ~118GB weights per GPU (fits in 183GB)
- 3-phase pipeline: load weights → JIT compile → inference
- Activations move between GPUs at layer boundaries (NVLink)
- No streaming, no shard caching, no per-layer CPU loads
- Includes timing for each phase
Without mHC, values explode to 761K after first layer.
Added per-residual RMSNorm + BF16 clamp to keep values bounded.
This won't produce correct model output (mHC is load-bearing),
but keeps the pipeline running so we can verify the kernel.
Streams weights one layer at a time from 95 safetensors shards.
NVFP4 dequant → BF16 matmul for baseline (production uses tcgen05 MMA).
Runs token-by-token decode loop with production FMHA kernel.
Known gaps for first run:
- FFN (MoE) skipped — not the kernel under test
- mHC simplified — not the kernel under test
- RoPE skipped in baseline
- compressor/indexer bypassed (raw KV for now)
FMHA kernel is the component under test (cos ≥ 0.999993).
The 6-warp multi-tile kernel already supports batch natively via
dim3 grid(1, n_h, batch). Removed Python for-loop for 4D input.
Single kernel launch per layer for batched decode instead of
batch_size launches.
T>1 prefill still uses per-batch dispatch (E8 future work).
- indexer/__init__.py: compute_index_scores_topk now calls
run_indexer_score_topk with proper tensor reshaping
- compressor/__init__.py: added torch import, fixed csa_compress_tail
and hca_compress_tail imports for flush.py
- Full flush pipeline now importable end-to-end
Both gather_kv.cu and gather_swa.cu are compiled into one .so.
Only gather_kv.cu defines the PYBIND11_MODULE; gather_swa.cu
just provides the function implementations.
docs/p7_tmem_column_layout.md: Verified that tcgen05.ld 32x32b.x8 is
the correct instruction for multi-row softmax. Each call reads 8 KV
positions for 32 rows. No instruction change needed from single-row.
test_p7_multi_row_softmax.py: Tests T=1,4,32,64,128 at various HD and N.
Gate: cos >= 0.999996.
cp.async.bulk.tensor store (SMEM→GMEM) is NOT available on SM100.
The CUTLASS SM100 epilogue uses st.global directly.
The one-way epilogue pipeline is now:
1. TMEM → regs (tcgen05.ld, warp-collective)
2. epilogue_op in regs (normalize, FP4 hook via ENABLE_FP4_EPILOGUE)
3. regs → SMEM (row-major, sO_epi)
4. SMEM → GMEM (direct write)
This is the same pattern as the MoE kernel but with st.global instead
of TMA store. Multi-CTA (D2) will use st.global with flat_divide coords.
Removed: tma_o from FmhaParams, fmha_multihead_decode_tma_launch,
sMbarStore from SMEM, broken TMA store PTX from fmha_tma.cuh.
TMA store: cp.async.bulk.tensor.2d.global.shared::cluster.mbarrier::complete_tx::bytes
Uses mbarrier for completion, not bulk_group. Restored sMbarStore to SMEM.