Commit Graph

148 Commits

Author SHA1 Message Date
71deeb91a9 Quantize BF16 gate weight to NVFP4 for fused router + add global scales to GEMM
CRITICAL: Checkpoint stores gate weights as BF16, not NVFP4.
Previous code fell back to BF16 cuBLAS because weight_scale was missing.
Now we quantize the BF16 gate weight to NVFP4 at load time using
quantize_to_nvfp4() and pass the result to the fused router kernel.

Also added global scale (gsa, gsb) parameters to the kernel:
- gsa (activation global scale) applied during activation quantization
- gsb (weight global scale) applied in epilogue before sqrt(softplus)
- The MMA output is (A * SFA) @ (B * SFB), missing gsa*gsb
- Epilogue now computes sqrt(softplus(logit * gsa * gsb))
  instead of sqrt(softplus(logit))
2026-06-01 10:14:29 +00:00
31ebe4f2db Wire NVFP4 fused router kernel into e2e single-shot pipeline
- Add dense_router_dispatch_nvfp4_fused() in dense_router_decode.py:
  single-kernel NVFP4 blockscaled GEMM + fused router epilogue
- Router.load_nvfp4_fused_gate(): stores raw NVFP4 tensors for fused path
- Router._run_dense_impl() dispatch priority: fused > 2-kernel > BF16
- single_shot_inference.py: loads raw NVFP4 gate weights for fused kernel
  instead of building Nvfp4Linear (which was the 2-kernel path)
- Fix selection sort bug in nvfp4_fused_router_kernel.py: pass 0 was
  missing t_s/t_i/t_a temp save before swap, causing undefined vars
- Export dense_router_dispatch_nvfp4_fused from __init__.py
2026-06-01 09:47:48 +00:00
6c28c57b6a feat: Nvfp4GroupedLinear for o_a_proj (replaces BF16 grouped BMM)
The attention output projection first half (wo_a) was using BF16
grouped BMM (torch.bmm). Now uses production Nvfp4GroupedLinear
which performs the same grouped GEMM with NVFP4 tensor-core
acceleration on Blackwell.

The weight is loaded from NVFP4 checkpoint if available, otherwise
quantized from BF16 via set_bf16_weight().

Also includes:
- NVFP4 gate projection for router (from previous commit)
- Compressor position_bias in CUDA kernel (from earlier fix)
2026-06-01 06:00:36 +00:00
cf2b7ab7ec feat: NVFP4 gate projection for router (replaces BF16 cuBLAS)
The dense router now uses NVFP4 GEMM via Nvfp4Linear for the gate
projection when NVFP4 scales are available in the checkpoint. This
replaces the BF16 cuBLAS GEMM with Blackwell SM100 tensor-core
NVFP4 acceleration.

Changes:
- dsv4/layers/router.py: add gate_lin (Nvfp4Linear) alongside W_gate
  fallback. New load_nvfp4_gate() method.
- dsv4/kernels/router/dense_router_decode.py: add
  dense_router_dispatch_nvfp4() using Nvfp4Linear + activation_topk
- dsv4/kernels/router/__init__.py: export new function
- single_shot_inference.py: load NVFP4 gate weights when available,
  fall back to BF16 when not
2026-06-01 05:58:56 +00:00
84ca520bfb fix: move compressor position_bias into CUDA kernel (was Python loop)
The compressor_reduce.cu kernel now adds position_bias to BOTH kv and
gate values, matching the PyTorch reference. Previously the kernel only
added it to gate, and a Python workaround loop was adding it to both
before the kernel call (then passing None to the kernel).

Changes:
- compressor_reduce.cu: add position_bias to kv_val in pass 2 (CSA + HCA)
- single_shot_inference.py: remove Python position_bias loop, pass
  self.ape directly to csa/hca_compress_production
- production_compress.py: already supports position_bias passthrough
2026-06-01 05:54:44 +00:00
311fae490f tune: reduce verbose diagnostics, print every decode step 2026-06-01 05:40:48 +00:00
b380028c49 feat: production compressor/indexer — NVFP4 GEMM + CUDA softmax/reduce kernel
- New compressor_reduce.cu: CSA/HCA token-level softmax + weighted sum + kv_norm
  One block per compressed entry, 128 threads, FP32 accumulation
  CSA: overlapping Ca/Cb streams (2m tokens per block)
  HCA: single stream (m tokens per block)
  Includes apply_kv_norm kernel (unweighted RMSNorm + weight)

- New production_compress.py: Python wrapper for CUDA kernels

- single_shot_inference.py: Compressor/Indexer now use production Nvfp4Linear
  for kv_proj, gate_proj, q_b_proj, weights_proj projections
  Then CUDA reduce kernel for softmax + weighted sum
  No more PyTorch reference nvfp4_linear_ref in compressor/indexer path
2026-06-01 05:18:59 +00:00
2cbc7459b0 diag: fix SE scale print (cast to float first) 2026-06-01 04:14:47 +00:00
bcd7a0cf0d diag: check SE weight and scale integrity for first 3 layers 2026-06-01 04:08:21 +00:00
3dd95ce77b fix: set activation global scales AFTER _ensure_stacked/_ensure_initialized (which override them) 2026-06-01 03:43:09 +00:00
27c63b01d6 diag: remove broken SE reference comparison, add gsa/gsb print 2026-06-01 03:31:36 +00:00
9a27ed21fd diag: compare shared expert output with PyTorch reference 2026-06-01 03:25:21 +00:00
ee8318ad58 diag: handle NaN in shared expert output print 2026-06-01 03:16:25 +00:00
7000762309 diag: fix SE weight attribute name 2026-06-01 03:09:11 +00:00
fba1c06cad diag: check SE weight integrity 2026-06-01 03:02:44 +00:00
22d7cc9b7a diag: cuda sync check after shared expert for first 3 layers 2026-06-01 02:56:28 +00:00
b85fcf4d6f diag: print SE global scales for first 3 layers 2026-06-01 02:49:55 +00:00
48d93a6d2e diag: MoE input/output diagnostics for first 3 layers 2026-06-01 02:41:12 +00:00
856a459a98 fix: init l1_gsa_list and l2_gsa_list 2026-06-01 02:34:21 +00:00
66b98e5794 fix: MoE and shared expert global scale — gsb=ws2, gsa=input_scale (same bug as Nvfp4Linear) 2026-06-01 02:31:12 +00:00
f4b444b456 fix: NVFP4 global scale bug — gsb=weight_scale_2 (not input_scale*ws2), gsa=input_scale 2026-06-01 02:19:35 +00:00
1eed28dd09 diag: compare production FMHA and NVFP4 linear output with PyTorch reference 2026-06-01 02:12:39 +00:00
df394f8b40 fix: missing closing quote on string literal 2026-06-01 02:02:14 +00:00
cfd2468c61 fix: decode loop also needs int32 token_ids for hash router 2026-06-01 01:58:45 +00:00
905623793b fix: move token_ids to same GPU as router (was cuda:0 but router on cuda:N) 2026-06-01 01:49:40 +00:00
7804b779ce diag: print wo_a g_flat magnitude to find where zeros come from 2026-06-01 01:40:53 +00:00
efe63caea9 diag: print FMHA output magnitude for first 3 layers 2026-06-01 01:34:02 +00:00
7fbbdc5204 diag: validate router output before MoE 2026-06-01 01:27:16 +00:00
f5fa84016e diag: sync+error check after each layer on first token 2026-06-01 01:26:50 +00:00
91b3929605 fix: call moe_runner.run() and se_runner.run() (not __call__) 2026-06-01 01:14:38 +00:00
03c45d4bfb fix: pass int32 token_ids to hash router (was int64) 2026-06-01 01:08:03 +00:00
b7a8c44d26 single_shot: eager MoE/SE weight processing, stale GPU cleanup, --prefill-tokens flag 2026-06-01 00:42:08 +00:00
15f45b57c3 fix: correct Nvfp4Linear dimension inference from checkpoint weights
Weight shape (N_packed, K_packed) means:
- out_features = N_packed (GEMM output dim in BF16)
- in_features = K_packed * 2 (BF16 input dim, for activation buffer)
2026-06-01 00:32:36 +00:00
e671780008 fix: transpose checkpoint weights before make_b_k_major in Nvfp4Linear/SharedExpert
Critical bug: checkpoint weights are (N_packed, K_packed) N-major format,
but make_b_k_major expects (E, K_packed, N_packed) input. Without the
permute, the K and N dimensions are swapped, producing garbage output
with wrong dimensions (e.g., q_a output was 3584 instead of 1536).

Also fix scale assembly: checkpoint scales are (N, K_sf) which should
use assemble_raw_scales_2d3d_3d_side (no transpose), not
assemble_scales_3d_side (which incorrectly transposes K_sf↔N).
2026-06-01 00:30:37 +00:00
172448514c fix: fold weight_scale_2 into global_scale_b for NVFP4 GEMM
Critical bug fix: weight_scale_2 (the second-level NVFP4 scale) was
being dropped entirely in the production pipeline. The dequant formula
is lut[w] * weight_scale * weight_scale_2, so weight_scale_2 must be
folded into the GEMM's global_scale_b parameter.

Fixes in:
- Nvfp4Linear: ws2 field, folded in finalize_weights()
- Nvfp4MoE: l1_ws2/l2_ws2 lists, folded in _ensure_stacked()
- Nvfp4SharedExpert: l1_ws2/l2_ws2 lists, folded in finalize_weights()
- single_shot_inference.py: pass weight_scale_2 through all loading paths
- Also fix missing o_a_prod key fallback in attention output
2026-06-01 00:10:50 +00:00
a6a8755439 single_shot: switch to head-packed FMHA dispatch (1 kernel launch vs 128) 2026-05-31 23:33:32 +00:00
80002f2efc single_shot: production NVFP4 GEMM for ALL attention projections
- Nvfp4Linear (CuTeDSL) for q_a, q_b, kv, o_b — NO more dequant+matmul
- Production FMHA (6-warp TMA multi-tile) with per-head sink bias
- Production MoE + Router + SharedExpert + mHC (unchanged)
- wo_a still uses BF16 grouped BMM (checkpoint is BF16)
- Compressor/Indexer still PyTorch ref (not yet on tensor cores)
- Proper weight dimensions: q_a(7168->1536), q_b(1536->65536), kv(7168->512), o_b(16384->7168)
2026-05-31 23:28:16 +00:00
32efd5139d Fix gate weight transpose: checkpoint is (E, H), Router expects (H, E) 2026-05-31 23:21:09 +00:00
e45c0ff51b single_shot: use reference dequant for attn projections, focus on MoE+FMHA
Nvfp4Linear causing CUDA context corruption (likely CuTeDSL JIT
triggered by _ensure_initialized). Disable for now to validate
the critical paths first:
- Production FMHA with sink bias
- Production MoE (Nvfp4MoE + Nvfp4SharedExpert)
- Production Router (dense/hash)
- Production mHC

Attention projections use reference dequant+matmul for now.
Will re-enable Nvfp4Linear after validating MoE path.
2026-05-31 23:20:04 +00:00
dfbffa1df1 single_shot: CUDA_LAUNCH_BLOCKING for debugging 2026-05-31 23:18:35 +00:00
a66fdf6049 single_shot: add sync to catch CUDA errors early 2026-05-31 23:17:46 +00:00
0b35c36d23 single_shot: memory-efficient MoE loading, lazy Nvfp4Linear init
- MoE expert weights loaded per-expert to GPU (no huge CPU tensors)
- Nvfp4Linear finalize_weights deferred (lazy on first forward)
- Shared expert weights loaded directly to GPU
- Added GPU cache cleanup at start
- Fixed shared expert finalize_weights (now lazy)
2026-05-31 23:16:45 +00:00
050b5ee449 Fix n_h reference before assignment in single_shot 2026-05-31 23:14:24 +00:00
13be3ad443 FMHA sink bias in kernel + single_shot production rewrite
FMHA kernel (fmha_6warp_tma_multirow_multitile.cuh):
- Added sink_bias field to FmhaTmaMultiRowMultiTileParams
- After KV tile loop, sink logit is included in online softmax rescale:
  new_max = max(running_max, sink_bias * scale)
  rescale existing O_unnorm and running_sum
  running_sum += exp(sink_bias * scale - new_max)
  No PV contribution from sink (D5c: single softmax)
- C API: fmha_multitile_decode_launch now takes sink_bias_ptr
- Python: fmha_multitile_decode_raw accepts attn_sink tensor

single_shot_inference.py:
- Full rewrite to use production kernel stack
- mHC: uses dsv4.layers.mhc.mHCLayer (proper Sinkhorn-Knopp)
- Projections: uses Nvfp4Linear (CuTeDSL GEMM) for q_a, q_b, kv, o_b
- FMHA: 6-warp TMA multi-tile with sink bias (no SDPA fallback)
- MoE: Nvfp4MoE + Nvfp4SharedExpert (no reference fallback)
- Router: production dense/hash dispatch
- Compressor/Indexer: reference dequant (not yet on tensor cores)
- NO try/except fallbacks on production paths
2026-05-31 23:10:13 +00:00
23e88638aa single_shot: memory-efficient MoE loading (CPU stacking, one-shot GPU transfer)
Build stacked (E, N, K) tensors incrementally on CPU, then move to GPU
in one shot. Avoids holding 384 individual expert weight+scale tensors
on GPU simultaneously (~3x memory savings per layer).
2026-05-31 22:55:11 +00:00
92200367f3 FMHA kernel fix: N_orig vs N_padded — correct softmax masking for seq_len < 128
ROOT CAUSE: fmha_multitile_op.py padded N to 128 for TMA alignment
but then passed the PADDED N to the kernel as s_k (logical KV length).
This told the kernel all 128 entries were valid, so softmax ran over
zeros, diluting the result (e.g. 1 valid entry → softmax weight 1/128).

FIX: Pass N_orig (true sequence length) as s_k for softmax masking,
and N_padded (physical size) only for TMA descriptor creation.
The kernel's existing col < kv_len guard correctly excludes padded
entries from row_max and exp_sum calculations.

Files changed:
- fmha_multitile_capi.cu: accept N_orig + N_padded, use N_orig for
  params.s_k and N_padded for TMA descriptors
- fmha_multitile_op.py: pass N_orig and N_padded separately
- single_shot_inference.py: removed SDPA fallback (kernel now correct)
2026-05-31 22:52:39 +00:00
d40821c843 single_shot: fix memory (no double-loading MoE weights), FMHA short-seq fallback
- Don't cache MoE/SE expert weights in layer_w (handled by runners)
  This saves ~10.6GB/layer × 61 = ~647GB of double-loaded GPU memory
- Add FMHA fallback for seq_len < 128 (known kernel limitation:
  zero-padding dilutes softmax). TODO: fix kernel to mask padded entries.
- Free all_w and empty GPU caches after building runners
2026-05-31 22:49:15 +00:00
91568e12d4 single_shot_inference.py: production kernel stack version
- FMHA: 6-warp TMA multi-tile kernel via dsv4_attention
- MoE: Nvfp4MoE (CuTeDSL NVFP4 grouped GEMM, fused SwiGLU)
- Shared expert: Nvfp4SharedExpert (CuTeDSL NVFP4 single-group GEMM)
- Router: production dense/hash router kernels
- Compressor: CSA/HCA token-level softmax
- Indexer: score+topk
- mHC: Sinkhorn-Knopp, B_l transposed, [pre,post,comb]
- No PyTorch SDPA, no F.linear for kernel paths
- Falls back to dequant BF16 only if production kernels fail
- FP32 RoPE cache (BF16 destroys cos²+sin²=1)
2026-05-31 22:45:44 +00:00
fb96c34b89 rename: single_shot_inference.py → single_shot_PYTORCH_REFERENCE.py 2026-05-31 22:42:06 +00:00
acc20dffd7 CRITICAL FIX: don't fold input_scale into NVFP4 weight dequant
input_scale is the activation quantization scale (for FP8 inputs).
Since we use BF16 activations, the weight dequant is simply:
  lut[weight] * weight_scale * weight_scale_2

Folding input_scale in produced weights ~4000x too small,
causing all attention and FFN outputs to be effectively zero.
2026-05-31 22:03:55 +00:00