Commit Graph

2039 Commits

Author SHA1 Message Date
cf2b7ab7ec feat: NVFP4 gate projection for router (replaces BF16 cuBLAS)
The dense router now uses NVFP4 GEMM via Nvfp4Linear for the gate
projection when NVFP4 scales are available in the checkpoint. This
replaces the BF16 cuBLAS GEMM with Blackwell SM100 tensor-core
NVFP4 acceleration.

Changes:
- dsv4/layers/router.py: add gate_lin (Nvfp4Linear) alongside W_gate
  fallback. New load_nvfp4_gate() method.
- dsv4/kernels/router/dense_router_decode.py: add
  dense_router_dispatch_nvfp4() using Nvfp4Linear + activation_topk
- dsv4/kernels/router/__init__.py: export new function
- single_shot_inference.py: load NVFP4 gate weights when available,
  fall back to BF16 when not
2026-06-01 05:58:56 +00:00
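
The fallback described in this commit (use the NVFP4 gate when its scales are present in the checkpoint, otherwise keep the BF16 path) could be sketched roughly as follows. The function name `select_gate_path`, the key prefix, and the exact scale key names are assumptions made for illustration; the repository's actual checkpoint layout is not shown in this log.

```python
def select_gate_path(checkpoint, prefix):
    """Return "nvfp4" if all assumed NVFP4 scale tensors exist, else "bf16".

    `checkpoint` is any mapping from tensor name to tensor.
    The key suffixes below are hypothetical and only illustrate the check.
    """
    scale_keys = [
        f"{prefix}.weight_scale",    # assumed per-block scales
        f"{prefix}.weight_scale_2",  # assumed global weight scale
        f"{prefix}.input_scale",     # assumed activation global scale
    ]
    if all(key in checkpoint for key in scale_keys):
        return "nvfp4"
    return "bf16"


# Usage: a checkpoint with only a plain weight stays on the BF16 path.
bf16_only = {"router.gate.weight": object()}
quantized = {
    "router.gate.weight": object(),
    "router.gate.weight_scale": object(),
    "router.gate.weight_scale_2": object(),
    "router.gate.input_scale": object(),
}
print(select_gate_path(bf16_only, "router.gate"))
print(select_gate_path(quantized, "router.gate"))
```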
9f14cb17d1 test: add compressor position_bias unit test
Verifies CUDA kernel matches PyTorch reference with and without
position_bias for both CSA (m=4) and HCA (m=128) paths.
2026-06-01 05:55:05 +00:00
84ca520bfb fix: move compressor position_bias into CUDA kernel (was Python loop)
The compressor_reduce.cu kernel now adds position_bias to BOTH kv and
gate values, matching the PyTorch reference. Previously the kernel only
added it to gate, and a Python workaround loop was adding it to both
before the kernel call (then passing None to the kernel).

Changes:
- compressor_reduce.cu: add position_bias to kv_val in pass 2 (CSA + HCA)
- single_shot_inference.py: remove Python position_bias loop, pass
  self.ape directly to csa/hca_compress_production
- production_compress.py: already supports position_bias passthrough
2026-06-01 05:54:44 +00:00
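
As a rough illustration of the behavior this fix describes, the pure-Python sketch below adds position_bias to both the kv and gate values before a token-level softmax and weighted sum. It is not the repository's PyTorch reference: the function name `compress_window`, the `[tokens][channels]` layout, the per-channel softmax, and a bias with the same shape as kv are all assumptions.

```python
import math


def compress_window(kv, gate, position_bias=None):
    """Softmax over tokens of `gate` (per channel), then weighted sum of `kv`.

    All inputs are nested lists shaped [tokens][channels] (assumed layout).
    When given, position_bias is added to BOTH kv and gate, as in the fix.
    """
    n_tok, n_ch = len(kv), len(kv[0])
    if position_bias is not None:
        kv = [[kv[t][c] + position_bias[t][c] for c in range(n_ch)]
              for t in range(n_tok)]
        gate = [[gate[t][c] + position_bias[t][c] for c in range(n_ch)]
                for t in range(n_tok)]
    out = []
    for c in range(n_ch):
        # Subtract the max before exponentiating for numerical stability.
        g_max = max(gate[t][c] for t in range(n_tok))
        weights = [math.exp(gate[t][c] - g_max) for t in range(n_tok)]
        total = sum(weights)
        out.append(sum(weights[t] * kv[t][c] for t in range(n_tok)) / total)
    return out


# Usage: two tokens, one channel, equal gates, so the result is a plain mean.
kv = [[1.0], [3.0]]
gate = [[0.0], [0.0]]
print(compress_window(kv, gate))
print(compress_window(kv, gate, position_bias=[[1.0], [1.0]]))
```

With the bias applied only to gate (the pre-fix kernel behavior described above), the second call would return the same value as the first, which is the kind of mismatch a unit test against the reference would catch.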
311fae490f tune: reduce verbose diagnostics, print every decode step (tag: v-e2e-paris-32tok-20260601-0549) 2026-06-01 05:40:48 +00:00
df8acae66b fix: rewrite compressor_reduce.cu — no extern shared mem, proper bounds checks (tag: v-single-shot-paris-20260601-0539) 2026-06-01 05:24:18 +00:00
62041b78bf fix: import torch.utils.cpp_extension explicitly in production_compress 2026-06-01 05:20:44 +00:00
2155fd6c90 test: production compressor kernel unit test 2026-06-01 05:19:13 +00:00
b380028c49 feat: production compressor/indexer — NVFP4 GEMM + CUDA softmax/reduce kernel
- New compressor_reduce.cu: CSA/HCA token-level softmax + weighted sum + kv_norm
  One block per compressed entry, 128 threads, FP32 accumulation
  CSA: overlapping Ca/Cb streams (2m tokens per block)
  HCA: single stream (m tokens per block)
  Includes apply_kv_norm kernel (unweighted RMSNorm + weight)

- New production_compress.py: Python wrapper for CUDA kernels

- single_shot_inference.py: Compressor/Indexer now use production Nvfp4Linear
  for kv_proj, gate_proj, q_b_proj, weights_proj projections
  Then CUDA reduce kernel for softmax + weighted sum
  No more PyTorch reference nvfp4_linear_ref in compressor/indexer path
2026-06-01 05:18:59 +00:00
6e53e3007c fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow (tag: v-working-e2e-20260601-0515) 2026-06-01 04:59:06 +00:00
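
A minimal sketch of the clamp this commit describes, in pure Python rather than the repository's tensor code. The largest finite FP8 E4M3 value is 448, so a block maximum above that cannot be represented once cast. The helper name `clamped_block_amax` and the block size of 16 are assumptions for illustration, not taken from `quantize_activation_nvfp4` itself.

```python
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
BLOCK_SIZE = 16   # assumed NVFP4 block size


def clamped_block_amax(values, block_size=BLOCK_SIZE, limit=E4M3_MAX):
    """Return the per-block max |x|, clamped so it never exceeds `limit`."""
    amaxes = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        amaxes.append(min(amax, limit))
    return amaxes


# Usage: the second block contains an outlier that would overflow E4M3.
activations = [0.5] * 16 + [1000.0] + [0.0] * 15
print(clamped_block_amax(activations))
```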
eb9c46f8cb test: quantize on different GPUs 2026-06-01 04:48:30 +00:00
9ce7304783 test: direct SE L1 test on different GPUs 2026-06-01 04:43:48 +00:00
ce608d0e50 test: fix gemm 1-group test params 2026-06-01 04:40:07 +00:00
c652177970 test: fix gemm 1-group test 2026-06-01 04:35:55 +00:00
793f062bbc auto: pre-test push for test_gemm_1group.py 2026-06-01 04:32:29 +00:00
86cb0e64a6 auto: pre-test push for test_se_dequant.py 2026-06-01 04:30:37 +00:00
9ba051cf49 test: fix gsa in SE multi-GPU test 2026-06-01 04:26:03 +00:00
419112dd3e auto: pre-test push for test_se_multi_gpu.py 2026-06-01 04:22:38 +00:00
2cbc7459b0 diag: fix SE scale print (cast to float first) 2026-06-01 04:14:47 +00:00
bcd7a0cf0d diag: check SE weight and scale integrity for first 3 layers 2026-06-01 04:08:21 +00:00
8ad617e2ff diag: NaN detection in shared expert gate/up split 2026-06-01 04:01:46 +00:00
a53936a17c diag: print l1_out shape warning in shared expert 2026-06-01 03:54:29 +00:00
db30c4acd6 auto: pre-test push for test_se_gpu.py 2026-06-01 03:50:53 +00:00
3dd95ce77b fix: set activation global scales AFTER _ensure_stacked/_ensure_initialized (which override them) 2026-06-01 03:43:09 +00:00
27c63b01d6 diag: remove broken SE reference comparison, add gsa/gsb print 2026-06-01 03:31:36 +00:00
9a27ed21fd diag: compare shared expert output with PyTorch reference 2026-06-01 03:25:21 +00:00
ee8318ad58 diag: handle NaN in shared expert output print 2026-06-01 03:16:25 +00:00
7000762309 diag: fix SE weight attribute name 2026-06-01 03:09:11 +00:00
fba1c06cad diag: check SE weight integrity 2026-06-01 03:02:44 +00:00
22d7cc9b7a diag: cuda sync check after shared expert for first 3 layers 2026-06-01 02:56:28 +00:00
b85fcf4d6f diag: print SE global scales for first 3 layers 2026-06-01 02:49:55 +00:00
48d93a6d2e diag: MoE input/output diagnostics for first 3 layers 2026-06-01 02:41:12 +00:00
856a459a98 fix: init l1_gsa_list and l2_gsa_list 2026-06-01 02:34:21 +00:00
66b98e5794 fix: MoE and shared expert global scale — gsb=ws2, gsa=input_scale (same bug as Nvfp4Linear) 2026-06-01 02:31:12 +00:00
f4b444b456 fix: NVFP4 global scale bug — gsb=weight_scale_2 (not input_scale*ws2), gsa=input_scale 2026-06-01 02:19:35 +00:00
1eed28dd09 diag: compare production FMHA and NVFP4 linear output with PyTorch reference 2026-06-01 02:12:39 +00:00
df394f8b40 fix: missing closing quote on string literal 2026-06-01 02:02:14 +00:00
cfd2468c61 fix: decode loop also needs int32 token_ids for hash router 2026-06-01 01:58:45 +00:00
905623793b fix: move token_ids to same GPU as router (was cuda:0 but router on cuda:N) 2026-06-01 01:49:40 +00:00
7804b779ce diag: print wo_a g_flat magnitude to find where zeros come from 2026-06-01 01:40:53 +00:00
efe63caea9 diag: print FMHA output magnitude for first 3 layers 2026-06-01 01:34:02 +00:00
7fbbdc5204 diag: validate router output before MoE 2026-06-01 01:27:16 +00:00
f5fa84016e diag: sync+error check after each layer on first token 2026-06-01 01:26:50 +00:00
91b3929605 fix: call moe_runner.run() and se_runner.run() (not __call__) 2026-06-01 01:14:38 +00:00
03c45d4bfb fix: pass int32 token_ids to hash router (was int64) 2026-06-01 01:08:03 +00:00
62efde5c9f fix: router — use cuBLAS BF16 GEMM + activation_topk CUDA kernel (production path, not CuTeDSL fused) 2026-06-01 01:01:15 +00:00
5591a725e1 fix: router kernel — infer OperandMajorMode from tensor layout (same pattern as MoE GEMM) 2026-06-01 00:59:18 +00:00
0ab5d8c317 fix: disable broken CuTeDSL fused router — use BF16 linear + activation_topk (both are production paths) 2026-06-01 00:56:00 +00:00
c339fe7ad9 fix: router A operand major mode MN (not K) — fixes CuTeDSL local_tile coord error 2026-06-01 00:54:19 +00:00
b7a8c44d26 single_shot: eager MoE/SE weight processing, stale GPU cleanup, --prefill-tokens flag 2026-06-01 00:42:08 +00:00
15f45b57c3 fix: correct Nvfp4Linear dimension inference from checkpoint weights
Weight shape (N_packed, K_packed) means:
- out_features = N_packed (GEMM output dim in BF16)
- in_features = K_packed * 2 (BF16 input dim, for activation buffer)
2026-06-01 00:32:36 +00:00
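
The dimension rule in this commit can be written out as a small helper. This is a sketch only: the name `infer_linear_dims` is hypothetical, and the factor of 2 assumes two 4-bit values are packed into each stored element along K, which matches the `K_packed * 2` rule stated above.

```python
def infer_linear_dims(weight_shape):
    """Infer (in_features, out_features) from a packed weight shape.

    weight_shape is (N_packed, K_packed) as stored in the checkpoint.
    """
    n_packed, k_packed = weight_shape
    out_features = n_packed      # GEMM output dim, taken as-is
    in_features = k_packed * 2   # two packed 4-bit values per element along K
    return in_features, out_features


# Usage: a (256, 2048) packed weight maps a 4096-wide input to 256 outputs.
in_features, out_features = infer_linear_dims((256, 2048))
print(in_features, out_features)
```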