cf2b7ab7ec
feat: NVFP4 gate projection for router (replaces BF16 cuBLAS)
The dense router now uses NVFP4 GEMM via Nvfp4Linear for the gate
projection when NVFP4 scales are available in the checkpoint. This
replaces the BF16 cuBLAS GEMM with Blackwell SM100 tensor-core
NVFP4 acceleration.
Changes:
- dsv4/layers/router.py: add gate_lin (Nvfp4Linear) alongside W_gate
  fallback. New load_nvfp4_gate() method.
- dsv4/kernels/router/dense_router_decode.py: add
  dense_router_dispatch_nvfp4() using Nvfp4Linear + activation_topk
- dsv4/kernels/router/__init__.py: export new function
- single_shot_inference.py: load NVFP4 gate weights when available,
  fall back to BF16 when not
2026-06-01 05:58:56 +00:00
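Note on cf2b7ab7ec: the "NVFP4 when available, BF16 otherwise" selection might look roughly like the sketch below. The function name, the checkpoint key suffix, and the dict-based checkpoint are all hypothetical and are not taken from the repository; only the fallback rule itself comes from the commit message.

```python
def select_gate_path(checkpoint, prefix):
    """Pick the router gate projection path for one layer.

    `checkpoint` is assumed to be a plain mapping from tensor names to
    tensors. The ".weight_scale" suffix is a hypothetical key name used
    only to illustrate the check for NVFP4 scales.
    """
    scale_key = prefix + ".weight_scale"  # hypothetical key name
    if scale_key in checkpoint:
        return "nvfp4"  # scales present: use the NVFP4 GEMM path
    return "bf16"       # no scales: keep the BF16 fallback


# Usage: one layer ships NVFP4 scales, the other does not.
ckpt = {"layers.0.gate.weight": b"", "layers.0.gate.weight_scale": b"",
        "layers.1.gate.weight": b""}
print(select_gate_path(ckpt, "layers.0.gate"))
print(select_gate_path(ckpt, "layers.1.gate"))
```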
9f14cb17d1
test: add compressor position_bias unit test
Verifies CUDA kernel matches PyTorch reference with and without
position_bias for both CSA (m=4) and HCA (m=128) paths.
2026-06-01 05:55:05 +00:00
84ca520bfb
fix: move compressor position_bias into CUDA kernel (was Python loop)
The compressor_reduce.cu kernel now adds position_bias to BOTH kv and
gate values, matching the PyTorch reference. Previously the kernel only
added it to gate, and a Python workaround loop was adding it to both
before the kernel call (then passing None to the kernel).
Changes:
- compressor_reduce.cu: add position_bias to kv_val in pass 2 (CSA + HCA)
- single_shot_inference.py: remove Python position_bias loop, pass
  self.ape directly to csa/hca_compress_production
- production_compress.py: already supports position_bias passthrough
2026-06-01 05:54:44 +00:00
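Note on 84ca520bfb: a minimal pure-Python sketch of the behaviour described above, where position_bias is added to both the kv and gate values before the token-level softmax and weighted sum. The shapes (m tokens by d channels for all three inputs), the per-channel softmax over tokens, and the function name are assumptions for illustration; kv_norm is omitted, and this is neither the CUDA kernel nor the project's PyTorch reference.

```python
import math


def compress_entry(kv, gate, position_bias=None):
    """Reduce m tokens to one compressed entry (illustrative only).

    kv, gate, position_bias: lists of m rows with d floats each (assumed).
    """
    m, d = len(kv), len(kv[0])
    if position_bias is not None:
        # The bias goes into BOTH streams, as the commit message states.
        kv = [[kv[t][c] + position_bias[t][c] for c in range(d)]
              for t in range(m)]
        gate = [[gate[t][c] + position_bias[t][c] for c in range(d)]
                for t in range(m)]
    out = []
    for c in range(d):
        logits = [gate[t][c] for t in range(m)]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        out.append(sum(w * kv[t][c] for t, w in enumerate(weights)) / total)
    return out


# Usage: two tokens, one channel, equal gate logits.
print(compress_entry([[1.0], [3.0]], [[0.0], [0.0]]))
print(compress_entry([[1.0], [3.0]], [[0.0], [0.0]], [[1.0], [1.0]]))
```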
311fae490f
tune: reduce verbose diagnostics, print every decode step
v-e2e-paris-32tok-20260601-0549
2026-06-01 05:40:48 +00:00
df8acae66b
fix: rewrite compressor_reduce.cu — no extern shared mem, proper bounds checks
v-single-shot-paris-20260601-0539
2026-06-01 05:24:18 +00:00
62041b78bf
fix: import torch.utils.cpp_extension explicitly in production_compress
2026-06-01 05:20:44 +00:00
2155fd6c90
test: production compressor kernel unit test
2026-06-01 05:19:13 +00:00
b380028c49
feat: production compressor/indexer — NVFP4 GEMM + CUDA softmax/reduce kernel
- New compressor_reduce.cu: CSA/HCA token-level softmax + weighted sum + kv_norm
  One block per compressed entry, 128 threads, FP32 accumulation
  CSA: overlapping Ca/Cb streams (2m tokens per block)
  HCA: single stream (m tokens per block)
  Includes apply_kv_norm kernel (unweighted RMSNorm + weight)
- New production_compress.py: Python wrapper for CUDA kernels
- single_shot_inference.py: Compressor/Indexer now use production Nvfp4Linear
  for kv_proj, gate_proj, q_b_proj, weights_proj projections
  Then CUDA reduce kernel for softmax + weighted sum
  No more PyTorch reference nvfp4_linear_ref in compressor/indexer path
2026-06-01 05:18:59 +00:00
6e53e3007c
fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow
v-working-e2e-20260601-0515
2026-06-01 04:59:06 +00:00
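Note on 6e53e3007c: the largest finite FP8 E4M3 value is 448, so a per-block maximum above that cannot be represented once it is stored in that format. The sketch below shows the clamp in plain Python. The 16-element block size is the standard NVFP4 block size and is assumed to apply here; the function name is hypothetical, and the real quantize_activation_nvfp4 is not reproduced.

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def clamped_block_amax(values, block=16, limit=E4M3_MAX):
    """Return the per-block absolute maximum, clamped to `limit`.

    `values` is a flat list of floats whose length is assumed to be a
    multiple of `block`.
    """
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        out.append(min(amax, limit))  # clamp before any cast to E4M3
    return out


# Usage: the second block contains an outlier above the E4M3 range.
data = [1.0] * 16 + [1000.0] + [0.0] * 15
print(clamped_block_amax(data))
```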
eb9c46f8cb
test: quantize on different GPUs
2026-06-01 04:48:30 +00:00
9ce7304783
test: direct SE L1 test on different GPUs
2026-06-01 04:43:48 +00:00
ce608d0e50
test: fix gemm 1-group test params
2026-06-01 04:40:07 +00:00
c652177970
test: fix gemm 1-group test
2026-06-01 04:35:55 +00:00
793f062bbc
auto: pre-test push for test_gemm_1group.py
2026-06-01 04:32:29 +00:00
86cb0e64a6
auto: pre-test push for test_se_dequant.py
2026-06-01 04:30:37 +00:00
9ba051cf49
test: fix gsa in SE multi-GPU test
2026-06-01 04:26:03 +00:00
419112dd3e
auto: pre-test push for test_se_multi_gpu.py
2026-06-01 04:22:38 +00:00
2cbc7459b0
diag: fix SE scale print (cast to float first)
2026-06-01 04:14:47 +00:00
bcd7a0cf0d
diag: check SE weight and scale integrity for first 3 layers
2026-06-01 04:08:21 +00:00
8ad617e2ff
diag: NaN detection in shared expert gate/up split
2026-06-01 04:01:46 +00:00
a53936a17c
diag: print l1_out shape warning in shared expert
2026-06-01 03:54:29 +00:00
db30c4acd6
auto: pre-test push for test_se_gpu.py
2026-06-01 03:50:53 +00:00
3dd95ce77b
fix: set activation global scales AFTER _ensure_stacked/_ensure_initialized (which override them)
2026-06-01 03:43:09 +00:00
27c63b01d6
diag: remove broken SE reference comparison, add gsa/gsb print
2026-06-01 03:31:36 +00:00
9a27ed21fd
diag: compare shared expert output with PyTorch reference
2026-06-01 03:25:21 +00:00
ee8318ad58
diag: handle NaN in shared expert output print
2026-06-01 03:16:25 +00:00
7000762309
diag: fix SE weight attribute name
2026-06-01 03:09:11 +00:00
fba1c06cad
diag: check SE weight integrity
2026-06-01 03:02:44 +00:00
22d7cc9b7a
diag: cuda sync check after shared expert for first 3 layers
2026-06-01 02:56:28 +00:00
b85fcf4d6f
diag: print SE global scales for first 3 layers
2026-06-01 02:49:55 +00:00
48d93a6d2e
diag: MoE input/output diagnostics for first 3 layers
2026-06-01 02:41:12 +00:00
856a459a98
fix: init l1_gsa_list and l2_gsa_list
2026-06-01 02:34:21 +00:00
66b98e5794
fix: MoE and shared expert global scale — gsb=ws2, gsa=input_scale (same bug as Nvfp4Linear)
2026-06-01 02:31:12 +00:00
f4b444b456
fix: NVFP4 global scale bug — gsb=weight_scale_2 (not input_scale*ws2), gsa=input_scale
2026-06-01 02:19:35 +00:00
1eed28dd09
diag: compare production FMHA and NVFP4 linear output with PyTorch reference
2026-06-01 02:12:39 +00:00
df394f8b40
fix: missing closing quote on string literal
2026-06-01 02:02:14 +00:00
cfd2468c61
fix: decode loop also needs int32 token_ids for hash router
2026-06-01 01:58:45 +00:00
905623793b
fix: move token_ids to same GPU as router (was cuda:0 but router on cuda:N)
2026-06-01 01:49:40 +00:00
7804b779ce
diag: print wo_a g_flat magnitude to find where zeros come from
2026-06-01 01:40:53 +00:00
efe63caea9
diag: print FMHA output magnitude for first 3 layers
2026-06-01 01:34:02 +00:00
7fbbdc5204
diag: validate router output before MoE
2026-06-01 01:27:16 +00:00
f5fa84016e
diag: sync+error check after each layer on first token
2026-06-01 01:26:50 +00:00
91b3929605
fix: call moe_runner.run() and se_runner.run() (not __call__)
2026-06-01 01:14:38 +00:00
03c45d4bfb
fix: pass int32 token_ids to hash router (was int64)
2026-06-01 01:08:03 +00:00
62efde5c9f
fix: router — use cuBLAS BF16 GEMM + activation_topk CUDA kernel (production path, not CuTeDSL fused)
2026-06-01 01:01:15 +00:00
5591a725e1
fix: router kernel — infer OperandMajorMode from tensor layout (same pattern as MoE GEMM)
2026-06-01 00:59:18 +00:00
0ab5d8c317
fix: disable broken CuTeDSL fused router — use BF16 linear + activation_topk (both are production paths)
2026-06-01 00:56:00 +00:00
c339fe7ad9
fix: router A operand major mode MN (not K) — fixes CuTeDSL local_tile coord error
2026-06-01 00:54:19 +00:00
b7a8c44d26
single_shot: eager MoE/SE weight processing, stale GPU cleanup, --prefill-tokens flag
2026-06-01 00:42:08 +00:00
15f45b57c3
fix: correct Nvfp4Linear dimension inference from checkpoint weights
Weight shape (N_packed, K_packed) means:
- out_features = N_packed (GEMM output dim in BF16)
- in_features = K_packed * 2 (BF16 input dim, for activation buffer)
2026-06-01 00:32:36 +00:00
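Note on 15f45b57c3: the dimension rule above, written out as a small helper. The function name is hypothetical. The factor of 2 follows the commit message and corresponds to two 4-bit values packed per stored element along K.

```python
def infer_linear_dims(weight_shape):
    """Infer (out_features, in_features) from a packed weight shape.

    weight_shape is (N_packed, K_packed) as stored in the checkpoint.
    """
    n_packed, k_packed = weight_shape
    out_features = n_packed     # GEMM output dim, not packed
    in_features = k_packed * 2  # two FP4 values per packed element
    return out_features, in_features


# Usage: a (256, 2048) packed weight maps a 4096-wide input to 256 outputs.
print(infer_linear_dims((256, 2048)))
```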