2cbc7459b0
diag: fix SE scale print (cast to float first)
2026-06-01 04:14:47 +00:00
bcd7a0cf0d
diag: check SE weight and scale integrity for first 3 layers
2026-06-01 04:08:21 +00:00
8ad617e2ff
diag: NaN detection in shared expert gate/up split
2026-06-01 04:01:46 +00:00
a53936a17c
diag: print l1_out shape warning in shared expert
2026-06-01 03:54:29 +00:00
db30c4acd6
auto: pre-test push for test_se_gpu.py
2026-06-01 03:50:53 +00:00
3dd95ce77b
fix: set activation global scales AFTER _ensure_stacked/_ensure_initialized (which override them)
2026-06-01 03:43:09 +00:00
27c63b01d6
diag: remove broken SE reference comparison, add gsa/gsb print
2026-06-01 03:31:36 +00:00
9a27ed21fd
diag: compare shared expert output with PyTorch reference
2026-06-01 03:25:21 +00:00
ee8318ad58
diag: handle NaN in shared expert output print
2026-06-01 03:16:25 +00:00
7000762309
diag: fix SE weight attribute name
2026-06-01 03:09:11 +00:00
fba1c06cad
diag: check SE weight integrity
2026-06-01 03:02:44 +00:00
22d7cc9b7a
diag: cuda sync check after shared expert for first 3 layers
2026-06-01 02:56:28 +00:00
b85fcf4d6f
diag: print SE global scales for first 3 layers
2026-06-01 02:49:55 +00:00
48d93a6d2e
diag: MoE input/output diagnostics for first 3 layers
2026-06-01 02:41:12 +00:00
856a459a98
fix: init l1_gsa_list and l2_gsa_list
2026-06-01 02:34:21 +00:00
66b98e5794
fix: MoE and shared expert global scale — gsb=ws2, gsa=input_scale (same bug as Nvfp4Linear)
2026-06-01 02:31:12 +00:00
f4b444b456
fix: NVFP4 global scale bug — gsb=weight_scale_2 (not input_scale*ws2), gsa=input_scale
2026-06-01 02:19:35 +00:00
1eed28dd09
diag: compare production FMHA and NVFP4 linear output with PyTorch reference
2026-06-01 02:12:39 +00:00
df394f8b40
fix: missing closing quote on string literal
2026-06-01 02:02:14 +00:00
cfd2468c61
fix: decode loop also needs int32 token_ids for hash router
2026-06-01 01:58:45 +00:00
905623793b
fix: move token_ids to same GPU as router (was cuda:0 but router on cuda:N)
2026-06-01 01:49:40 +00:00
7804b779ce
diag: print wo_a g_flat magnitude to find where zeros come from
2026-06-01 01:40:53 +00:00
efe63caea9
diag: print FMHA output magnitude for first 3 layers
2026-06-01 01:34:02 +00:00
7fbbdc5204
diag: validate router output before MoE
2026-06-01 01:27:16 +00:00
f5fa84016e
diag: sync+error check after each layer on first token
2026-06-01 01:26:50 +00:00
91b3929605
fix: call moe_runner.run() and se_runner.run() (not __call__)
2026-06-01 01:14:38 +00:00
03c45d4bfb
fix: pass int32 token_ids to hash router (was int64)
2026-06-01 01:08:03 +00:00
62efde5c9f
fix: router — use cuBLAS BF16 GEMM + activation_topk CUDA kernel (production path, not CuTeDSL fused)
2026-06-01 01:01:15 +00:00
5591a725e1
fix: router kernel — infer OperandMajorMode from tensor layout (same pattern as MoE GEMM)
2026-06-01 00:59:18 +00:00
0ab5d8c317
fix: disable broken CuTeDSL fused router — use BF16 linear + activation_topk (both are production paths)
2026-06-01 00:56:00 +00:00
c339fe7ad9
fix: router A operand major mode MN (not K) — fixes CuTeDSL local_tile coord error
2026-06-01 00:54:19 +00:00
b7a8c44d26
single_shot: eager MoE/SE weight processing, stale GPU cleanup, --prefill-tokens flag
2026-06-01 00:42:08 +00:00
15f45b57c3
fix: correct Nvfp4Linear dimension inference from checkpoint weights
...
Weight shape (N_packed, K_packed) means:
- out_features = N_packed (GEMM output dim in BF16)
- in_features = K_packed * 2 (BF16 input dim, for activation buffer)
2026-06-01 00:32:36 +00:00
e671780008
fix: transpose checkpoint weights before make_b_k_major in Nvfp4Linear/SharedExpert
...
Critical bug: checkpoint weights are (N_packed, K_packed) N-major format,
but make_b_k_major expects (E, K_packed, N_packed) input. Without the
permute, the K and N dimensions are swapped, producing garbage output
with wrong dimensions (e.g., q_a output was 3584 instead of 1536).
Also fix scale assembly: checkpoint scales are (N, K_sf) which should
use assemble_raw_scales_2d3d_3d_side (no transpose), not
assemble_scales_3d_side (which incorrectly transposes K_sf↔N).
2026-06-01 00:30:37 +00:00
e8a7a9256f
fix: convert uint8 checkpoint weights to float4_e2m1fn_x2 for CuTeDSL GEMM
...
The CuTeDSL kernel expects float4_e2m1fn_x2 dtype for FP4 weight tensors,
but checkpoint weights from safetensors are loaded as uint8. The uint8 and
float4_e2m1fn_x2 have the same byte representation, so .view() is safe.
Fixed in:
- Nvfp4Linear.finalize_weights()
- Nvfp4SharedExpert.finalize_weights()
- Nvfp4MoE._ensure_stacked() (both stacked and legacy paths)
2026-06-01 00:18:34 +00:00
172448514c
fix: fold weight_scale_2 into global_scale_b for NVFP4 GEMM
...
Critical bug fix: weight_scale_2 (the second-level NVFP4 scale) was
being dropped entirely in the production pipeline. The dequant formula
is lut[w] * weight_scale * weight_scale_2, so weight_scale_2 must be
folded into the GEMM's global_scale_b parameter.
Fixes in:
- Nvfp4Linear: ws2 field, folded in finalize_weights()
- Nvfp4MoE: l1_ws2/l2_ws2 lists, folded in _ensure_stacked()
- Nvfp4SharedExpert: l1_ws2/l2_ws2 lists, folded in finalize_weights()
- single_shot_inference.py: pass weight_scale_2 through all loading paths
- Also fix missing o_a_prod key fallback in attention output
2026-06-01 00:10:50 +00:00
563df02aef
fix: import SF_VEC_SIZE from quantize in gemm_runner (was NameError)
2026-06-01 00:04:48 +00:00
be476b2ce2
router: catch CuTeDSL warmup failures fast, don't let MLIR errors slow down init
2026-06-01 00:00:07 +00:00
56dff8d185
fix: W_gate is (H, E) but F.linear expects (E, H), transpose before linear
2026-05-31 23:55:16 +00:00
5396a04c28
router: broaden except to catch all CuTeDSL errors, fall through to cuBLAS+activation_topk path
2026-05-31 23:54:16 +00:00
3b5b9f487c
fix: compute num_tma_load_bytes inside cute.compile context
2026-05-31 23:53:13 +00:00
1bc0da0f35
fix: properly scope swap code inside else/guard blocks, replace continue with if guard
2026-05-31 23:51:43 +00:00
d0d765e1f2
fix: replace break statements with flag-based loops in router kernel (CuTeDSL restriction)
2026-05-31 23:50:39 +00:00
210391e571
fix: PersistentTileSchedulerParams constructor takes (problem_shape, cluster_shape) not from_shape
2026-05-31 23:49:12 +00:00
824d054ad7
fix: inside cute.compile args are already CuTe tensors, no conversion needed
2026-05-31 23:47:33 +00:00
6375e54396
fix: use from_dlpack + mark_layout_dynamic instead of non-existent to_cuTe_tensor in router
2026-05-31 23:46:35 +00:00
cb2ca8591f
fix: add @cute.jit to router compiled function
2026-05-31 23:44:53 +00:00
d5d2b7b4b8
fix: defer router MMA/TMA setup into cute.compile context (matches MoE pattern)
2026-05-31 23:44:00 +00:00
157f1c5258
fix: use OperandMajorMode from nvgpu (not deprecated tcgen05) and mma_tiler_mn in router kernel
2026-05-31 23:39:50 +00:00
1dbc57e2cd
fix: use mma_tiler_mn in _create_tiled_mma (attribute exists at init time)
2026-05-31 23:36:01 +00:00