-
fbc1e883f2
Add try/except around fused NVFP4 gate loading with error reporting
biondizzle
2026-06-01 11:08:06 +00:00
-
5f38430423
Fix: use 1-dim tensors for gate_ws2 and gate_input_scale
biondizzle
2026-06-01 11:05:09 +00:00
-
ec8f292112
Fix: use self.mma_tiler_mnk (full K=64) for SMEM layout computation
biondizzle
2026-06-01 11:03:08 +00:00
-
44fb9b6c00
Fix: pass self.mma_tiler_mnk (full K) to _compute_stages, not self.mma_tiler (K=1 placeholder)
biondizzle
2026-06-01 10:55:43 +00:00
-
be2bb2fe84
Fix: self.mma_tiler_mnk not mma_tiler_mnk
biondizzle
2026-06-01 10:49:05 +00:00
-
c082843ecc
Fix: mma_tiler K=1 placeholder in __init__, refined in _setup_attributes
biondizzle
2026-06-01 10:42:21 +00:00
-
e0f60b9f05
Fix fused router: plain ints for mma_tiler + @cute.jit pattern
biondizzle
2026-06-01 10:37:15 +00:00
-
057ae2101e
CRITICAL FIX: Move tiled_mma creation and _setup_attributes OUTSIDE @cute.jit
biondizzle
2026-06-01 10:28:01 +00:00
-
71deeb91a9
Quantize BF16 gate weight to NVFP4 for fused router + add global scales to GEMM
biondizzle
2026-06-01 10:14:29 +00:00
-
24fed15ed6
Fix: convert PyTorch tensors to CuTe tensors for fused router kernel
biondizzle
2026-06-01 10:02:40 +00:00
-
bab748763e
Rewrite NVFP4 fused router kernel: MoE-style epilogue replaces broken SMEM merge
biondizzle
2026-06-01 09:59:34 +00:00
-
31ebe4f2db
Wire NVFP4 fused router kernel into e2e single-shot pipeline
biondizzle
2026-06-01 09:47:48 +00:00
-
d9d3ca42b0
Fix: mma_tiler and cluster_layout must use MLIR values for cute.slice_
biondizzle
2026-06-01 09:42:17 +00:00
-
ec79f30709
Fix: PersistentTileSchedulerParams cluster_shape must be Python ints not MLIR values
biondizzle
2026-06-01 09:38:08 +00:00
-
28d0cb4f41
Revert cutlass.Int32 wrapping — now inside @cute.jit, cute.round_up works
biondizzle
2026-06-01 09:35:03 +00:00
-
b536f99192
CRITICAL FIX: move ALL CuTe DSL setup inside @cute.jit context
biondizzle
2026-06-01 09:32:05 +00:00
-
65669596d4
Fix: all CuTe shape values must be cutlass.Int32 for MLIR compatibility
biondizzle
2026-06-01 09:30:15 +00:00
-
df48dacc2b
Fix: set mma_inst_shape_mn in __init__ before _create_tiled_mma call
biondizzle
2026-06-01 09:22:24 +00:00
-
28f78420c2
Fix: quantize_activation_nvfp4 API - correct signature and return values
biondizzle
2026-06-01 09:21:04 +00:00
-
7b3f6cb13c
Fix fused router: use run_nvfp4_fused_router wrapper, correct CuTe tensor API
biondizzle
2026-06-01 09:19:48 +00:00
-
483e759d53
Fix: use tensor.mark_layout_dynamic() method (not cute.mark_layout_dynamic)
biondizzle
2026-06-01 09:16:33 +00:00
-
2412745b21
Test fix: slice NVFP4 logits to actual expert count (GEMM padding)
biondizzle
2026-06-01 09:15:06 +00:00
-
f33ca41c2a
Fused router: replace nested if/else top-k with flat find-min-replace approach
biondizzle
2026-06-01 09:13:53 +00:00
-
4f4ae8febd
Test: enumerate CuTeDSL math API to check available operations
biondizzle
2026-06-01 09:11:29 +00:00
-
9b86b2b414
Test: fix fused router test - proper NVFP4 quantization and CuTe tensor setup
biondizzle
2026-06-01 08:56:20 +00:00
-
b94f8d4ed8
Test: fused router kernel vs BF16 reference path
biondizzle
2026-06-01 08:54:24 +00:00
-
2433700a69
Fused router kernel: rewrite epilogue with proper CuTeDSL constructs
biondizzle
2026-06-01 08:49:39 +00:00
-
d01b4b02de
Complete NVFP4 fused router kernel: full MMA + router epilogue
biondizzle
2026-06-01 08:37:10 +00:00
-
25b9a5f32d
Fix test: use from_dlpack for c_tensor
biondizzle
2026-06-01 07:55:29 +00:00
-
d2819fc39c
Fix test: use as_tensor instead of make_tensor
biondizzle
2026-06-01 07:54:36 +00:00
-
5ea71ebd78
Add NVFP4 CuTeDSL compilation test (verify MmaMXF4NVF4Op compiles)
biondizzle
2026-06-01 07:53:43 +00:00
-
fa6dbd4aa2
WIP: Rewrite NVFP4 fused router in CuTeDSL with MmaMXF4NVF4Op (sf_vec_size=16)
biondizzle
2026-06-01 07:53:21 +00:00
-
4f706b55d7
Remove raw CUDA C++ fused router and DeepGEMM (MXFP4, wrong instruction)
biondizzle
2026-06-01 07:51:31 +00:00
-
424fe6bf2c
Fix: use SM100_MMA_MXF8F6F4_SS (not MXF4) to match Nvfp4Linear path
biondizzle
2026-06-01 07:44:53 +00:00
-
2e2caadf7d
WIP: NVFP4 fused router kernel in raw CUDA C++ using DeepGEMM primitives
biondizzle
2026-06-01 07:41:42 +00:00
-
e3ea609ddd
Embed DeepGEMM source (not submodule) for SM100 raw CUDA GEMM primitives
biondizzle
2026-06-01 07:39:40 +00:00
-
dae83723a3
Add DeepGEMM as third-party dependency for SM100 raw CUDA GEMM primitives
biondizzle
2026-06-01 07:39:38 +00:00
-
ef4c0ad489
Fix BF16 router mma_tiler: use cutlass.Int32 for CuTe DSL compatibility
biondizzle
2026-06-01 07:29:30 +00:00
-
79be9cb8da
Fix: hardcode mma_inst_shape_k=32 for NVFP4 (avoids MLIR unpack error in JIT)
biondizzle
2026-06-01 07:20:23 +00:00
-
c3a64ceed7
Fix: mma_tiler must use CuTe Ints for static layout construction
biondizzle
2026-06-01 07:19:15 +00:00
-
39b481e52b
Ensure mma_tiler contains CuTe Ints for cute.slice_ compatibility
biondizzle
2026-06-01 07:16:47 +00:00
-
57cc20d5ad
Fix SFA/SFB SMEM: blockscaled layouts are plain Layout (no .outer/.inner swizzle)
biondizzle
2026-06-01 07:14:45 +00:00
-
fcd7680583
Fix CuTe tensor creation: use from_dlpack + mark_layout_dynamic
biondizzle
2026-06-01 07:12:52 +00:00
-
3a8c6daeb3
Fix: cutlass_torch.make_tensor -> as_tensor
biondizzle
2026-06-01 07:11:43 +00:00
-
0553117af6
Simplify fused router test: compare fused vs 2-kernel NVFP4 path
biondizzle
2026-06-01 07:10:55 +00:00
-
44a0e59808
Fix fused router test: use quantize_weight_to_nvfp4 (correct function name)
biondizzle
2026-06-01 07:08:56 +00:00
-
940f37fb6c
NVFP4 fused router kernel: full rewrite with proper block-scaled GEMM setup
v-nvfp4-fused-router-rewrite-20260601-0715
biondizzle
2026-06-01 07:08:12 +00:00
-
8658c8eca5
fix: add sf_vec_size parameter back to Nvfp4FusedRouterKernel __init__
biondizzle
2026-06-01 07:01:02 +00:00
-
b97f30e289
fix: store sf_vec_size as instance variable
biondizzle
2026-06-01 06:56:33 +00:00
-
c225d195ea
fix: remove tcgen05.mma.Kind (doesn't exist), use make_blockscaled_trivial_tiled_mma
biondizzle
2026-06-01 06:54:49 +00:00
-
e6803b450d
rewrite: simplified fused router test (reference + import check)
biondizzle
2026-06-01 06:53:17 +00:00
-
262cec262d
fix: add shape assertions to fused router test
biondizzle
2026-06-01 06:51:47 +00:00
-
db07d17a62
fix: set activation global scale in fused router test
biondizzle
2026-06-01 06:50:41 +00:00
-
2abb4a19d9
fix: set gs and ws2 fields for Nvfp4Linear in fused router test
biondizzle
2026-06-01 06:49:43 +00:00
-
61c04f7152
fix: Nvfp4Linear field is sf not scale_b
biondizzle
2026-06-01 06:48:39 +00:00
-
982f245c67
fix: use correct Nvfp4Linear field names (fp4, scale_b, gsb)
biondizzle
2026-06-01 06:47:15 +00:00
-
16af96380f
fix: use internal fields for Nvfp4Linear weight setup in test
biondizzle
2026-06-01 06:46:05 +00:00
-
7f1f224c78
fix: quantize_weight_to_nvfp4 returns 3 values, not 4
biondizzle
2026-06-01 06:43:53 +00:00
-
27fd847dd0
fix: correct quantize function name in fused router test
biondizzle
2026-06-01 06:41:54 +00:00
-
0873d65253
test: add fused router kernel test
biondizzle
2026-06-01 06:40:46 +00:00
-
90b2581dfe
feat: NVFP4 fused router CuTeDSL kernel (WIP)
biondizzle
2026-06-01 06:40:21 +00:00
-
6c28c57b6a
feat: Nvfp4GroupedLinear for o_a_proj (replaces BF16 grouped BMM)
v-nvfp4-router-oa-20260601-0610
biondizzle
2026-06-01 06:00:36 +00:00
-
cf2b7ab7ec
feat: NVFP4 gate projection for router (replaces BF16 cuBLAS)
biondizzle
2026-06-01 05:58:56 +00:00
-
9f14cb17d1
test: add compressor position_bias unit test
biondizzle
2026-06-01 05:55:05 +00:00
-
84ca520bfb
fix: move compressor position_bias into CUDA kernel (was Python loop)
biondizzle
2026-06-01 05:54:44 +00:00
-
311fae490f
tune: reduce verbose diagnostics, print every decode step
v-e2e-paris-32tok-20260601-0549
biondizzle
2026-06-01 05:40:48 +00:00
-
df8acae66b
fix: rewrite compressor_reduce.cu — no extern shared mem, proper bounds checks
v-single-shot-paris-20260601-0539
biondizzle
2026-06-01 05:24:18 +00:00
-
62041b78bf
fix: import torch.utils.cpp_extension explicitly in production_compress
biondizzle
2026-06-01 05:20:44 +00:00
-
2155fd6c90
test: production compressor kernel unit test
biondizzle
2026-06-01 05:19:13 +00:00
-
b380028c49
feat: production compressor/indexer — NVFP4 GEMM + CUDA softmax/reduce kernel
biondizzle
2026-06-01 05:18:59 +00:00
-
6e53e3007c
fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow
v-working-e2e-20260601-0515
biondizzle
2026-06-01 04:59:06 +00:00
-
eb9c46f8cb
test: quantize on different GPUs
biondizzle
2026-06-01 04:48:30 +00:00
-
9ce7304783
test: direct SE L1 test on different GPUs
biondizzle
2026-06-01 04:43:48 +00:00
-
ce608d0e50
test: fix gemm 1-group test params
biondizzle
2026-06-01 04:40:07 +00:00
-
c652177970
test: fix gemm 1-group test
biondizzle
2026-06-01 04:35:55 +00:00
-
793f062bbc
auto: pre-test push for test_gemm_1group.py
biondizzle
2026-06-01 04:32:29 +00:00
-
86cb0e64a6
auto: pre-test push for test_se_dequant.py
biondizzle
2026-06-01 04:30:37 +00:00
-
9ba051cf49
test: fix gsa in SE multi-GPU test
biondizzle
2026-06-01 04:26:03 +00:00
-
419112dd3e
auto: pre-test push for test_se_multi_gpu.py
biondizzle
2026-06-01 04:22:38 +00:00
-
2cbc7459b0
diag: fix SE scale print (cast to float first)
biondizzle
2026-06-01 04:14:47 +00:00
-
bcd7a0cf0d
diag: check SE weight and scale integrity for first 3 layers
biondizzle
2026-06-01 04:08:21 +00:00
-
8ad617e2ff
diag: NaN detection in shared expert gate/up split
biondizzle
2026-06-01 04:01:46 +00:00
-
a53936a17c
diag: print l1_out shape warning in shared expert
biondizzle
2026-06-01 03:54:29 +00:00
-
db30c4acd6
auto: pre-test push for test_se_gpu.py
biondizzle
2026-06-01 03:50:53 +00:00
-
3dd95ce77b
fix: set activation global scales AFTER _ensure_stacked/_ensure_initialized (which override them)
biondizzle
2026-06-01 03:43:09 +00:00
-
27c63b01d6
diag: remove broken SE reference comparison, add gsa/gsb print
biondizzle
2026-06-01 03:31:36 +00:00
-
9a27ed21fd
diag: compare shared expert output with PyTorch reference
biondizzle
2026-06-01 03:25:21 +00:00
-
ee8318ad58
diag: handle NaN in shared expert output print
biondizzle
2026-06-01 03:16:25 +00:00
-
7000762309
diag: fix SE weight attribute name
biondizzle
2026-06-01 03:09:11 +00:00
-
fba1c06cad
diag: check SE weight integrity
biondizzle
2026-06-01 03:02:44 +00:00
-
22d7cc9b7a
diag: cuda sync check after shared expert for first 3 layers
biondizzle
2026-06-01 02:56:28 +00:00
-
b85fcf4d6f
diag: print SE global scales for first 3 layers
biondizzle
2026-06-01 02:49:55 +00:00
-
48d93a6d2e
diag: MoE input/output diagnostics for first 3 layers
biondizzle
2026-06-01 02:41:12 +00:00
-
856a459a98
fix: init l1_gsa_list and l2_gsa_list
biondizzle
2026-06-01 02:34:21 +00:00
-
66b98e5794
fix: MoE and shared expert global scale — gsb=ws2, gsa=input_scale (same bug as Nvfp4Linear)
biondizzle
2026-06-01 02:31:12 +00:00
-
f4b444b456
fix: NVFP4 global scale bug — gsb=weight_scale_2 (not input_scale*ws2), gsa=input_scale
biondizzle
2026-06-01 02:19:35 +00:00
-
1eed28dd09
diag: compare production FMHA and NVFP4 linear output with PyTorch reference
biondizzle
2026-06-01 02:12:39 +00:00
-
df394f8b40
fix: missing closing quote on string literal
biondizzle
2026-06-01 02:02:14 +00:00
-
cfd2468c61
fix: decode loop also needs int32 token_ids for hash router
biondizzle
2026-06-01 01:58:45 +00:00
-
905623793b
fix: move token_ids to same GPU as router (was cuda:0 but router on cuda:N)
biondizzle
2026-06-01 01:49:40 +00:00
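
Note on f33ca41c2a ("replace nested if/else top-k with flat find-min-replace approach"): the commit message names the technique but the log does not show the code. The sketch below is a plain-Python illustration of a flat find-min-replace top-k, not the CuTeDSL kernel from the repository. The function name `topk_find_min_replace` and its signature are hypothetical, chosen only for this example. It assumes the idea is to keep k running slots, locate the slot holding the smallest kept score, and overwrite it whenever a larger candidate arrives.

```python
def topk_find_min_replace(scores, k):
    """Return (values, indices) of the k largest scores, largest first.

    Hypothetical helper for illustration only; not taken from the repository.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be in the range 1..len(scores)")

    # Seed the k slots with the first k candidates.
    vals = list(scores[:k])
    idxs = list(range(k))

    for i in range(k, len(scores)):
        # Flat scan: find the slot that currently holds the minimum.
        min_slot = 0
        for s in range(1, k):
            if vals[s] < vals[min_slot]:
                min_slot = s
        # Replace that slot only if the new candidate is strictly larger.
        if scores[i] > vals[min_slot]:
            vals[min_slot] = scores[i]
            idxs[min_slot] = i

    # Sort the k survivors so the output order is deterministic.
    order = sorted(range(k), key=lambda s: -vals[s])
    return [vals[s] for s in order], [idxs[s] for s in order]


values, indices = topk_find_min_replace([0.1, 0.9, 0.3, 0.7, 0.5], k=2)
print(values, indices)
```

The loop body has the same shape for every candidate, with one scan and one conditional store, which is presumably what "flat" refers to in contrast to a nested if/else chain. Whether the kernel sorts its k results afterwards is not stated in the log.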