Commit Graph

2077 Commits

Author SHA1 Message Date
9b86b2b414 Test: fix fused router test - proper NVFP4 quantization and CuTe tensor setup
- Use quantize_to_nvfp4 for weight quantization
- Use quantize_activation_nvfp4 with computed global_scale
- Get mat_b and scale_b from Nvfp4Linear after finalize_weights
- Compare against both BF16 reference and NVFP4 GEMM reference
2026-06-01 08:56:20 +00:00
b94f8d4ed8 Test: fused router kernel vs BF16 reference path
- BF16 GEMM + activation_topk as reference
- NVFP4 GEMM + fused router epilogue as test target
- Proper NVFP4 quantization and CuTe tensor creation
- Cosine similarity and topk_ids matching validation
2026-06-01 08:54:24 +00:00
2433700a69 Fused router kernel: rewrite epilogue with proper CuTeDSL constructs
- Replace Python lists with individual scalar variables (s0..s5, i0..i5, a0..a5)
- Replace min-heap sift-down with fully unrolled sorted insertion
  (descending order, no dynamic indexing, no while loops)
- Replace raw SMEM pointer arithmetic with CuTeDSL SMEM tensors
  (s_merge_s, s_merge_i, s_merge_a)
- Replace cute.where with cute.math.fmax
- Fix expert index calculation: col + tile_n_offset + subtile_idx * epi_n
- Top-6 accumulates across all N-tiles (for E=384 with 3 tiles of 128)
- Add iter_acc_early_release for overlapping accumulator
- Rewrite test to compare fused kernel vs 2-kernel reference path
- Remove stale memory doc
2026-06-01 08:49:39 +00:00
d01b4b02de Complete NVFP4 fused router kernel: full MMA + router epilogue
- TMA warp: persistent tile scheduling + TMA loads for A/B/SFA/SFB
- MMA warp: blockscaled GEMM (tcgen05.mma.block_scale) with S2T copy
  for SFA/SFB, proper pipeline synchronization (AB + Acc pipelines)
- Epilogue warps: TMEM->register via epilogue_tmem_copy_and_partition,
  sqrt(softplus) + e_bias + min-heap top-k + renormalization
- Python wrapper: run_nvfp4_fused_router() with proper CuTe tensor
  creation via from_dlpack + mark_layout_dynamic
- Single-kernel path, no BF16 fallback, no intermediate GMEM buffer
- Following exact patterns from MoE fused_swiglu.py kernel
2026-06-01 08:37:10 +00:00
25b9a5f32d Fix test: use from_dlpack for c_tensor 2026-06-01 07:55:29 +00:00
d2819fc39c Fix test: use as_tensor instead of make_tensor 2026-06-01 07:54:36 +00:00
5ea71ebd78 Add NVFP4 CuTeDSL compilation test (verify MmaMXF4NVF4Op compiles) 2026-06-01 07:53:43 +00:00
fa6dbd4aa2 WIP: Rewrite NVFP4 fused router in CuTeDSL with MmaMXF4NVF4Op (sf_vec_size=16)
Uses kind::mxf4nvf4 — native NVF4 with E2M1 microscales, 16-elem blocks.
NO MXFP4, NO CONVERSIONS.

Kernel incomplete — GEMM mainloop mirrors dense.py but epilogue is TODO.
Need to verify CuTeDSL compilation works with proper PipelineTmaUmma/
PipelineUmmaAsync abstractions before adding top-k epilogue.
2026-06-01 07:53:21 +00:00
4f706b55d7 Remove raw CUDA C++ fused router and DeepGEMM (MXFP4, wrong instruction)
DeepGEMM uses kind::mxf4.block_scale.block32 (MXFP4, UE8M0 scales, 32-elem blocks).
DSV4 uses NVF4: kind::mxf4nvf4 (E2M1 microscales, 16-elem blocks).
Using MXFP4 would require E2M1->UE8M0 conversion. NO CONVERSIONS.

Rewriting fused router in CuTeDSL with MmaMXF4NVF4Op (sf_vec_size=16).
2026-06-01 07:51:31 +00:00
424fe6bf2c Fix: use SM100_MMA_MXF8F6F4_SS (not MXF4) to match Nvfp4Linear path
MXF4 has .block32 hardcoded. MXF8F6F4 matches what CuTeDSL generates
via make_instr_desc_block_scaled. Both use E2M1 data + UE8M0 scales
at hardware level. NVFP4 E2M1 microscales are combined into UE8M0
during quantization — no MXFP4 conversion.
2026-06-01 07:44:53 +00:00
2e2caadf7d WIP: NVFP4 fused router kernel in raw CUDA C++ using DeepGEMM primitives
- nvfp4_fused_router_kernel.cuh: 1-CTA NVFP4 GEMM + sqrt(softplus) + top-k epilogue
- Uses DeepGEMM SM100 primitives: SM100_MMA_MXF4_SS, UTCCP, UMMA descriptors
- 4 warp roles: TMA load, UTCCP transpose, MMA issue, epilogue
- nvfp4_fused_router_cuda.py: Python wrapper (TMA descriptor setup TBD)

NOT YET COMPILING - needs:
1. SMEM layout fix (single extern __shared__)
2. TMA descriptor creation (cuTensorMapEncodeTiled)
3. Top-k cross-warp merge completion
4. FP4 tensor format alignment with DeepGEMM
2026-06-01 07:41:42 +00:00
e3ea609ddd Embed DeepGEMM source (not submodule) for SM100 raw CUDA GEMM primitives 2026-06-01 07:39:40 +00:00
dae83723a3 Add DeepGEMM as third-party dependency for SM100 raw CUDA GEMM primitives 2026-06-01 07:39:38 +00:00
ef4c0ad489 Fix BF16 router mma_tiler: use cutlass.Int32 for CuTe DSL compatibility 2026-06-01 07:29:30 +00:00
79be9cb8da Fix: hardcode mma_inst_shape_k=32 for NVFP4 (avoids MLIR unpack error in JIT) 2026-06-01 07:20:23 +00:00
c3a64ceed7 Fix: mma_tiler must use CuTe Ints for static layout construction 2026-06-01 07:19:15 +00:00
39b481e52b Ensure mma_tiler contains CuTe Ints for cute.slice_ compatibility 2026-06-01 07:16:47 +00:00
57cc20d5ad Fix SFA/SFB SMEM: blockscaled layouts are plain Layout (no .outer/.inner swizzle) 2026-06-01 07:14:45 +00:00
fcd7680583 Fix CuTe tensor creation: use from_dlpack + mark_layout_dynamic 2026-06-01 07:12:52 +00:00
3a8c6daeb3 Fix: cutlass_torch.make_tensor -> as_tensor 2026-06-01 07:11:43 +00:00
0553117af6 Simplify fused router test: compare fused vs 2-kernel NVFP4 path 2026-06-01 07:10:55 +00:00
44a0e59808 Fix fused router test: use quantize_weight_to_nvfp4 (correct function name) 2026-06-01 07:08:56 +00:00
940f37fb6c NVFP4 fused router kernel: full rewrite with proper block-scaled GEMM setup
Major fixes:
- Added tiled_mma_sfb creation (always CtaGroup.ONE, rounded N)
- Added mma_tiler_sfb, cta_tile_shape_mnk_sfb, cluster_layout_sfb_vmnk
- Use blockscaled_utils.make_smem_layout_sfa/sfb (with sf_vec_size)
  instead of sm100_utils (which doesn't support block-scaled SF layouts)
- Proper TMEM column accounting for SFA + SFB + accumulator
- Fixed make_blockscaled_trivial_tiled_mma argument order
  (a_dtype, b_dtype, a_major, b_major, sf_dtype, sf_vec_size, cta_group, mma_inst_shape)
- Fixed SFB TMA atom to use tiled_mma_sfb and cluster_layout_sfb_vmnk
- Fixed SFB partition_SFB to use tiled_mma_sfb.get_slice
- Fixed SFB global tile partitioning to use mma_tiler_sfb
- Fixed mainloop_s2t_copy_and_partition to use TMEM fragments
  (make_fragment_SFA/SFB) as the tSF parameter
- Updated run_nvfp4_fused_router wrapper to accept processed weight
  tensors from Nvfp4Linear._mat_b and _scale_b
- Updated test to properly build Nvfp4Linear and use processed weights

The old code was a rough sketch that never worked — it was missing
the entire tiled_mma_sfb infrastructure, used wrong SMEM layout
functions, and had broken TMA atom setup for scale factors.
v-nvfp4-fused-router-rewrite-20260601-0715
2026-06-01 07:08:12 +00:00
8658c8eca5 fix: add sf_vec_size parameter back to Nvfp4FusedRouterKernel __init__ 2026-06-01 07:01:02 +00:00
b97f30e289 fix: store sf_vec_size as instance variable 2026-06-01 06:56:33 +00:00
c225d195ea fix: remove tcgen05.mma.Kind (doesn't exist), use make_blockscaled_trivial_tiled_mma 2026-06-01 06:54:49 +00:00
e6803b450d rewrite: simplified fused router test (reference + import check) 2026-06-01 06:53:17 +00:00
262cec262d fix: add shape assertions to fused router test 2026-06-01 06:51:47 +00:00
db07d17a62 fix: set activation global scale in fused router test 2026-06-01 06:50:41 +00:00
2abb4a19d9 fix: set gs and ws2 fields for Nvfp4Linear in fused router test 2026-06-01 06:49:43 +00:00
61c04f7152 fix: Nvfp4Linear field is sf not scale_b 2026-06-01 06:48:39 +00:00
982f245c67 fix: use correct Nvfp4Linear field names (fp4, scale_b, gsb) 2026-06-01 06:47:15 +00:00
16af96380f fix: use internal fields for Nvfp4Linear weight setup in test 2026-06-01 06:46:05 +00:00
7f1f224c78 fix: quantize_weight_to_nvfp4 returns 3 values, not 4 2026-06-01 06:43:53 +00:00
27fd847dd0 fix: correct quantize function name in fused router test 2026-06-01 06:41:54 +00:00
0873d65253 test: add fused router kernel test
Compares NVFP4 fused CuTeDSL kernel against reference
(Nvfp4Linear + activation_topk) for correctness.
2026-06-01 06:40:46 +00:00
90b2581dfe feat: NVFP4 fused router CuTeDSL kernel (WIP)
Single-kernel NVFP4 block-scaled GEMM + fused sqrt(softplus) + top-k
epilogue. Avoids materializing intermediate FP32 logits to GMEM.

Architecture: 6-warp specialization
- Warp 5 (TMA): Load A, B, SFA, SFB from GMEM → SMEM
- Warp 4 (MMA): NVFP4 block-scaled GEMM → FP32 accumulator in TMEM
- Warps 0-3 (EPI): TMEM → registers → sqrt(softplus) + bias + top-k → GMEM

Epilogue maintains per-thread min-heap across N subtiles, then
merges all 128 threads' heaps in SMEM for final top-k selection.

Mirrors Sm100BlockScaledPersistentDenseGemmKernel structure for
TMA/MMA/SFA/SFB handling, with custom top-k epilogue replacing
the standard SwiGLU + TMA store path.

NOTE: This is WIP — needs compilation testing on B200. Several
API details (tiled_mma_sfb, cluster_layout_sfb_vmnk) need to
be passed through the kernel parameters properly.
2026-06-01 06:40:21 +00:00
6c28c57b6a feat: Nvfp4GroupedLinear for o_a_proj (replaces BF16 grouped BMM)
The attention output projection first half (wo_a) was using BF16
grouped BMM (torch.bmm). Now uses production Nvfp4GroupedLinear
which performs the same grouped GEMM with NVFP4 tensor-core
acceleration on Blackwell.

The weight is loaded from NVFP4 checkpoint if available, otherwise
quantized from BF16 via set_bf16_weight().

Also includes:
- NVFP4 gate projection for router (from previous commit)
- Compressor position_bias in CUDA kernel (from earlier fix)
v-nvfp4-router-oa-20260601-0610
2026-06-01 06:00:36 +00:00
cf2b7ab7ec feat: NVFP4 gate projection for router (replaces BF16 cuBLAS)
The dense router now uses NVFP4 GEMM via Nvfp4Linear for the gate
projection when NVFP4 scales are available in the checkpoint. This
replaces the BF16 cuBLAS GEMM with Blackwell SM100 tensor-core
NVFP4 acceleration.

Changes:
- dsv4/layers/router.py: add gate_lin (Nvfp4Linear) alongside W_gate
  fallback. New load_nvfp4_gate() method.
- dsv4/kernels/router/dense_router_decode.py: add
  dense_router_dispatch_nvfp4() using Nvfp4Linear + activation_topk
- dsv4/kernels/router/__init__.py: export new function
- single_shot_inference.py: load NVFP4 gate weights when available,
  fall back to BF16 when not
2026-06-01 05:58:56 +00:00
9f14cb17d1 test: add compressor position_bias unit test
Verifies CUDA kernel matches PyTorch reference with and without
position_bias for both CSA (m=4) and HCA (m=128) paths.
2026-06-01 05:55:05 +00:00
84ca520bfb fix: move compressor position_bias into CUDA kernel (was Python loop)
The compressor_reduce.cu kernel now adds position_bias to BOTH kv and
gate values, matching the PyTorch reference. Previously the kernel only
added it to gate, and a Python workaround loop was adding it to both
before the kernel call (then passing None to the kernel).

Changes:
- compressor_reduce.cu: add position_bias to kv_val in pass 2 (CSA + HCA)
- single_shot_inference.py: remove Python position_bias loop, pass
  self.ape directly to csa/hca_compress_production
- production_compress.py: already supports position_bias passthrough
2026-06-01 05:54:44 +00:00
311fae490f tune: reduce verbose diagnostics, print every decode step v-e2e-paris-32tok-20260601-0549 2026-06-01 05:40:48 +00:00
df8acae66b fix: rewrite compressor_reduce.cu — no extern shared mem, proper bounds checks v-single-shot-paris-20260601-0539 2026-06-01 05:24:18 +00:00
62041b78bf fix: import torch.utils.cpp_extension explicitly in production_compress 2026-06-01 05:20:44 +00:00
2155fd6c90 test: production compressor kernel unit test 2026-06-01 05:19:13 +00:00
b380028c49 feat: production compressor/indexer — NVFP4 GEMM + CUDA softmax/reduce kernel
- New compressor_reduce.cu: CSA/HCA token-level softmax + weighted sum + kv_norm
  One block per compressed entry, 128 threads, FP32 accumulation
  CSA: overlapping Ca/Cb streams (2m tokens per block)
  HCA: single stream (m tokens per block)
  Includes apply_kv_norm kernel (unweighted RMSNorm + weight)

- New production_compress.py: Python wrapper for CUDA kernels

- single_shot_inference.py: Compressor/Indexer now use production Nvfp4Linear
  for kv_proj, gate_proj, q_b_proj, weights_proj projections
  Then CUDA reduce kernel for softmax + weighted sum
  No more PyTorch reference nvfp4_linear_ref in compressor/indexer path
2026-06-01 05:18:59 +00:00
6e53e3007c fix: clamp block_amax to E4M3 max (448) in quantize_activation_nvfp4 — prevents NaN from overflow v-working-e2e-20260601-0515 2026-06-01 04:59:06 +00:00
eb9c46f8cb test: quantize on different GPUs 2026-06-01 04:48:30 +00:00
9ce7304783 test: direct SE L1 test on different GPUs 2026-06-01 04:43:48 +00:00
ce608d0e50 test: fix gemm 1-group test params 2026-06-01 04:40:07 +00:00