nvfp4-megamoe-kernel

Files

biondizzle 24fed15ed6 Fix: convert PyTorch tensors to CuTe tensors for fused router kernel

- Added cutlass_torch.from_dlpack() + mark_layout_dynamic() conversions
- quantize_activation_nvfp4 returns (fp4_packed, fp8_scales), which are
  converted to CuTe tensors before being passed to the kernel
- Same pattern as gemm_runner.py

2026-06-01 10:02:40 +00:00

attention

FMHA sink: don't double-scale sink bias

2026-05-31 23:12:20 +00:00

cache

fix: correct gather.py kernel_dir path

2026-05-30 21:12:09 +00:00

compressor

fix: import torch.utils.cpp_extension explicitly in production_compress

2026-06-01 05:20:44 +00:00

cuda

fix: move compressor position_bias into CUDA kernel (was Python loop)

2026-06-01 05:54:44 +00:00

gemm

NVFP4-1.1: Mark fp4_quant.py as toolchain-blocked, clean up test files

2026-05-28 04:59:01 +00:00

indexer

Wire indexer compute_index_scores_topk + fix compressor imports

2026-05-30 21:19:06 +00:00

router

Fix: convert PyTorch tensors to CuTe tensors for fused router kernel

2026-06-01 10:02:40 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00