nvfp4-megamoe-kernel/dsv4/kernels at 2dc5b4ec195d2d116b8b16a5f2ff4e11d87f1baa

biondizzle/nvfp4-megamoe-kernel

biondizzle 2dc5b4ec19 Fix sampler kernel stack overflow: reduce MAX_K from 256 to 128

128 * (sizeof(float) + sizeof(int)) = 1 KB, which fits within the default CUDA stack limit.
256 * 8 bytes = 2 KB would overflow it.
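
The arithmetic in this commit message can be checked with a short sketch. It assumes the sampler keeps one `float` score and one `int` index per top-k candidate in per-thread local arrays, as the `sizeof` expression suggests, and it takes the 1 KB budget from the message itself. The names `BYTES_PER_ENTRY`, `DEFAULT_STACK_BYTES`, and `topk_stack_bytes` are hypothetical and do not come from the repository.

```python
import ctypes

# Assumption: each candidate costs one float (score) plus one int (index),
# mirroring sizeof(float) + sizeof(int) in the commit message.
BYTES_PER_ENTRY = ctypes.sizeof(ctypes.c_float) + ctypes.sizeof(ctypes.c_int)

# Assumption: the per-thread stack budget is the 1 KB figure cited in the
# commit message; the real limit depends on the device and runtime settings.
DEFAULT_STACK_BYTES = 1024


def topk_stack_bytes(max_k: int) -> int:
    """Bytes of per-thread storage needed for max_k (score, index) pairs."""
    return max_k * BYTES_PER_ENTRY


for max_k in (128, 256):
    used = topk_stack_bytes(max_k)
    verdict = "fits" if used <= DEFAULT_STACK_BYTES else "overflows"
    print(f"MAX_K={max_k}: {used} bytes, {verdict}")
```

In this simple model the 128-entry arrays consume the whole 1 KB budget, so it does not account for any other per-thread locals the kernel may place on the stack.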

2026-06-01 20:42:53 +00:00

FMHA sink: don't double-scale sink bias

2026-05-31 23:12:20 +00:00

fix: correct gather.py kernel_dir path

2026-05-30 21:12:09 +00:00

fix: import torch.utils.cpp_extension explicitly in production_compress

2026-06-01 05:20:44 +00:00

Fix sampler kernel stack overflow: reduce MAX_K from 256 to 128

2026-06-01 20:42:53 +00:00

NVFP4-1.1: Mark fp4_quant.py as toolchain-blocked, clean up test files

2026-05-28 04:59:01 +00:00

Wire indexer compute_index_scores_topk + fix compressor imports

2026-05-30 21:19:06 +00:00

Switch router to Nvfp4Linear production GEMM (custom CuTeDSL kernel crashes MLIR)

2026-06-01 11:17:54 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00