nvfp4-megamoe-kernel

Files

biondizzle 84ca520bfb fix: move compressor position_bias into CUDA kernel (was Python loop)

The compressor_reduce.cu kernel now adds position_bias to BOTH kv and
gate values, matching the PyTorch reference. Previously the kernel added
it only to gate, so a Python workaround loop added it to both before the
kernel call and passed None to the kernel.

Changes:
- compressor_reduce.cu: add position_bias to kv_val in pass 2 (CSA + HCA)
- single_shot_inference.py: remove Python position_bias loop, pass
  self.ape directly to csa/hca_compress_production
- production_compress.py: unchanged; it already supports position_bias passthrough
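
The bias placement described above can be illustrated with a small pure-Python sketch. This is not the kernel or the PyTorch reference: the function name compress_reference, the [window][dim] list layout, and the reduction itself (a per-channel softmax over gate followed by a weighted sum of kv) are assumptions made for illustration. Only the rule that position_bias is added to both kv and gate comes from the commit message.

```python
import math


def compress_reference(kv, gate, position_bias=None):
    """Hypothetical gated reduction over one window of positions.

    kv, gate, position_bias: lists of rows, shape [window][dim].
    ASSUMPTION: the reduction is a per-channel softmax over the window
    (computed from gate) followed by a weighted sum of kv. Only the bias
    placement (added to BOTH kv and gate) is taken from the commit.
    """
    window, dim = len(kv), len(kv[0])
    if position_bias is not None:
        # Fused path: the bias reaches both operands inside the reduction.
        kv = [[kv[t][d] + position_bias[t][d] for d in range(dim)]
              for t in range(window)]
        gate = [[gate[t][d] + position_bias[t][d] for d in range(dim)]
                for t in range(window)]
    out = []
    for d in range(dim):
        # Numerically stable softmax over the window for channel d.
        peak = max(gate[t][d] for t in range(window))
        weights = [math.exp(gate[t][d] - peak) for t in range(window)]
        total = sum(weights)
        out.append(sum(weights[t] * kv[t][d] for t in range(window)) / total)
    return out


def add_bias(rows, bias):
    """Element-wise rows + bias, standing in for the removed Python loop."""
    return [[r + b for r, b in zip(row, bias_row)]
            for row, bias_row in zip(rows, bias)]


# Made-up example values: window of 2 positions, 2 channels.
kv = [[1.0, 2.0], [3.0, 4.0]]
gate = [[0.0, 1.0], [1.0, 0.0]]
bias = [[0.5, -0.5], [0.0, 1.0]]

# New behaviour: pass the bias straight through.
fused = compress_reference(kv, gate, bias)

# Old workaround: pre-add the bias to both inputs, then pass None.
workaround = compress_reference(add_bias(kv, bias), add_bias(gate, bias), None)

# Old kernel behaviour on its own: the bias reaches gate only.
gate_only = compress_reference(kv, add_bias(gate, bias), None)
```

Under these assumptions, fused and workaround agree, while gate_only differs because kv never receives the bias. That is the mismatch with the reference that the commit describes.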

2026-06-01 05:54:44 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

_hash_router.py

Router: full kernel stack — hash, topk, activation+topk, dense decode/prefill

2026-05-21 21:54:05 +00:00

activation_topk.cu

Router: full kernel stack — hash, topk, activation+topk, dense decode/prefill

2026-05-21 21:54:05 +00:00

append_swa.cu

KV Cache: schema, allocator, pools, manager, append_swa kernel