nvfp4-megamoe-kernel

Files

biondizzle 5290c91c35 fix quantize_nvfp4 kernel: use proven single-thread-per-CTA pattern from deinterleave_quantize.cu

The warp-shuffle approach failed because __shfl_down_sync with 16 threads
has undefined behavior for the odd nibble. Use the same pattern as the
working deinterleave_quantize.cu instead: one CTA per 16-element block,
16 threads per CTA, with each thread reading all 16 elements sequentially
and then computing amax, quantizing, and packing.

2026-05-25 16:21:44 +00:00
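
The per-block steps named in the commit message above (amax, quantize, pack over one 16-element block) can be illustrated with a small host-side reference. This is a minimal sketch in plain Python, not the CUDA kernel from the repository. It makes several assumptions: the function names quantize_block and dequantize_block are hypothetical, the block scale is kept as a Python float rather than being encoded as FP8, rounding goes to the nearest representable magnitude with ties to the lower code, and the even element of each pair is packed into the low nibble. The actual kernel may differ on any of these points.

```python
# Magnitudes representable by a 4-bit E2M1 code, indexed by the low 3 bits.
# Bit 3 of the code is the sign bit.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements per quantization block


def quantize_block(block):
    """Quantize one 16-element block: amax -> scale -> 4-bit codes -> packed bytes."""
    assert len(block) == BLOCK
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    # Assumption: the scale stays a plain float here (no FP8 encoding).
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        # Nearest representable magnitude; ties resolve to the lower code
        # (an assumption, a real kernel may round differently).
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        sign = 0x8 if x < 0 else 0x0
        codes.append(sign | idx)
    # Assumed packing order: even element in the low nibble, odd in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return packed, scale


def dequantize_block(packed, scale):
    """Inverse of quantize_block, used to check the round trip."""
    out = []
    for byte in packed:
        for code in (byte & 0xF, byte >> 4):
            mag = E2M1_VALUES[code & 0x7]
            out.append((-mag if code & 0x8 else mag) * scale)
    return out


if __name__ == "__main__":
    block = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
             -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0, 0.0]
    packed, scale = quantize_block(block)
    print(packed.hex(), scale)
    print(dequantize_block(packed, scale) == block)
```

Because every step depends only on the 16 values of one block, the work for a block is self-contained, which is consistent with the commit's choice to have a thread read the whole block sequentially instead of exchanging partial results between lanes.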

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

_hash_router.py

Router: full kernel stack — hash, topk, activation+topk, dense decode/prefill

2026-05-21 21:54:05 +00:00

activation_topk.cu

Router: full kernel stack — hash, topk, activation+topk, dense decode/prefill

2026-05-21 21:54:05 +00:00

append_swa.cu

KV Cache: schema, allocator, pools, manager, append_swa kernel