nvfp4-megamoe-kernel

Files

biondizzle 5290c91c35 fix quantize_nvfp4 kernel: use proven single-thread-per-CTA pattern from deinterleave_quantize.cu

The warp shuffle approach failed because __shfl_down_sync with 16 threads
has undefined behavior for the odd nibble. Use the same pattern as the
working deinterleave_quantize.cu: 1 CTA per 16-element block, 16 threads
per CTA, each thread reads all 16 elements sequentially and computes
amax + quantize + pack.
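
For reference, the per-block sequence named above (amax, quantize, pack) can be sketched on the host side. This is a minimal Python model of one 16-element block, not the kernel from this commit. It rests on several assumptions: the names quantize_block and dequantize_block are hypothetical, the block scale is kept as a plain float rather than rounded to FP8, ties resolve to the smaller magnitude rather than round-to-nearest-even, and the even-indexed element is assumed to go in the low nibble of each byte.

```python
# Host-side reference sketch of per-block FP4 (E2M1) quantization.
# Hypothetical helper names; not the code from deinterleave_quantize.cu.

# Magnitudes representable by E2M1, indexed by the 3-bit code.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements per scaled block


def quantize_block(xs):
    """Return (scale, packed) for one 16-element block."""
    assert len(xs) == BLOCK
    # Step 1: absolute maximum over the block.
    amax = max(abs(x) for x in xs)
    # Map amax onto the largest E2M1 magnitude (6.0).
    # Assumption: scale stays a Python float (no FP8 rounding).
    scale = amax / 6.0 if amax > 0 else 1.0
    # Step 2: quantize each element to the nearest E2M1 magnitude.
    codes = []
    for x in xs:
        mag = abs(x) / scale
        # Ties go to the smaller magnitude here (a simplification).
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        sign = 0x8 if x < 0 else 0x0  # bit 3 is the sign bit
        codes.append(sign | idx)
    # Step 3: pack two 4-bit codes per byte.
    # Assumption: even element in the low nibble, odd in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return scale, packed


def dequantize_block(scale, packed):
    """Inverse of quantize_block, for checking round trips."""
    out = []
    for b in packed:
        for code in (b & 0xF, b >> 4):
            v = E2M1_VALUES[code & 0x7] * scale
            out.append(-v if code & 0x8 else v)
    return out
```

Because every element is handled in a plain sequential loop, no cross-lane communication is modeled, so the odd-nibble handoff that the shuffle version depended on does not arise in this sketch.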

2026-05-25 16:21:44 +00:00

attention

Revert D2 multi-CTA attempts - keeping per-head launch approach (works correctly)

2026-05-25 01:08:38 +00:00

cache

KV Cache: schema, allocator, pools, manager, append_swa kernel

2026-05-22 00:08:38 +00:00

compressor

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

cuda

fix quantize_nvfp4 kernel: use proven single-thread-per-CTA pattern from deinterleave_quantize.cu