nvfp4-megamoe-kernel

Files

biondizzle faf92b30ad E1: Wire LayerCacheHandle gather methods + CUDA gather kernels

- gather_compressed_kv: CSA top-k gather via existing gather_kv.cu
- gather_all_compressed_kv: HCA dense gather via new gather_all_compressed_kernel
- gather_swa_kv: SWA ring buffer gather via new gather_swa_kernel
- Added gather_swa.cu with both SWA + all-compressed gather kernels
- Added gather.py Python wrapper (torch.utils.cpp_extension JIT)
- Updated handle.py: added schema field, num_query_heads/head_dim properties
- Updated manager.py: passes schema + num_query_heads to handle

All gather kernels perform FP8→BF16 dequantization and BF16 RoPE concatenation in a single launch.
Output: dense BF16 tensors ready for FMHA consumption.
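
The SWA gather described in this commit can be illustrated with a small CPU reference. This is only a sketch under stated assumptions, not the code in gather_swa.cu: it assumes the ring buffer stores token position p in slot p % window, that the non-RoPE part of each entry is kept quantized with one scale per slot (standing in for FP8), and that the RoPE part is kept unquantized (standing in for BF16). The names swa_slot_order and gather_swa_reference are hypothetical and do not appear in the repository.

```python
from typing import List, Sequence


def swa_slot_order(seq_len: int, window: int) -> List[int]:
    """Ring-buffer slots of the last min(seq_len, window) tokens, oldest first.

    Assumption: token position p was written to slot p % window.
    """
    count = min(seq_len, window)
    start = seq_len - count
    return [pos % window for pos in range(start, seq_len)]


def gather_swa_reference(
    q_nope: Sequence[Sequence[float]],  # quantized non-RoPE values, one row per slot
    scales: Sequence[float],            # hypothetical per-slot dequantization scale
    rope: Sequence[Sequence[float]],    # unquantized RoPE values, one row per slot
    seq_len: int,
) -> List[List[float]]:
    """Gather the window in chronological order, dequantize, and concatenate."""
    window = len(q_nope)
    out = []
    for slot in swa_slot_order(seq_len, window):
        # Dequantize the non-RoPE part with this slot's scale.
        dequant = [v * scales[slot] for v in q_nope[slot]]
        # Append the RoPE part so each output row is dense and contiguous.
        out.append(dequant + list(rope[slot]))
    return out


if __name__ == "__main__":
    # Window of 4 slots after 6 tokens: positions 2..5 live in slots 2, 3, 0, 1.
    print(swa_slot_order(seq_len=6, window=4))
```

A GPU kernel would typically assign one thread block or warp per output row and do the same index computation, dequantization, and concatenation in one pass, which is what lets the output go straight to attention without an intermediate buffer.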

2026-05-30 21:09:21 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

allocator.py

KV Cache: schema, allocator, pools, manager, append_swa kernel

2026-05-22 00:08:38 +00:00

block_table.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

flush.py

Flush compressor: schema fix, prepare_forward, flush_write kernels, state rotation