nvfp4-megamoe-kernel

Files

biondizzle 300dddedc0 E1-E4: gather kernels, handle wiring, rope, sync removal, e2e test

E1: LayerCacheHandle now exposes gather_compressed_kv,
    gather_all_compressed_kv, gather_swa_kv, num_query_heads, head_dim.
    Gather kernels in dsv4/kernels/cuda/gather_swa.cu + gather_kv.cu.
    Python wrapper in dsv4/kernels/cache/gather.py.

E2: tests/e2e/test_one_layer.py — SWA path smoke test.

E3: Compressor/indexer __init__.py bridges (NotImplementedError stubs
    for CSA/HCA compress_and_store, compute_index_scores_topk).

E4: Removed torch.cuda.synchronize() from fmha_multitile_op.py fast path.
    Error checking via C API return code instead.

Also: forward_rope_partial in ops/rope.py (GPT-J interleaved, last 64 dims).
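
For reference, the "GPT-J interleaved, last 64 dims" convention named above can be sketched in plain Python as follows. This is only an illustration of the rotation and is not the code in ops/rope.py: the function name rope_partial_interleaved, its signature, and the frequency base of 10000 are assumptions, and the real forward_rope_partial presumably operates on batched tensors rather than a single list.

```python
import math

def rope_partial_interleaved(x, pos, rope_dim=64, base=10000.0):
    """Rotate the last `rope_dim` entries of one head vector `x` at position `pos`.

    Hypothetical helper: GPT-J style pairs adjacent elements (2i, 2i+1),
    unlike the NeoX style that pairs element i with element i + rope_dim/2.
    `base` is an assumed default, not taken from the repository.
    """
    if rope_dim % 2 != 0 or rope_dim > len(x):
        raise ValueError("rope_dim must be even and no larger than len(x)")
    out = list(x)                       # leading dims pass through unchanged
    start = len(x) - rope_dim           # first rotated index
    for i in range(rope_dim // 2):
        theta = pos * base ** (-2.0 * i / rope_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[start + 2 * i], x[start + 2 * i + 1]
        out[start + 2 * i] = a * c - b * s
        out[start + 2 * i + 1] = a * s + b * c
    return out
```

With a head_dim of 128, only indices 64 to 127 are rotated, and the vector norm is preserved because each pair undergoes a pure 2D rotation.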

2026-05-30 21:10:26 +00:00

__init__.py

Restructure: cutedsl/ -> dsv4/ with proper layering

2026-05-21 17:30:44 +00:00

custom_ops.py

Stage E: head-packed MQA/GQA, batch dim, custom_op, integration API

2026-05-27 15:15:03 +00:00

gemm_runner.py

NVFP4-3: add use_2cta_instrs conditional to gemm_runner

2026-05-25 16:42:02 +00:00

layouts.py

Restructure: cutedsl/ -> dsv4/ with proper layering