nvfp4-megamoe-kernel

Files

biondizzle df6220abaf E5: Fold batch loop into native kernel grid (blockIdx.z)

The 6-warp multi-tile kernel already supports batching natively via
dim3 grid(1, n_h, batch), so the Python for-loop for 4D input was
removed. Batched decode now needs a single kernel launch per layer
instead of batch_size launches.

T>1 prefill still uses per-batch dispatch (E8 future work).
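
The change described above can be pictured with a small pure-Python emulation of the launch grid. This is only a sketch under assumptions, not the repository's code: `launch`, `decode_block`, `run_per_batch`, and `run_folded` are hypothetical names, the per-block work is a placeholder multiply rather than attention, and the grid is walked serially instead of running on a GPU. It shows only how the third grid dimension (blockIdx.z) can stand in for the Python batch loop.

```python
from itertools import product

def launch(grid, kernel, *args):
    """Emulate one kernel launch: call `kernel` once per block index."""
    gx, gy, gz = grid
    for bz, by, bx in product(range(gz), range(gy), range(gx)):
        kernel((bx, by, bz), *args)

def decode_block(block_idx, q, out, batch_offset):
    """Hypothetical per-block body: blockIdx.y picks the head, blockIdx.z the batch."""
    _bx, head, bz = block_idx
    b = batch_offset + bz
    out[b][head] = 2.0 * q[b][head]  # placeholder for the real attention work

def run_per_batch(q):
    """Old shape: a Python loop issues one launch per batch element."""
    batch, n_h = len(q), len(q[0])
    out = [[None] * n_h for _ in range(batch)]
    launches = 0
    for b in range(batch):
        launch((1, n_h, 1), decode_block, q, out, b)
        launches += 1
    return out, launches

def run_folded(q):
    """New shape: one launch with grid (1, n_h, batch); z indexes the batch."""
    batch, n_h = len(q), len(q[0])
    out = [[None] * n_h for _ in range(batch)]
    launch((1, n_h, batch), decode_block, q, out, 0)
    return out, 1
```

Both paths write the same output; only the launch count differs (batch_size launches versus one).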

2026-05-30 21:21:02 +00:00

attention

E5: Fold batch loop into native kernel grid (blockIdx.z)

2026-05-30 21:21:02 +00:00

cache

fix: correct gather.py kernel_dir path

2026-05-30 21:12:09 +00:00

compressor

Wire indexer compute_index_scores_topk + fix compressor imports

2026-05-30 21:19:06 +00:00

cuda

fix: extern declarations for gather_swa functions in gather_kv.cu