nvfp4-megamoe-kernel

Files

biondizzle df6220abaf E5: Fold batch loop into native kernel grid (blockIdx.z)

The 6-warp multi-tile kernel already supports batch natively via
dim3 grid(1, n_h, batch). Removed Python for-loop for 4D input.
Single kernel launch per layer for batched decode instead of
batch_size launches.

T>1 prefill still uses per-batch dispatch (E8 future work).
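
A minimal sketch of the dispatch change described above, emulated in pure Python rather than taken from the repository. The names `launch`, `head_kernel`, `run_per_batch`, `run_folded`, and `stats` are hypothetical, and the per-head "work" is a stand-in sum. It only illustrates how a grid of (1, n_h, batch) replaces batch_size separate launches by letting the z block index select the batch element.

```python
from itertools import product

# Counts emulated kernel launches (hypothetical bookkeeping, not in the repo).
stats = {"launches": 0}


def launch(grid, kernel, *args):
    """Emulate one CUDA launch: call `kernel` once per (x, y, z) block index."""
    stats["launches"] += 1
    gx, gy, gz = grid
    for bz, by, bx in product(range(gz), range(gy), range(gx)):
        kernel((bx, by, bz), *args)


def head_kernel(block_idx, x, out, batch_offset=0):
    """Stand-in kernel: blockIdx.y picks the head, blockIdx.z picks the batch."""
    _, h, b = block_idx
    b += batch_offset  # only non-zero in the old per-batch dispatch
    out[b][h] = sum(x[b][h])  # placeholder for the real per-head computation


def run_per_batch(x, n_h):
    """Old dispatch: a host-side loop issues one launch per batch element."""
    batch = len(x)
    out = [[0] * n_h for _ in range(batch)]
    for b in range(batch):
        launch((1, n_h, 1), head_kernel, x, out, b)
    return out


def run_folded(x, n_h):
    """New dispatch: a single launch whose grid z-dimension covers the batch."""
    batch = len(x)
    out = [[0] * n_h for _ in range(batch)]
    launch((1, n_h, batch), head_kernel, x, out)
    return out
```

Under these assumptions both paths produce the same output, but `run_folded` records one launch where `run_per_batch` records one per batch element.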

2026-05-30 21:21:02 +00:00

cache

E1: Wire LayerCacheHandle gather methods + CUDA gather kernels

2026-05-30 21:09:21 +00:00

kernels

E5: Fold batch loop into native kernel grid (blockIdx.z)

2026-05-30 21:21:02 +00:00

layers

E2/E3: compressor bridge, indexer bridge, flush pipeline wiring

2026-05-30 21:16:54 +00:00

loader

Restructure: cutedsl/ -> dsv4/ with proper layering