nvfp4-megamoe-kernel/dsv4
biondizzle df6220abaf E5: Fold batch loop into native kernel grid (blockIdx.z)
The 6-warp multi-tile kernel already supports batching natively via
dim3 grid(1, n_h, batch), with blockIdx.z selecting the batch element.
Removed the Python for-loop over the batch dimension of 4D input, so
batched decode now issues a single kernel launch per layer instead of
batch_size launches.
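
A minimal host-side sketch of the launch-count change, in plain Python with no GPU involved. The function names and the per-element grid shape (1, n_h, 1) for the old path are assumptions made for illustration and are not taken from the repository. Only the batched grid shape (1, n_h, batch) comes from the commit message.

```python
from typing import List, Set, Tuple

Grid = Tuple[int, int, int]  # (grid.x, grid.y, grid.z)


def launches_with_python_loop(batch: int, n_h: int) -> List[Grid]:
    """Assumed old path: one launch per batch element."""
    return [(1, n_h, 1) for _ in range(batch)]


def launches_with_batched_grid(batch: int, n_h: int) -> List[Grid]:
    """New path from the commit: a single launch, batch mapped to grid.z."""
    return [(1, n_h, batch)]


def total_blocks(launches: List[Grid]) -> int:
    """Total thread blocks scheduled across all launches."""
    return sum(x * y * z for x, y, z in launches)


def covered_pairs_loop(batch: int, n_h: int) -> Set[Tuple[int, int]]:
    """(batch, head) pairs covered when the host loop supplies the batch index."""
    pairs = set()
    for b in range(batch):      # host-side Python loop
        for h in range(n_h):    # blockIdx.y inside each launch
            pairs.add((b, h))
    return pairs


def covered_pairs_grid(batch: int, n_h: int) -> Set[Tuple[int, int]]:
    """(batch, head) pairs covered when blockIdx.z supplies the batch index."""
    _, gy, gz = launches_with_batched_grid(batch, n_h)[0]
    return {(z, y) for z in range(gz) for y in range(gy)}
```

Under these assumptions both paths schedule the same number of blocks and cover the same (batch, head) pairs. The difference is the number of launches: `batch` in the old path, one in the new.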

T>1 prefill still uses per-batch dispatch (E8 future work).
2026-05-30 21:21:02 +00:00