nvfp4-megamoe-kernel/cutedsl
biondizzle 2f68c7ba77 fix: cache E2M1 step_to_idx LUT per device (no CPU->CUDA copy in forward)
torch.tensor() and new_tensor() both trigger CPU->CUDA copies during
cudagraph capture. Pre-cache the LUT on first use per device.
2026-05-16 18:48:31 +00:00
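
The per-device caching described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `_LUT_CACHE`, `get_cached_lut`, and the `build` callback are hypothetical names, and the real contents of the `step_to_idx` table are not shown on this page. The sketch is kept framework-free so it runs on its own; in a PyTorch setting, `build` would presumably be something like `lambda dev: torch.tensor(values, device=dev)`.

```python
from typing import Callable, Dict, Hashable, TypeVar

T = TypeVar("T")

# Hypothetical module-level cache: one lookup table per device key
# (for example "cuda:0"). The name is an assumption for this sketch.
_LUT_CACHE: Dict[Hashable, object] = {}


def get_cached_lut(device: Hashable, build: Callable[[Hashable], T]) -> T:
    """Return the LUT for `device`, calling `build` only on first use.

    `build` is the one place where a host-to-device copy would occur,
    so later calls for the same device reuse the stored object.
    """
    lut = _LUT_CACHE.get(device)
    if lut is None:
        lut = build(device)       # runs once per device
        _LUT_CACHE[device] = lut  # later lookups skip the copy
    return lut
```

Under this scheme the first call for each device still performs the copy, so it would need to happen in a warm-up pass before graph capture begins; every later forward call only does a dictionary lookup.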