nvfp4-megamoe-kernel/vllm
biondizzle ea8acf9852 Share padded_x_sf and output buffers across layers to save ~300 MB
Per-layer: (padded_x_sf (2.4 MB) + output_buf (4.2 MB)) × 60 layers = ~400 MB.
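Sharing one copy of each buffer across all layers reduces this to ~6.6 MB
total (2.4 MB + 4.2 MB). This is safe because layers run sequentially during
both capture and replay, so no two layers use the buffers at the same time.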
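
The sharing idea can be sketched as follows. This is an illustration only, not the code in this commit: `SharedBufferPool` and `Layer` are hypothetical names, `bytearray` stands in for device tensors, and the sizes are the per-layer figures quoted above, read as decimal megabytes.

```python
# Per-layer sizes quoted in the commit message, read as decimal MB (assumption).
PADDED_X_SF_BYTES = 2_400_000
OUTPUT_BUF_BYTES = 4_200_000
NUM_LAYERS = 60


class SharedBufferPool:
    """Hypothetical pool: one buffer per (name, size), reused by every caller."""

    def __init__(self):
        self._buffers = {}

    def get(self, name, nbytes):
        key = (name, nbytes)
        if key not in self._buffers:
            # Allocated on first request only; later requests get the same object.
            self._buffers[key] = bytearray(nbytes)
        return self._buffers[key]

    def total_bytes(self):
        return sum(len(buf) for buf in self._buffers.values())


class Layer:
    """Hypothetical layer that borrows scratch buffers instead of owning them."""

    def __init__(self, pool):
        self.padded_x_sf = pool.get("padded_x_sf", PADDED_X_SF_BYTES)
        self.output_buf = pool.get("output_buf", OUTPUT_BUF_BYTES)


pool = SharedBufferPool()
layers = [Layer(pool) for _ in range(NUM_LAYERS)]

# Memory if every layer owned its own pair of buffers, versus the shared pair.
per_layer_total = (PADDED_X_SF_BYTES + OUTPUT_BUF_BYTES) * NUM_LAYERS
shared_total = pool.total_bytes()

print(f"per-layer buffers: {per_layer_total / 1e6:.1f} MB")
print(f"shared buffers:    {shared_total / 1e6:.1f} MB")
```

Every layer ends up holding a reference to the same two buffers, so each layer overwrites what the previous one left behind. That is only valid when layers execute one after another, which is the condition stated above.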
2026-05-17 16:05:53 +00:00