fix: use full padded_scales_buf (no GPU scalar slicing in cudagraph)

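Slicing with a GPU scalar (buf[:gpu_scalar, :]) forces a device-to-host read
of the scalar, which is not allowed during CUDA graph capture and triggers
cudaErrorStreamCaptureInvalidated.
Always use the full pre-allocated buffer instead; the extra rows are zeros.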
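
The idea behind the fix can be modelled without PyTorch or a GPU. The sketch below is a plain-Python stand-in, not the runner's code: `MAX_ROWS`, `COLS`, `fill_padded_scales` and the example row counts are hypothetical names and values chosen for illustration. It shows a fixed-capacity buffer being zeroed in full and filled at padded offsets, so no step needs a shape that depends on `total_padded_rows`.

```python
from itertools import accumulate

MAX_ROWS, COLS = 8, 2  # hypothetical capacity of the pre-allocated buffer


def fill_padded_scales(buf, scales_per_expert, padded_rows_per_expert):
    """Zero the whole fixed-size buffer, then write each expert's rows
    at that expert's padded offset. Returns the offsets list."""
    # Analogue of padded_scales.zero_(): clears every row, used or not.
    for row in buf:
        for c in range(len(row)):
            row[c] = 0.0
    # offsets[i] is the first padded row of expert i;
    # offsets[-1] plays the role of total_padded_rows.
    offsets = [0, *accumulate(padded_rows_per_expert)]
    for expert, rows in enumerate(scales_per_expert):
        for i, row in enumerate(rows):
            buf[offsets[expert] + i][:] = row
    return offsets


# Stale values left over from a previous step.
buf = [[9.0] * COLS for _ in range(MAX_ROWS)]
# Expert 0 has 1 real row, expert 1 has 2; padded to 2 and 4 rows (assumed).
scales = [[[1.0, 1.5]], [[2.0, 2.5], [3.0, 3.5]]]
offsets = fill_padded_scales(buf, scales, padded_rows_per_expert=[2, 4])
total_padded_rows = offsets[-1]

# Rows at or beyond total_padded_rows exist but hold only zeros.
tail_is_zero = all(v == 0.0 for row in buf[total_padded_rows:] for v in row)
```

In this model the buffer always has `MAX_ROWS` rows, and rows past `total_padded_rows` are simply unused zeros. The trade-off is that the zeroing step touches the whole buffer on every call rather than only the rows in use.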
2026-05-16 18:50:35 +00:00
parent 2f68c7ba77
commit 103fd451ce

@@ -162,10 +162,8 @@ class CuTeDSLMoERunner:
padded_expert_offsets.zero_()
padded_expert_offsets[1:] = padded_rows_per_expert.cumsum(0)
total_padded_rows = padded_expert_offsets[-1]
# Reset the padded scales buffer
-padded_scales = self._padded_scales_buf[:total_padded_rows, :padded_cols]
+# Use the FULL pre-allocated scales buffer (no GPU scalar slicing)
+padded_scales = self._padded_scales_buf
padded_scales.zero_()
# Build index mapping: for each row in x_sf, which expert does it belong to?