[Spec Decode] Add Batch Parallel Ngram. Up to 8x lower overhead. (#24986)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Ekagra Ranjan
2025-09-25 18:22:03 -04:00
committed by GitHub
parent 89fa54e6f7
commit e71b8e210d
5 changed files with 383 additions and 109 deletions

@@ -17,7 +17,7 @@ PLACEHOLDER_TOKEN_ID: tl.constexpr = -1
GREEDY_TEMPERATURE: tl.constexpr = -1
# Maximum number of speculative draft tokens allowed per request in a single
# step. This value is chosen to be large enough to handle typical use cases.
-MAX_SPEC_LEN = 32
+MAX_SPEC_LEN = 128
class RejectionSampler(nn.Module):