[Spec Decode] Reduce TP communication for speculative decoding draft token generation (#34049)

Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
qizixi
2026-02-22 14:59:16 -08:00
committed by GitHub
parent b7892a3bef
commit 2bcf71b9c0
4 changed files with 114 additions and 6 deletions
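
The docstring of the `use_local_argmax_reduction` option added by this commit describes replacing an all-gather of the full logits with a vocab-parallel local argmax. The following is a minimal single-process sketch of that idea, not the code from this commit. The function names (`local_argmax`, `vocab_parallel_argmax`) and the list-based "ranks" are hypothetical and used only for illustration; a real implementation would presumably run on tensors and use a collective operation across TP ranks.

```python
import random


def local_argmax(shard, vocab_offset):
    """Return (max_value, global_index) for one rank's vocab shard (hypothetical helper)."""
    # max() returns the first maximal element, so ties resolve to the lowest index.
    local_idx = max(range(len(shard)), key=shard.__getitem__)
    return shard[local_idx], vocab_offset + local_idx


def vocab_parallel_argmax(logits, tp_size):
    """Simulate greedy token selection over a vocab-sharded logits row.

    Returns (global_token_index, scalars_exchanged). Assumes the vocab is
    split evenly and contiguously across ranks, which is an assumption of
    this sketch rather than something stated in the commit.
    """
    vocab_size = len(logits)
    assert vocab_size % tp_size == 0, "sketch assumes an evenly divisible vocab"
    shard_size = vocab_size // tp_size

    # Each simulated rank reduces its own shard to one (value, index) pair.
    gathered = []
    for rank in range(tp_size):
        offset = rank * shard_size
        shard = logits[offset:offset + shard_size]
        gathered.append(local_argmax(shard, offset))

    # Only the pairs are exchanged: 2 scalars per rank instead of vocab_size logits.
    scalars_exchanged = 2 * len(gathered)

    # Pick the pair with the largest value; ties go to the lowest rank,
    # which matches the first-occurrence behavior of a global argmax.
    _, best_index = max(gathered, key=lambda pair: pair[0])
    return best_index, scalars_exchanged


if __name__ == "__main__":
    rng = random.Random(0)
    row = [rng.random() for _ in range(32)]
    token, exchanged = vocab_parallel_argmax(row, tp_size=4)
    full = max(range(len(row)), key=row.__getitem__)
    print(token == full, exchanged)
```

The result matches a plain argmax over the full row only for greedy selection, which is consistent with the docstring restricting the option to greedy draft selection in non-tree speculation: sampling or top-k tree expansion would need more than the single best entry from each shard.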

@@ -109,6 +109,11 @@ class SpeculativeConfig:
speculative input batches can contain sequences of different lengths,
which may only be supported by certain attention backends. This currently
only affects the EAGLE method of speculation."""
use_local_argmax_reduction: bool = False
"""Use vocab-parallel local argmax instead of all-gathering full logits
for draft token generation. Reduces communication from O(vocab_size) to
O(2 * tp_size) per token. Only applies to greedy draft selection in
non-tree speculation."""
# Ngram proposer configuration
prompt_lookup_max: int | None = Field(default=None, ge=1)