[Spec Decode] Reduce TP communication for speculative decoding draft token generation (#34049)

Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com>
2026-02-22 14:59:16 -08:00
parent b7892a3bef
commit 2bcf71b9c0
4 changed files with 114 additions and 6 deletions
--- a/vllm/config/speculative.py
+++ b/vllm/config/speculative.py
@@ -109,6 +109,11 @@ class SpeculativeConfig:
    speculative input batches can contain sequences of different lengths,
    which may only be supported by certain attention backends. This currently
    only affects the EAGLE method of speculation."""
+    use_local_argmax_reduction: bool = False
+    """Use vocab-parallel local argmax instead of all-gathering full logits
+    for draft token generation. Reduces communication from O(vocab_size) to
+    O(2 * tp_size) per token. Only applies to greedy draft selection in
+    non-tree speculation."""

    # Ngram proposer configuration
    prompt_lookup_max: int | None = Field(default=None, ge=1)
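
Reviewer note: the `use_local_argmax_reduction` docstring above describes replacing an all-gather of full logits with a per-rank argmax followed by a small reduction. The sketch below illustrates that idea in plain Python under stated assumptions; it is not the code from this commit. The function names (`local_argmax`, `reduce_argmax`, `greedy_draft_token`) are hypothetical, the TP ranks are simulated as a list of vocab shards, and the gather step is stood in for by a Python list rather than a real collective.

```python
from typing import List, Tuple


def local_argmax(shard: List[float], vocab_offset: int) -> Tuple[float, int]:
    """Return (max logit, global token id) for one rank's vocab shard."""
    # max() over indices returns the first maximal index on ties.
    best = max(range(len(shard)), key=shard.__getitem__)
    return shard[best], vocab_offset + best


def reduce_argmax(pairs: List[Tuple[float, int]]) -> int:
    """Pick the global winner from the gathered per-rank (value, id) pairs."""
    # Highest logit wins; ties go to the lowest token id, which matches
    # a first-occurrence argmax over the full, unsharded logits.
    _, token_id = max(pairs, key=lambda p: (p[0], -p[1]))
    return token_id


def greedy_draft_token(shards: List[List[float]]) -> int:
    """Simulate greedy draft selection across TP ranks (hypothetical helper)."""
    pairs = []
    offset = 0
    for shard in shards:  # each iteration stands in for one TP rank
        pairs.append(local_argmax(shard, offset))
        offset += len(shard)
    # Assumption: in a real TP setup, `pairs` would be produced by gathering
    # two scalars per rank (2 * tp_size values) instead of vocab_size logits.
    return reduce_argmax(pairs)


if __name__ == "__main__":
    shards = [[0.1, 2.0, -1.0], [0.5, 3.5, 0.0], [3.5, 1.0, 2.2]]
    print(greedy_draft_token(shards))
```

Only the per-rank maximum and its global index cross rank boundaries here, which is why the approach fits greedy selection but not sampling or tree speculation that needs more than the top-1 token.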