[Doc] Explicitly state that PP isn't compatible with speculative decoding yet (#10975)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Cyrus Leung
2024-12-08 01:20:49 +08:00
committed by GitHub
parent 39e227c7ae
commit c889d5888b
8 changed files with 32 additions and 9 deletions

@@ -8,6 +8,9 @@ Speculative decoding
    not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. The work
    to optimize it is ongoing and can be followed in `this issue. <https://github.com/vllm-project/vllm/issues/4630>`_
 
+.. warning::
+   Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
+
 This document shows how to use `Speculative Decoding <https://x.com/karpathy/status/1697318534555336961>`_ with vLLM.
 Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
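For context, the draft-and-verify loop behind speculative decoding can be sketched with toy stand-in "models" (simple arithmetic functions, hypothetical and unrelated to the vLLM API): a cheap draft model proposes a few tokens ahead, and the target model keeps the longest agreeing prefix plus one corrected token.

```python
# Toy sketch of the draft-and-verify loop behind speculative decoding.
# draft_next and target_next are hypothetical stand-ins, not real LLMs.

def draft_next(token: int) -> int:
    # Cheap draft model: usually agrees with the target.
    return (token * 3 + 1) % 97

def target_next(token: int) -> int:
    # Expensive target model: occasionally diverges from the draft.
    nxt = (token * 3 + 1) % 97
    return nxt if token % 5 else (nxt + 1) % 97

def speculative_step(last_token: int, k: int = 4) -> list[int]:
    """Propose k tokens with the draft model, then keep the longest
    prefix the target model agrees with, plus one corrected token."""
    # Draft phase: k cheap sequential proposals.
    proposal, tok = [], last_token
    for _ in range(k):
        tok = draft_next(tok)
        proposal.append(tok)
    # Verify phase: accept drafted tokens while the target agrees;
    # on the first mismatch, emit the target's token and stop.
    accepted, tok = [], last_token
    for drafted in proposal:
        verified = target_next(tok)
        if verified == drafted:
            accepted.append(drafted)
            tok = drafted
        else:
            accepted.append(verified)
            break
    return accepted

print(speculative_step(1))  # up to k+1 tokens per target-model pass
```

Every accepted token matches what the target model would have produced one at a time, so the output distribution is unchanged; the latency win comes from verifying several drafted tokens per target-model pass instead of one.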