[Doc] Explicitly state that PP isn't compatible with speculative decoding yet (#10975)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Cyrus Leung
2024-12-08 01:20:49 +08:00
committed by GitHub
parent 39e227c7ae
commit c889d5888b
8 changed files with 32 additions and 9 deletions

@@ -8,6 +8,9 @@ Speculative decoding
    not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. The work
    to optimize it is ongoing and can be followed in `this issue. <https://github.com/vllm-project/vllm/issues/4630>`_
 
+.. warning::
+   Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
+
 This document shows how to use `Speculative Decoding <https://x.com/karpathy/status/1697318534555336961>`_ with vLLM.
 Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
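For context, the draft-and-verify loop behind speculative decoding can be sketched with toy stand-in "models" (simple arithmetic functions, hypothetical and unrelated to the vLLM API): a cheap draft model proposes a few tokens ahead, and the target model keeps the longest agreeing prefix plus one corrected token.

```python
# Toy sketch of the draft-and-verify loop behind speculative decoding.
# draft_next and target_next are hypothetical stand-ins, not real LLMs.

def draft_next(token: int) -> int:
    # Cheap draft model: usually agrees with the target.
    return (token * 3 + 1) % 97

def target_next(token: int) -> int:
    # Expensive target model: occasionally diverges from the draft.
    nxt = (token * 3 + 1) % 97
    return nxt if token % 5 else (nxt + 1) % 97

def speculative_step(last_token: int, k: int = 4) -> list[int]:
    """Propose k tokens with the draft model, then keep the longest
    prefix the target model agrees with, plus one corrected token."""
    # Draft phase: k cheap sequential proposals.
    proposal, tok = [], last_token
    for _ in range(k):
        tok = draft_next(tok)
        proposal.append(tok)
    # Verify phase: accept drafted tokens while the target agrees;
    # on the first mismatch, emit the target's token and stop.
    accepted, tok = [], last_token
    for drafted in proposal:
        verified = target_next(tok)
        if verified == drafted:
            accepted.append(drafted)
            tok = drafted
        else:
            accepted.append(verified)
            break
    return accepted

print(speculative_step(1))  # up to k+1 tokens per target-model pass
```

Every accepted token matches what the target model would have produced one at a time, so the output distribution is unchanged; the latency win comes from verifying several drafted tokens per target-model pass instead of one.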