[Doc] Explicitly state that PP isn't compatible with speculative decoding yet (#10975)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
This commit is contained in:
@@ -8,6 +8,9 @@ Speculative decoding
|
||||
not usually yield inter-token latency reductions for all prompt datasets or sampling parameters. The work
|
||||
to optimize it is ongoing and can be followed in `this issue. <https://github.com/vllm-project/vllm/issues/4630>`_
|
||||
|
||||
.. warning::
|
||||
Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
|
||||
|
||||
This document shows how to use `Speculative Decoding <https://x.com/karpathy/status/1697318534555336961>`_ with vLLM.
|
||||
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user