[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631)
@@ -88,7 +88,9 @@ class LLM:
             to eager mode (DEPRECATED. Use `max_seq_len_to_capture` instead).
         max_seq_len_to_capture: Maximum sequence len covered by CUDA graphs.
             When a sequence has context length larger than this, we fall back
-            to eager mode.
+            to eager mode. Additionally for encoder-decoder models, if the
+            sequence length of the encoder input is larger than this, we fall
+            back to the eager mode.
         disable_custom_all_reduce: See ParallelConfig
         **kwargs: Arguments for :class:`~vllm.EngineArgs`. (See
             :ref:`engine_args`)
@@ -137,9 +139,7 @@ class LLM:
         LLM constructor.

         Note: if enforce_eager is unset (enforce_eager is None)
-        it defaults to False for decoder-only models and True
-        for encoder/decoder models, since encoder/decoder models
-        do not currently support CUDAGraph.
+        it defaults to False.
         '''

         if "disable_log_stats" not in kwargs:
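The fallback rule the updated docstring describes can be sketched as a small predicate. This is illustrative only: `use_cuda_graph` and its parameter names are hypothetical, not part of the vLLM API; it just mirrors the documented behavior under the assumption that each limit check is independent.

```python
def use_cuda_graph(decoder_context_len: int,
                   max_seq_len_to_capture: int,
                   encoder_input_len: int = 0,
                   enforce_eager: bool = False) -> bool:
    """Illustrative sketch of the documented fallback rule.

    Run under a captured CUDA graph only if eager mode is not forced,
    the decoder context fits within max_seq_len_to_capture, and (for
    encoder-decoder models) the encoder input fits as well; otherwise
    fall back to eager mode.
    """
    if enforce_eager:
        return False
    if decoder_context_len > max_seq_len_to_capture:
        return False
    if encoder_input_len > max_seq_len_to_capture:
        return False
    return True
```

With this commit, the same rule applies to both decoder-only models (where `encoder_input_len` is effectively zero) and encoder-decoder models, which is why the separate `enforce_eager` default for encoder/decoder models is dropped.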