Add full API docs and improve the UX of navigating them (#17485)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -130,11 +130,11 @@ class LLMEngine:
     iteration-level scheduling and efficient memory management to maximize the
     serving throughput.
 
-    The :class:`~vllm.LLM` class wraps this class for offline batched inference
-    and the :class:`AsyncLLMEngine` class wraps this class for online serving.
+    The {class}`~vllm.LLM` class wraps this class for offline batched inference
+    and the {class}`AsyncLLMEngine` class wraps this class for online serving.
 
-    The config arguments are derived from :class:`~vllm.EngineArgs`. (See
-    :ref:`engine-args`)
+    The config arguments are derived from {class}`~vllm.EngineArgs`. (See
+    {ref}`engine-args`)
 
     Args:
         model_config: The configuration related to the LLM model.
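For context while reading this hunk, a minimal sketch of the offline flow the docstring describes, assuming a locally available model (the model name below is illustrative, not part of the change):

```python
from vllm import EngineArgs, LLMEngine

# The engine is configured via EngineArgs, as the docstring notes;
# LLM / AsyncLLMEngine wrap this class for offline and online use.
engine_args = EngineArgs(model="facebook/opt-125m")  # illustrative model
engine = LLMEngine.from_engine_args(engine_args)
```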
@@ -694,11 +694,11 @@ class LLMEngine:
 
         Args:
             request_id: The unique ID of the request.
-            prompt: The prompt to the LLM. See :class:`~vllm.inputs.PromptType`
+            prompt: The prompt to the LLM. See {class}`~vllm.inputs.PromptType`
                 for more details about the format of each input.
             params: Parameters for sampling or pooling.
-                :class:`~vllm.SamplingParams` for text generation.
-                :class:`~vllm.PoolingParams` for pooling.
+                {class}`~vllm.SamplingParams` for text generation.
+                {class}`~vllm.PoolingParams` for pooling.
             arrival_time: The arrival time of the request. If None, we use
                 the current monotonic time.
             lora_request: The LoRA request to add.
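A hedged usage sketch matching these Args (the values are hypothetical and `engine` is the instance built in the earlier sketch):

```python
from vllm import SamplingParams

# Queue a text-generation request; arrival_time defaults to the
# current monotonic time when left as None.
engine.add_request(
    request_id="0",                          # unique, caller-chosen ID
    prompt="What is LLM?",                   # any vllm.inputs.PromptType
    params=SamplingParams(temperature=0.0),  # or PoolingParams for pooling
)
```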
@@ -710,10 +710,10 @@ class LLMEngine:
         Details:
             - Set arrival_time to the current time if it is None.
            - Set prompt_token_ids to the encoded prompt if it is None.
-            - Create `n` number of :class:`~vllm.Sequence` objects.
-            - Create a :class:`~vllm.SequenceGroup` object
-              from the list of :class:`~vllm.Sequence`.
-            - Add the :class:`~vllm.SequenceGroup` object to the scheduler.
+            - Create `n` number of {class}`~vllm.Sequence` objects.
+            - Create a {class}`~vllm.SequenceGroup` object
+              from the list of {class}`~vllm.Sequence`.
+            - Add the {class}`~vllm.SequenceGroup` object to the scheduler.
 
         Example:
             >>> # initialize engine
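To make the `n` point in these Details concrete, a small sketch with hypothetical values: with `n=2`, the engine builds two `Sequence` objects from the one prompt and schedules them together as a single `SequenceGroup`.

```python
# n=2: two candidate continuations are tracked for this one prompt,
# grouped into a single SequenceGroup in the scheduler.
engine.add_request("1", "Tell me a joke.",
                   SamplingParams(n=2, temperature=0.8))
```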
@@ -861,8 +861,8 @@ class LLMEngine:
 
         Details:
             - Refer to the
-              :meth:`~vllm.core.scheduler.Scheduler.abort_seq_group`
-              from class :class:`~vllm.core.scheduler.Scheduler`.
+              {meth}`~vllm.core.scheduler.Scheduler.abort_seq_group`
+              from class {class}`~vllm.core.scheduler.Scheduler`.
 
         Example:
             >>> # initialize engine and add a request with request_id
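A minimal sketch of the abort path this hunk documents (the request ID is hypothetical):

```python
# Delegates to Scheduler.abort_seq_group under the hood; aborting an
# unknown or already-finished request is effectively a no-op.
engine.abort_request("0")
```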
@@ -1258,53 +1258,56 @@ class LLMEngine:
     def step(self) -> List[Union[RequestOutput, PoolingRequestOutput]]:
         """Performs one decoding iteration and returns newly generated results.
 
-        .. figure:: https://i.imgur.com/sv2HssD.png
-            :alt: Overview of the step function
-            :align: center
+        :::{figure} https://i.imgur.com/sv2HssD.png
+        :alt: Overview of the step function
+        :align: center
 
-            Overview of the step function.
+        Overview of the step function.
+        :::
 
         Details:
-        - Step 1: Schedules the sequences to be executed in the next
-            iteration and the token blocks to be swapped in/out/copy.
+            - Step 1: Schedules the sequences to be executed in the next
+              iteration and the token blocks to be swapped in/out/copied.
 
-            - Depending on the scheduling policy,
-                sequences may be `preempted/reordered`.
-            - A Sequence Group (SG) refer to a group of sequences
-                that are generated from the same prompt.
+                - Depending on the scheduling policy,
+                  sequences may be `preempted/reordered`.
+                - A Sequence Group (SG) refers to a group of sequences
+                  that are generated from the same prompt.
 
-        - Step 2: Calls the distributed executor to execute the model.
-        - Step 3: Processes the model output. This mainly includes:
+            - Step 2: Calls the distributed executor to execute the model.
+            - Step 3: Processes the model output. This mainly includes:
 
-            - Decodes the relevant outputs.
-            - Updates the scheduled sequence groups with model outputs
-                based on its `sampling parameters` (`use_beam_search` or not).
-            - Frees the finished sequence groups.
+                - Decodes the relevant outputs.
+                - Updates the scheduled sequence groups with model outputs
+                  based on their `sampling parameters` (`use_beam_search` or not).
+                - Frees the finished sequence groups.
 
-        - Finally, it creates and returns the newly generated results.
+            - Finally, it creates and returns the newly generated results.
 
         Example:
-            >>> # Please see the example/ folder for more detailed examples.
-            >>>
-            >>> # initialize engine and request arguments
-            >>> engine = LLMEngine.from_engine_args(engine_args)
-            >>> example_inputs = [(0, "What is LLM?",
-            >>>    SamplingParams(temperature=0.0))]
-            >>>
-            >>> # Start the engine with an event loop
-            >>> while True:
-            >>>     if example_inputs:
-            >>>         req_id, prompt, sampling_params = example_inputs.pop(0)
-            >>>         engine.add_request(str(req_id),prompt,sampling_params)
-            >>>
-            >>>     # continue the request processing
-            >>>     request_outputs = engine.step()
-            >>>     for request_output in request_outputs:
-            >>>         if request_output.finished:
-            >>>             # return or show the request output
-            >>>
-            >>>     if not (engine.has_unfinished_requests() or example_inputs):
-            >>>         break
+        ```
+        # Please see the example/ folder for more detailed examples.
+
+        # initialize engine and request arguments
+        engine = LLMEngine.from_engine_args(engine_args)
+        example_inputs = [(0, "What is LLM?",
+                           SamplingParams(temperature=0.0))]
+
+        # Start the engine with an event loop
+        while True:
+            if example_inputs:
+                req_id, prompt, sampling_params = example_inputs.pop(0)
+                engine.add_request(str(req_id), prompt, sampling_params)
+
+            # continue the request processing
+            request_outputs = engine.step()
+            for request_output in request_outputs:
+                if request_output.finished:
+                    # return or show the request output
+                    print(request_output)
+
+            if not (engine.has_unfinished_requests() or example_inputs):
+                break
+        ```
         """
         if self.parallel_config.pipeline_parallel_size > 1:
             raise NotImplementedError(
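The docstring example above elides what to do with a finished output; a short sketch of inspecting one, using the `RequestOutput`/`CompletionOutput` fields:

```python
# Each RequestOutput carries one CompletionOutput per requested sequence.
for request_output in engine.step():
    if request_output.finished:
        for completion in request_output.outputs:
            print(request_output.request_id, completion.text)
```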