[Doc]: fixing multiple typos in diverse files (#33256)
Signed-off-by: Didier Durand <durand.didier@gmail.com>
Signed-off-by: Didier Durand <2927957+didier-durand@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@@ -28,7 +28,7 @@ Furthermore, vLLM decides whether to enable or disable a `CustomOp` based on `co
 !!! note

     Note that `all` and `none` cannot coexist in `compilation_config.custom_ops`.

-By default, if `compilation_config.backend == "inductor"` and `compilation_config.mode != CompilationMode.NONE`, a `none` will be appended into `compilation_config.custom_ops`, otherwise a `all` will be appended. In other words, this means `CustomOp` will be disabled in some platforms (i.e., those use `inductor` as dafault backend for `torch.compile`) when running with torch compile mode. In this case, Inductor generates (fused) Triton kernels for those disabled custom ops.
+By default, if `compilation_config.backend == "inductor"` and `compilation_config.mode != CompilationMode.NONE`, a `none` will be appended into `compilation_config.custom_ops`, otherwise a `all` will be appended. In other words, this means `CustomOp` will be disabled in some platforms (i.e., those use `inductor` as default backend for `torch.compile`) when running with torch compile mode. In this case, Inductor generates (fused) Triton kernels for those disabled custom ops.

 !!! note

     For multi-modal models, vLLM has enforced the enabling of some custom ops to use device-specific deep-optimized kernels for better performance in ViT part, such as `MMEncoderAttention` and `ApplyRotaryEmb`. We can also pass a `enforce_enable=True` param to the `__init__()` method of the `CustomOp` to enforce enable itself at object-level.
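The default rule changed in this hunk can be sketched as plain Python (a minimal illustration, not vLLM's actual code; the function name is hypothetical):

```python
# Minimal sketch of the default rule above (not vLLM's actual code):
# with the inductor backend and compilation enabled, "none" is appended
# (CustomOps disabled, Inductor emits fused Triton kernels); otherwise
# "all" is appended (CustomOps stay enabled).
def default_custom_ops(backend: str, mode_is_none: bool, custom_ops: list[str]) -> list[str]:
    ops = list(custom_ops)
    if backend == "inductor" and not mode_is_none:
        ops.append("none")
    else:
        ops.append("all")
    return ops
```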
@@ -211,7 +211,7 @@ LLM(model, compilation_config=CompilationConfig(
 These modes are stricter and reduce or eliminate the need of dynamic shapes guarding, which can help isolate issues:

 - `unbacked`: Uses unbacked symints which don't allow guards, making it easier to identify where guards are being incorrectly added
-- `backed_size_oblivious`: Uses a mode that is more strict about guarding.
+- `backed_size_oblivious`: Uses a mode that is stricter about guarding.

 For more details on dynamic shapes modes, see [Dynamic shapes and vLLM guard dropping](torch_compile.md#dynamic-shapes-and-vllm-guard-dropping).
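The guard behavior these modes suppress can be illustrated with a toy cache (a conceptual sketch only, not torch or vLLM internals):

```python
# Conceptual sketch: a size guard pins compiled code to the exact size
# seen at compile time, so each new size triggers a recompile. Stricter
# modes such as `unbacked` refuse to create these guards, which makes an
# incorrectly added guard fail loudly instead of silently recompiling.
compiled_cache: dict[int, str] = {}

def run_with_size_guard(seq_len: int) -> str:
    if seq_len not in compiled_cache:  # guard miss -> "recompile"
        compiled_cache[seq_len] = f"kernel_for_len_{seq_len}"
    return compiled_cache[seq_len]
```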
@@ -100,7 +100,7 @@ Every plugin has three parts:
 - `_enum`: This property is the device enumeration from [PlatformEnum][vllm.platforms.interface.PlatformEnum]. Usually, it should be `PlatformEnum.OOT`, which means the platform is out-of-tree.
 - `device_type`: This property should return the type of the device which pytorch uses. For example, `"cpu"`, `"cuda"`, etc.
 - `device_name`: This property is set the same as `device_type` usually. It's mainly used for logging purposes.
-- `check_and_update_config`: This function is called very early in the vLLM's initialization process. It's used for plugins to update the vllm configuration. For example, the block size, graph mode config, etc, can be updated in this function. The most important thing is that the **worker_cls** should be set in this function to let vLLM know which worker class to use for the worker process.
+- `check_and_update_config`: This function is called very early in the vLLM's initialization process. It's used for plugins to update the vllm configuration. For example, the block size, graph mode config, etc., can be updated in this function. The most important thing is that the **worker_cls** should be set in this function to let vLLM know which worker class to use for the worker process.
 - `get_attn_backend_cls`: This function should return the attention backend class's fully qualified name.
 - `get_device_communicator_cls`: This function should return the device communicator class's fully qualified name.
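The bullet list above can be condensed into a skeleton plugin class (a hedged sketch: the base class, signatures, and the `my_plugin.*` module paths are assumptions, not real vLLM code):

```python
from types import SimpleNamespace

# Skeleton out-of-tree platform following the properties listed above.
# In real vLLM this would subclass the platform interface and use
# PlatformEnum.OOT; everything here is a simplified stand-in.
class MyDummyPlatform:
    _enum = "OOT"        # stands in for PlatformEnum.OOT
    device_type = "cpu"  # the device type pytorch uses
    device_name = "cpu"  # usually same as device_type; used for logging

    @classmethod
    def check_and_update_config(cls, vllm_config) -> None:
        # Called very early in initialization; most importantly, tell
        # vLLM which worker class to launch for the worker process.
        vllm_config.worker_cls = "my_plugin.worker.MyWorker"

    @classmethod
    def get_attn_backend_cls(cls) -> str:
        return "my_plugin.my_dummy_attention.MyDummyAttention"

    @classmethod
    def get_device_communicator_cls(cls) -> str:
        return "my_plugin.communicator.MyCommunicator"

# Example: a stand-in config object receiving the worker class.
cfg = SimpleNamespace(worker_cls=None)
MyDummyPlatform.check_and_update_config(cfg)
```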
@@ -126,7 +126,7 @@ Every plugin has three parts:
 5. Implement the attention backend class `MyDummyAttention` in `my_dummy_attention.py`. The attention backend class should inherit from [AttentionBackend][vllm.v1.attention.backend.AttentionBackend]. It's used to calculate attentions with your device. Take `vllm.v1.attention.backends` as examples, it contains many attention backend implementations.

-6. Implement custom ops for high performance. Most ops can be ran by pytorch native implementation, while the performance may not be good. In this case, you can implement specific custom ops for your plugins. Currently, there are kinds of custom ops vLLM supports:
+6. Implement custom ops for high performance. Most ops can be run by pytorch native implementation, while the performance may not be good. In this case, you can implement specific custom ops for your plugins. Currently, there are kinds of custom ops vLLM supports:

     - pytorch ops

       there are 3 kinds of pytorch ops:
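The enable/disable dispatch that step 6 relies on can be sketched as follows (an illustration of vLLM's `forward_native`/`forward_cuda` `CustomOp` convention; this class and its placeholder math are hypothetical):

```python
# Sketch of the CustomOp dispatch pattern: a portable native
# implementation plus a device-specific override, selected once at
# construction. The math here is a placeholder, not a real op.
class MyScaleOp:
    def __init__(self, enabled: bool):
        # enabled=True -> device kernel; False -> native path, which
        # Inductor can fuse into generated Triton kernels.
        self._forward = self.forward_cuda if enabled else self.forward_native

    def forward_native(self, xs: list[float]) -> list[float]:
        return [2.0 * x for x in xs]

    def forward_cuda(self, xs: list[float]) -> list[float]:
        # A hand-optimized kernel would live here; same result by contract.
        return [2.0 * x for x in xs]

    def __call__(self, xs: list[float]) -> list[float]:
        return self._forward(xs)
```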
@@ -327,7 +327,7 @@ curl -s http://localhost:8000/pooling -H "Content-Type: application/json" -d '{
 }'
 ```

-Due to limitations in the the output schema, the output consists of a list of
+Due to limitations in the output schema, the output consists of a list of
 token scores for each token for each input. This means that you'll have to call
 `/tokenize` as well to be able to pair tokens with scores.

 Refer to the tests in `tests/models/language/pooling/test_bge_m3.py` to see how
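Pairing the two endpoints' outputs can be done with a small helper (a sketch assuming one score per token, per the description above; the function name is ours, not part of vLLM's API):

```python
# Sketch: zip tokens from /tokenize with per-token scores from /pooling.
def pair_tokens_with_scores(tokens: list[str], scores: list[float]) -> list[tuple[str, float]]:
    if len(tokens) != len(scores):
        raise ValueError("expected one score per token")
    return list(zip(tokens, scores))
```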
@@ -9,7 +9,7 @@ Context parallel mainly solves the problem of serving long context requests. As
 During prefill, for a long request with `T` new tokens, we need to compute query/key/value tensors for these new tokens. Say we have `N` GPUs, we can split the request into `N` chunks, and each GPU computes one chunk of the query/key/value tensors.

-Depending on the use case, there're two possible strategies:
+Depending on the use case, there are two possible strategies:

 1. Partial query, full key/value: If the request token length is moderately long (we can afford holding the full key/value tensors), and the goal is to accelerate the prefill (and amortize the computation time of the prefill across query tokens), then we can gather the key/value tensors from all GPUs and let each GPU compute the attention output corresponding to the query tokens of its chunk.
 2. Partial query, partial key/value: If the request token length is too long, we cannot afford holding the full key/value tensors anymore, then we can only compute one chunk of query/key/value tensors for each GPU, and use techniques like [ring-attention](http://arxiv.org/abs/2310.01889) to send/recv key/value tensors chunk by chunk.
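The split of `T` new tokens into `N` contiguous chunks described above can be sketched as (the helper name is ours, not vLLM's):

```python
# Sketch: contiguous chunk boundaries for rank `rank` when T new tokens
# are split across N GPUs; the first T % N ranks get one extra token so
# the chunks differ in size by at most one.
def chunk_bounds(T: int, N: int, rank: int) -> tuple[int, int]:
    base, rem = divmod(T, N)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return start, end
```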