[Doc][3/N] Reorganize Serving section (#11766)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

docs/source/deployment/integrations/index.md (new file, 9 lines)
@@ -0,0 +1,9 @@

# External Integrations

```{toctree}
:maxdepth: 1

kserve
kubeai
llamastack
```

docs/source/deployment/integrations/kserve.md (new file, 7 lines)
@@ -0,0 +1,7 @@

(deployment-kserve)=

# KServe

vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.

Please see [this guide](https://kserve.github.io/website/latest/modelserving/v1beta1/llm/huggingface/) for more details on using vLLM with KServe.
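
For a rough sense of what this looks like in practice, the sketch below shows a KServe `InferenceService` using the Hugging Face serving runtime (which serves LLMs through vLLM by default). The service name, model ID, and resource values are illustrative placeholders rather than values from the guide:

```yaml
# Illustrative sketch only -- check the KServe guide above for the authoritative spec.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3          # hypothetical service name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface           # Hugging Face runtime, backed by vLLM
      args:
        - --model_name=llama3
        - --model_id=meta-llama/Meta-Llama-3-8B-Instruct
      resources:
        limits:
          nvidia.com/gpu: "1"       # example GPU allocation
```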

docs/source/deployment/integrations/kubeai.md (new file, 15 lines)
@@ -0,0 +1,15 @@

(deployment-kubeai)=

# KubeAI

[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load-based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.

Please see the installation guides for environment-specific instructions:

- [Any Kubernetes Cluster](https://www.kubeai.org/installation/any/)
- [EKS](https://www.kubeai.org/installation/eks/)
- [GKE](https://www.kubeai.org/installation/gke/)

Once you have KubeAI installed, you can [configure text generation models](https://www.kubeai.org/how-to/configure-text-generation-models/) using vLLM.
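
As a loose sketch of what that configuration involves, KubeAI models are declared as `Model` custom resources; the field names and values below are recalled from the KubeAI documentation and should be verified against the guide linked above:

```yaml
# Hypothetical KubeAI Model resource backed by the vLLM engine -- values are examples only.
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct       # example model name
spec:
  features: [TextGeneration]
  url: hf://meta-llama/Llama-3.1-8B-Instruct
  engine: VLLM                      # tells KubeAI to serve this model with vLLM
  resourceProfile: nvidia-gpu-l4:1  # example resource profile
  minReplicas: 0                    # enables scale-from-zero
```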

docs/source/deployment/integrations/llamastack.md (new file, 38 lines)
@@ -0,0 +1,38 @@

(deployment-llamastack)=

# Llama Stack

vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).

To install Llama Stack, run

```console
$ pip install llama-stack -q
```

## Inference using the OpenAI-Compatible API

Then start the Llama Stack server, pointing it to your vLLM server with the following configuration:

```yaml
inference:
  - provider_id: vllm0
    provider_type: remote::vllm
    config:
      url: http://127.0.0.1:8000
```
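
This configuration assumes an OpenAI-compatible vLLM server is already listening on port 8000. As a minimal sketch (the model name here is only an example), such a server can be started with:

```console
$ vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```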

Please refer to [this guide](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/remote-vllm.html) for more details on this remote vLLM provider.

## Inference via Embedded vLLM

An [inline vLLM provider](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/inference/vllm) is also available. This is a sample configuration using that method:
```yaml
|
||||
inference
|
||||
- provider_type: vllm
|
||||
config:
|
||||
model: Llama3.1-8B-Instruct
|
||||
tensor_parallel_size: 4
|
||||
```
|
||||
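
Assuming the snippet above is embedded in a complete Llama Stack run configuration (for example, a hypothetical `run.yaml`), the stack server is typically launched with the `llama stack run` command; the exact invocation may vary by Llama Stack version:

```console
$ llama stack run ./run.yaml
```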