[Doc][3/N] Reorganize Serving section (#11766)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
This commit is contained in:
Cyrus Leung
2025-01-07 11:20:01 +08:00
committed by GitHub
parent d93d2d74fd
commit 8ceffbf315
40 changed files with 248 additions and 133 deletions

@@ -0,0 +1,9 @@
# External Integrations
```{toctree}
:maxdepth: 1
kserve
kubeai
llamastack
```

@@ -0,0 +1,7 @@
(deployment-kserve)=
# KServe
vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
Please see [this guide](https://kserve.github.io/website/latest/modelserving/v1beta1/llm/huggingface/) for more details on using vLLM with KServe.
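For orientation, a deployment typically defines an `InferenceService` that uses KServe's Hugging Face serving runtime, which can run vLLM as its backend. The manifest below is only an illustrative sketch, not a verified configuration: the model ID, runtime arguments, and resource limits are placeholders, and the exact schema should be taken from the guide above.
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        # Placeholder model; point this at the model vLLM should serve.
        - --model_name=llama3
        - --model_id=meta-llama/Meta-Llama-3-8B-Instruct
      resources:
        limits:
          nvidia.com/gpu: "1"
```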

@@ -0,0 +1,15 @@
(deployment-kubeai)=
# KubeAI
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load-based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
Please see the Installation Guides for environment-specific instructions:
- [Any Kubernetes Cluster](https://www.kubeai.org/installation/any/)
- [EKS](https://www.kubeai.org/installation/eks/)
- [GKE](https://www.kubeai.org/installation/gke/)
Once you have KubeAI installed, you can
[configure text generation models](https://www.kubeai.org/how-to/configure-text-generation-models/)
using vLLM.
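As a rough illustration, KubeAI models are declared through a `Model` custom resource. The manifest below is an assumption-laden sketch rather than a verified example; field names and especially the `resourceProfile` value depend on your installation, so check it against the KubeAI docs linked above.
```yaml
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct
spec:
  features: [TextGeneration]
  # Placeholder model source; point this at the model you want vLLM to serve.
  url: hf://meta-llama/Llama-3.1-8B-Instruct
  engine: VLLM
  # Profile names vary per cluster; this one assumes an NVIDIA L4 GPU profile.
  resourceProfile: nvidia-gpu-l4:1
  # Scale-from-zero: no replicas until a request arrives.
  minReplicas: 0
```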

@@ -0,0 +1,38 @@
(deployment-llamastack)=
# Llama Stack
vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).
To install Llama Stack, run:
```console
$ pip install llama-stack -q
```
## Inference using OpenAI Compatible API
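Before wiring up Llama Stack, make sure a vLLM OpenAI-compatible server is running. A minimal sketch (the model name is only a placeholder; serve whichever model you need):
```console
$ vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```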
Then start the Llama Stack server, pointing it to your vLLM server with the following configuration:
```yaml
inference:
  - provider_id: vllm0
    provider_type: remote::vllm
    config:
      url: http://127.0.0.1:8000
```
Please refer to [this guide](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/remote-vllm.html) for more details on this remote vLLM provider.
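Once the provider is configured (for example in a distribution's `run.yaml`), you can launch Llama Stack with its CLI. This is only a sketch; the configuration path is a placeholder for your own file:
```console
$ llama stack run ./run.yaml
```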
## Inference via Embedded vLLM
An [inline vLLM provider](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/inference/vllm) is also available. Here is a sample configuration using that method:
```yaml
inference:
  - provider_type: vllm
    config:
      model: Llama3.1-8B-Instruct
      tensor_parallel_size: 4
```