[Doc] Convert docs to use colon fences (#12471)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -2,11 +2,11 @@
 
 # Cerebrium
 
-```{raw} html
+:::{raw} html
 <p align="center">
     <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
 </p>
-```
+:::
 
 vLLM can be run on a cloud-based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.
@@ -2,11 +2,11 @@
 
 # dstack
 
-```{raw} html
+:::{raw} html
 <p align="center">
     <img src="https://i.ibb.co/71kx6hW/vllm-dstack.png" alt="vLLM_plus_dstack"/>
 </p>
-```
+:::
 
 vLLM can be run on a cloud-based GPU machine with [dstack](https://dstack.ai/), an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, a gateway, and GPU quotas in your cloud environment.
@@ -97,6 +97,6 @@ completion = client.chat.completions.create(
 print(completion.choices[0].message.content)
 ```
 
-```{note}
+:::{note}
 dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`; a `Task` is intended for development only. For more hands-on material on serving vLLM with dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm).
-```
+:::
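For illustration, a minimal dstack `Task` configuration might look like the sketch below. The model name, GPU size, and exact field names are assumptions rather than values from this diff; check dstack's configuration reference for the precise schema:

```yaml
# .dstack.yml — hypothetical development Task (no gateway required)
type: task
image: vllm/vllm-openai:latest
commands:
  - vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
ports:
  - 8000
resources:
  gpu: 24GB
```

A `Service` would instead be published through the configured gateway and load-balanced across replicas.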
@@ -38,213 +38,213 @@ chart **including persistent volumes** and deletes the release.
 
 ## Architecture
 
-```{image} /assets/deployment/architecture_helm_deployment.png
-```
+:::{image} /assets/deployment/architecture_helm_deployment.png
+:::
 
 ## Values
 
-```{list-table}
+:::{list-table}
 :widths: 25 25 25 25
 :header-rows: 1
 
-* - Key
-  - Type
-  - Default
-  - Description
-* - autoscaling
-  - object
-  - {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
-  - Autoscaling configuration
-* - autoscaling.enabled
-  - bool
-  - false
-  - Enable autoscaling
-* - autoscaling.maxReplicas
-  - int
-  - 100
-  - Maximum replicas
-* - autoscaling.minReplicas
-  - int
-  - 1
-  - Minimum replicas
-* - autoscaling.targetCPUUtilizationPercentage
-  - int
-  - 80
-  - Target CPU utilization for autoscaling
-* - configs
-  - object
-  - {}
-  - Configmap
-* - containerPort
-  - int
-  - 8000
-  - Container port
-* - customObjects
-  - list
-  - []
-  - Custom Objects configuration
-* - deploymentStrategy
-  - object
-  - {}
-  - Deployment strategy configuration
-* - externalConfigs
-  - list
-  - []
-  - External configuration
-* - extraContainers
-  - list
-  - []
-  - Additional containers configuration
-* - extraInit
-  - object
-  - {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
-  - Additional configuration for the init container
-* - extraInit.pvcStorage
-  - string
-  - "50Gi"
-  - Storage size of the s3
-* - extraInit.s3modelpath
-  - string
-  - "relative_s3_model_path/opt-125m"
-  - Path of the model on the s3 which hosts model weights and config files
-* - extraInit.awsEc2MetadataDisabled
-  - boolean
-  - true
-  - Disables the use of the Amazon EC2 instance metadata service
-* - extraPorts
-  - list
-  - []
-  - Additional ports configuration
-* - gpuModels
-  - list
-  - ["TYPE_GPU_USED"]
-  - Type of gpu used
-* - image
-  - object
-  - {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
-  - Image configuration
-* - image.command
-  - list
-  - ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
-  - Container launch command
-* - image.repository
-  - string
-  - "vllm/vllm-openai"
-  - Image repository
-* - image.tag
-  - string
-  - "latest"
-  - Image tag
-* - livenessProbe
-  - object
-  - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
-  - Liveness probe configuration
-* - livenessProbe.failureThreshold
-  - int
-  - 3
-  - Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive
-* - livenessProbe.httpGet
-  - object
-  - {"path":"/health","port":8000}
-  - Configuration of the Kubelet http request on the server
-* - livenessProbe.httpGet.path
-  - string
-  - "/health"
-  - Path to access on the HTTP server
-* - livenessProbe.httpGet.port
-  - int
-  - 8000
-  - Name or number of the port to access on the container, on which the server is listening
-* - livenessProbe.initialDelaySeconds
-  - int
-  - 15
-  - Number of seconds after the container has started before liveness probe is initiated
-* - livenessProbe.periodSeconds
-  - int
-  - 10
-  - How often (in seconds) to perform the liveness probe
-* - maxUnavailablePodDisruptionBudget
-  - string
-  - ""
-  - Disruption Budget Configuration
-* - readinessProbe
-  - object
-  - {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
-  - Readiness probe configuration
-* - readinessProbe.failureThreshold
-  - int
-  - 3
-  - Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
-* - readinessProbe.httpGet
-  - object
-  - {"path":"/health","port":8000}
-  - Configuration of the Kubelet http request on the server
-* - readinessProbe.httpGet.path
-  - string
-  - "/health"
-  - Path to access on the HTTP server
-* - readinessProbe.httpGet.port
-  - int
-  - 8000
-  - Name or number of the port to access on the container, on which the server is listening
-* - readinessProbe.initialDelaySeconds
-  - int
-  - 5
-  - Number of seconds after the container has started before readiness probe is initiated
-* - readinessProbe.periodSeconds
-  - int
-  - 5
-  - How often (in seconds) to perform the readiness probe
-* - replicaCount
-  - int
-  - 1
-  - Number of replicas
-* - resources
-  - object
-  - {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
-  - Resource configuration
-* - resources.limits."nvidia.com/gpu"
-  - int
-  - 1
-  - Number of gpus used
-* - resources.limits.cpu
-  - int
-  - 4
-  - Number of CPUs
-* - resources.limits.memory
-  - string
-  - "16Gi"
-  - CPU memory configuration
-* - resources.requests."nvidia.com/gpu"
-  - int
-  - 1
-  - Number of gpus used
-* - resources.requests.cpu
-  - int
-  - 4
-  - Number of CPUs
-* - resources.requests.memory
-  - string
-  - "16Gi"
-  - CPU memory configuration
-* - secrets
-  - object
-  - {}
-  - Secrets configuration
-* - serviceName
-  - string
-  -
-  - Service name
-* - servicePort
-  - int
-  - 80
-  - Service port
-* - labels.environment
-  - string
-  - test
-  - Environment name
-* - labels.release
-  - string
-  - test
-  - Release name
-```
+- * Key
+  * Type
+  * Default
+  * Description
+- * autoscaling
+  * object
+  * {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
+  * Autoscaling configuration
+- * autoscaling.enabled
+  * bool
+  * false
+  * Enable autoscaling
+- * autoscaling.maxReplicas
+  * int
+  * 100
+  * Maximum replicas
+- * autoscaling.minReplicas
+  * int
+  * 1
+  * Minimum replicas
+- * autoscaling.targetCPUUtilizationPercentage
+  * int
+  * 80
+  * Target CPU utilization for autoscaling
+- * configs
+  * object
+  * {}
+  * Configmap
+- * containerPort
+  * int
+  * 8000
+  * Container port
+- * customObjects
+  * list
+  * []
+  * Custom Objects configuration
+- * deploymentStrategy
+  * object
+  * {}
+  * Deployment strategy configuration
+- * externalConfigs
+  * list
+  * []
+  * External configuration
+- * extraContainers
+  * list
+  * []
+  * Additional containers configuration
+- * extraInit
+  * object
+  * {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
+  * Additional configuration for the init container
+- * extraInit.pvcStorage
+  * string
+  * "50Gi"
+  * Storage size of the s3
+- * extraInit.s3modelpath
+  * string
+  * "relative_s3_model_path/opt-125m"
+  * Path of the model on the s3 which hosts model weights and config files
+- * extraInit.awsEc2MetadataDisabled
+  * boolean
+  * true
+  * Disables the use of the Amazon EC2 instance metadata service
+- * extraPorts
+  * list
+  * []
+  * Additional ports configuration
+- * gpuModels
+  * list
+  * ["TYPE_GPU_USED"]
+  * Type of gpu used
+- * image
+  * object
+  * {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
+  * Image configuration
+- * image.command
+  * list
+  * ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
+  * Container launch command
+- * image.repository
+  * string
+  * "vllm/vllm-openai"
+  * Image repository
+- * image.tag
+  * string
+  * "latest"
+  * Image tag
+- * livenessProbe
+  * object
+  * {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
+  * Liveness probe configuration
+- * livenessProbe.failureThreshold
+  * int
+  * 3
+  * Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive
+- * livenessProbe.httpGet
+  * object
+  * {"path":"/health","port":8000}
+  * Configuration of the Kubelet http request on the server
+- * livenessProbe.httpGet.path
+  * string
+  * "/health"
+  * Path to access on the HTTP server
+- * livenessProbe.httpGet.port
+  * int
+  * 8000
+  * Name or number of the port to access on the container, on which the server is listening
+- * livenessProbe.initialDelaySeconds
+  * int
+  * 15
+  * Number of seconds after the container has started before liveness probe is initiated
+- * livenessProbe.periodSeconds
+  * int
+  * 10
+  * How often (in seconds) to perform the liveness probe
+- * maxUnavailablePodDisruptionBudget
+  * string
+  * ""
+  * Disruption Budget Configuration
+- * readinessProbe
+  * object
+  * {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
+  * Readiness probe configuration
+- * readinessProbe.failureThreshold
+  * int
+  * 3
+  * Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready
+- * readinessProbe.httpGet
+  * object
+  * {"path":"/health","port":8000}
+  * Configuration of the Kubelet http request on the server
+- * readinessProbe.httpGet.path
+  * string
+  * "/health"
+  * Path to access on the HTTP server
+- * readinessProbe.httpGet.port
+  * int
+  * 8000
+  * Name or number of the port to access on the container, on which the server is listening
+- * readinessProbe.initialDelaySeconds
+  * int
+  * 5
+  * Number of seconds after the container has started before readiness probe is initiated
+- * readinessProbe.periodSeconds
+  * int
+  * 5
+  * How often (in seconds) to perform the readiness probe
+- * replicaCount
+  * int
+  * 1
+  * Number of replicas
+- * resources
+  * object
+  * {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
+  * Resource configuration
+- * resources.limits."nvidia.com/gpu"
+  * int
+  * 1
+  * Number of gpus used
+- * resources.limits.cpu
+  * int
+  * 4
+  * Number of CPUs
+- * resources.limits.memory
+  * string
+  * "16Gi"
+  * CPU memory configuration
+- * resources.requests."nvidia.com/gpu"
+  * int
+  * 1
+  * Number of gpus used
+- * resources.requests.cpu
+  * int
+  * 4
+  * Number of CPUs
+- * resources.requests.memory
+  * string
+  * "16Gi"
+  * CPU memory configuration
+- * secrets
+  * object
+  * {}
+  * Secrets configuration
+- * serviceName
+  * string
+  *
+  * Service name
+- * servicePort
+  * int
+  * 80
+  * Service port
+- * labels.environment
+  * string
+  * test
+  * Environment name
+- * labels.release
+  * string
+  * test
+  * Release name
+:::
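As a usage sketch, the values above can be overridden at install time with a custom values file. The file name and the specific overrides below are placeholders for illustration, not defaults from the chart:

```yaml
# custom-values.yaml — hypothetical overrides for the values table above
replicaCount: 2
image:
  repository: vllm/vllm-openai
  tag: latest
resources:
  limits:
    cpu: 4
    memory: "16Gi"
    nvidia.com/gpu: 1
extraInit:
  pvcStorage: "50Gi"
```

It would then be applied with something like `helm install <release-name> <chart-path> -f custom-values.yaml`.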
@@ -1,6 +1,6 @@
 # Using other frameworks
 
-```{toctree}
+:::{toctree}
 :maxdepth: 1
 
 bentoml
@@ -11,4 +11,4 @@ lws
 modal
 skypilot
 triton
-```
+:::
@@ -2,11 +2,11 @@
 
 # SkyPilot
 
-```{raw} html
+:::{raw} html
 <p align="center">
     <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
 </p>
-```
+:::
 
 vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3 and Mixtral, can be found in the [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).
@@ -104,10 +104,10 @@ service:
     max_completion_tokens: 1
 ```
 
-```{raw} html
+:::{raw} html
 <details>
 <summary>Click to see the full recipe YAML</summary>
-```
+:::
 
 ```yaml
 service:
@@ -153,9 +153,9 @@ run: |
     2>&1 | tee api_server.log
 ```
 
-```{raw} html
+:::{raw} html
 </details>
-```
+:::
 
 Start serving the Llama-3 8B model on multiple replicas:
@@ -169,10 +169,10 @@ Wait until the service is ready:
 watch -n10 sky serve status vllm
 ```
 
-```{raw} html
+:::{raw} html
 <details>
 <summary>Example outputs:</summary>
-```
+:::
 
 ```console
 Services
@@ -185,9 +185,9 @@ vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) R
 vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
 ```
 
-```{raw} html
+:::{raw} html
 </details>
-```
+:::
 
 After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
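For instance, the endpoint accepts OpenAI-compatible requests. The sketch below only builds the request body; the endpoint address and model name are placeholders, not values taken from this page:

```python
import json

# Placeholder endpoint, e.g. as reported by `sky serve status vllm --endpoint`
ENDPOINT = "http://xx.yy.zz.100:30001"

def chat_payload(model: str, prompt: str, max_completion_tokens: int = 128) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_completion_tokens": max_completion_tokens,
    }

body = json.dumps(chat_payload("meta-llama/Meta-Llama-3-8B-Instruct", "Who are you?"))
# POST `body` to f"{ENDPOINT}/v1/chat/completions" with the HTTP client of your choice.
print(body)
```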
@@ -223,10 +223,10 @@ service:
 
 This will scale the service up when the QPS exceeds 2 for each replica.
 
-```{raw} html
+:::{raw} html
 <details>
 <summary>Click to see the full recipe YAML</summary>
-```
+:::
 
 ```yaml
 service:
@@ -275,9 +275,9 @@ run: |
     2>&1 | tee api_server.log
 ```
 
-```{raw} html
+:::{raw} html
 </details>
-```
+:::
 
 To update the service with the new config:
@@ -295,10 +295,10 @@ sky serve down vllm
 
 It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.
 
-```{raw} html
+:::{raw} html
 <details>
 <summary>Click to see the full GUI YAML</summary>
-```
+:::
 
 ```yaml
 envs:
@@ -328,9 +328,9 @@ run: |
     --stop-token-ids 128009,128001 | tee ~/gradio.log
 ```
 
-```{raw} html
+:::{raw} html
 </details>
-```
+:::
 
 1. Start the chat web UI: