Make distinct code and console admonitions so readers are less likely to miss them (#20585)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -97,7 +97,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
 flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
 Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).

-??? Command
+??? console "Command"

 ```bash
 # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
@@ -30,7 +30,7 @@ python -m vllm.entrypoints.openai.api_server \

 - Call it with AutoGen:

-??? Code
+??? code

 ```python
 import asyncio
@@ -34,7 +34,7 @@ vllm = "latest"

 Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`:

-??? Code
+??? code

 ```python
 from vllm import LLM, SamplingParams
@@ -64,7 +64,7 @@ cerebrium deploy

 If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)

-??? Command
+??? console "Command"

 ```python
 curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
@@ -82,7 +82,7 @@ If successful, you should be returned a CURL command that you can call inference

 You should get a response like:

-??? Response
+??? console "Response"

 ```python
 {
@@ -26,7 +26,7 @@ dstack init

 Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:

-??? Config
+??? code "Config"

 ```yaml
 type: service
@@ -48,7 +48,7 @@ Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-

 Then, run the following CLI for provisioning:

-??? Command
+??? console "Command"

 ```console
 $ dstack run . -f serve.dstack.yml
@@ -79,7 +79,7 @@ Then, run the following CLI for provisioning:

 After the provisioning, you can interact with the model by using the OpenAI SDK:

-??? Code
+??? code

 ```python
 from openai import OpenAI
@@ -27,7 +27,7 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1

 - Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.

-??? Code
+??? code

 ```python
 from haystack.components.generators.chat import OpenAIChatGenerator
@@ -34,7 +34,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat

 - Call it with litellm:

-??? Code
+??? code

 ```python
 import litellm
@@ -17,7 +17,7 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber

 Deploy the following yaml file `lws.yaml`

-??? Yaml
+??? code "Yaml"

 ```yaml
 apiVersion: leaderworkerset.x-k8s.io/v1
@@ -177,7 +177,7 @@ curl http://localhost:8080/v1/completions \

 The output should be similar to the following

-??? Output
+??? console "Output"

 ```text
 {
@@ -24,7 +24,7 @@ sky check

 See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml).

-??? Yaml
+??? code "Yaml"

 ```yaml
 resources:
@@ -95,7 +95,7 @@ HF_TOKEN="your-huggingface-token" \

 SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a services section to the YAML file.

-??? Yaml
+??? code "Yaml"

 ```yaml
 service:
@@ -111,7 +111,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
 max_completion_tokens: 1
 ```

-??? Yaml
+??? code "Yaml"

 ```yaml
 service:
@@ -186,7 +186,7 @@ vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) R

 After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:

-??? Commands
+??? console "Commands"

 ```bash
 ENDPOINT=$(sky serve status --endpoint 8081 vllm)
@@ -220,7 +220,7 @@ service:

 This will scale the service up to when the QPS exceeds 2 for each replica.

-??? Yaml
+??? code "Yaml"

 ```yaml
 service:
@@ -285,7 +285,7 @@ sky serve down vllm

 It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests send to the GUI will be load-balanced across replicas.

-??? Yaml
+??? code "Yaml"

 ```yaml
 envs:
@@ -60,7 +60,7 @@ And then you can send out a query to the OpenAI-compatible API to check the avai
 curl -o- http://localhost:30080/models
 ```

-??? Output
+??? console "Output"

 ```json
 {
@@ -89,7 +89,7 @@ curl -X POST http://localhost:30080/completions \
 }'
 ```

-??? Output
+??? console "Output"

 ```json
 {
@@ -121,7 +121,7 @@ sudo helm uninstall vllm

 The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:

-??? Yaml
+??? code "Yaml"

 ```yaml
 servingEngineSpec:
@@ -29,7 +29,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:

 First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model:

-??? Config
+??? console "Config"

 ```bash
 cat <<EOF |kubectl apply -f -
@@ -57,7 +57,7 @@ First, create a Kubernetes PVC and Secret for downloading and storing Hugging Fa

 Next, start the vLLM server as a Kubernetes Deployment and Service:

-??? Config
+??? console "Config"

 ```bash
 cat <<EOF |kubectl apply -f -
@@ -36,7 +36,7 @@ docker build . -f Dockerfile.nginx --tag nginx-lb

 Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.

-??? Config
+??? console "Config"

 ```console
 upstream backend {
@@ -95,7 +95,7 @@ Notes:
 - The below example assumes GPU backend used. If you are using CPU backend, remove `--gpus device=ID`, add `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
 - Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.

-??? Commands
+??? console "Commands"

 ```console
 mkdir -p ~/.cache/huggingface/hub/