[Docs] Fix syntax highlighting of shell commands (#19870)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac
 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
 ```
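As a quick sanity check after a hunk like the one above, the server's OpenAI-compatible chat endpoint can be queried directly. This is a minimal sketch, assuming vLLM's default binding of localhost:8000:

```bash
# Send a one-shot chat request to the server started above
# (assumes the default host/port, localhost:8000)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen1.5-32B-Chat-AWQ",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```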
@@ -11,7 +11,7 @@ title: AutoGen
 - Set up the [AutoGen](https://microsoft.github.io/autogen/0.2/docs/installation/) environment

-```console
+```bash
 pip install vllm

 # Install AgentChat and OpenAI client from Extensions
@@ -23,7 +23,7 @@ pip install -U "autogen-agentchat" "autogen-ext[openai]"
 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 python -m vllm.entrypoints.openai.api_server \
     --model mistralai/Mistral-7B-Instruct-v0.2
 ```
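Before wiring AutoGen to this server, it can help to confirm the model is actually registered. A minimal sketch, assuming the api_server above is listening on its default port 8000:

```bash
# List the models the server advertises;
# mistralai/Mistral-7B-Instruct-v0.2 should appear in the response
curl http://localhost:8000/v1/models
```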
@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr
 To install the Cerebrium client, run:

-```console
+```bash
 pip install cerebrium
 cerebrium login
 ```

 Next, to create your Cerebrium project, run:

-```console
+```bash
 cerebrium init vllm-project
 ```
@@ -58,7 +58,7 @@ Next, let us add our code to handle inference for the LLM of your choice (`mistr
 Then, run the following command to deploy it to the cloud:

-```console
+```bash
 cerebrium deploy
 ```
@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac
 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve qwen/Qwen1.5-0.5B-Chat
 ```
@@ -18,13 +18,13 @@ This guide walks you through deploying Dify using a vLLM backend.
 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve Qwen/Qwen1.5-7B-Chat
 ```

 - Start the Dify server with Docker Compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)):

-```console
+```bash
 git clone https://github.com/langgenius/dify.git
 cd dify
 cd docker
@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),
 To install the dstack client, run:

-```console
+```bash
 pip install "dstack[all]"
 dstack server
 ```

 Next, to configure your dstack project, run:

-```console
+```bash
 mkdir -p vllm-dstack
 cd vllm-dstack
 dstack init
@@ -13,7 +13,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac
 - Set up the vLLM and Haystack environment

-```console
+```bash
 pip install vllm haystack-ai
 ```
@@ -21,7 +21,7 @@ pip install vllm haystack-ai
 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve mistralai/Mistral-7B-Instruct-v0.1
 ```
@@ -22,7 +22,7 @@ Before you begin, ensure that you have the following:
 To install the chart with the release name `test-vllm`:

-```console
+```bash
 helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
 ```
@@ -30,7 +30,7 @@ helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f val
 To uninstall the `test-vllm` deployment:

-```console
+```bash
 helm uninstall test-vllm --namespace=ns-vllm
 ```
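The workflow around either hunk can be verified with standard kubectl. A minimal sketch, assuming kubectl is configured against the cluster the chart was installed into:

```bash
# Pods should be Running after the install, and gone after the uninstall
kubectl get pods --namespace=ns-vllm
```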
@@ -18,7 +18,7 @@ And LiteLLM supports all models on VLLM.
 - Set up the vLLM and litellm environment

-```console
+```bash
 pip install vllm litellm
 ```
@@ -28,7 +28,7 @@ pip install vllm litellm
 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve qwen/Qwen1.5-0.5B-Chat
 ```
@@ -56,7 +56,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat
 - Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
 vllm serve BAAI/bge-base-en-v1.5
 ```
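Since vLLM exposes an OpenAI-compatible /v1/embeddings route for embedding models, the server above can be smoke-tested directly. A minimal sketch, assuming the default localhost:8000 binding:

```bash
# Request an embedding for a single input string
# (assumes the server above uses the default host/port, localhost:8000)
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-base-en-v1.5",
    "input": "vLLM is a fast inference engine for LLMs."
  }'
```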
@@ -7,13 +7,13 @@ title: Open WebUI
 2. Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve qwen/Qwen1.5-0.5B-Chat
 ```

 1. Start the [Open WebUI](https://github.com/open-webui/open-webui) docker container (replace the vllm serve host and vllm serve port):

-```console
+```bash
 docker run -d -p 3000:8080 \
     --name open-webui \
     -v open-webui:/app/backend/data \
@@ -15,7 +15,7 @@ Here are the integrations:
 - Set up the vLLM and langchain environment

-```console
+```bash
 pip install -U vllm \
     langchain_milvus langchain_openai \
     langchain_community beautifulsoup4 \
@@ -26,14 +26,14 @@ pip install -U vllm \
 - Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
 # Start embedding service (port 8000)
 vllm serve ssmits/Qwen2-7B-Instruct-embed-base
 ```

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 # Start chat service (port 8001)
 vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
 ```
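With both services up, a quick check that each port serves the expected model can precede the langchain wiring. A minimal sketch, assuming both servers bind to localhost as above:

```bash
# The embedding model should be listed on port 8000,
# and the chat model on port 8001
curl http://localhost:8000/v1/models
curl http://localhost:8001/v1/models
```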
@@ -52,7 +52,7 @@ python retrieval_augmented_generation_with_langchain.py
 - Set up the vLLM and llamaindex environment

-```console
+```bash
 pip install vllm \
     llama-index llama-index-readers-web \
     llama-index-llms-openai-like \
@@ -64,14 +64,14 @@ pip install vllm \
 - Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
 # Start embedding service (port 8000)
 vllm serve ssmits/Qwen2-7B-Instruct-embed-base
 ```

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 # Start chat service (port 8001)
 vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
 ```
@@ -15,7 +15,7 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet
 - Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
 - Check that `sky check` shows clouds or Kubernetes are enabled.

-```console
+```bash
 pip install skypilot-nightly
 sky check
 ```
@@ -71,7 +71,7 @@ See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypil
 Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

-```console
+```bash
 HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
 ```
@@ -83,7 +83,7 @@ Check the output of the command. There will be a shareable gradio link (like the
 **Optional**: Serve the 70B model instead of the default 8B and use more GPUs:

-```console
+```bash
 HF_TOKEN="your-huggingface-token" \
 sky launch serving.yaml \
   --gpus A100:8 \
@@ -159,7 +159,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
 Start serving the Llama-3 8B model on multiple replicas:

-```console
+```bash
 HF_TOKEN="your-huggingface-token" \
 sky serve up -n vllm serving.yaml \
   --env HF_TOKEN
@@ -167,7 +167,7 @@ HF_TOKEN="your-huggingface-token" \
 Wait until the service is ready:

-```console
+```bash
 watch -n10 sky serve status vllm
 ```
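Once the replicas report ready, the deployment can be smoke-tested through the service endpoint. A minimal sketch, assuming the service exposes vLLM's OpenAI-compatible API (the `--endpoint` flag is used the same way in the GUI step further below):

```bash
# Resolve the public endpoint of the service and list the served models
ENDPOINT=$(sky serve status --endpoint vllm)
curl http://$ENDPOINT/v1/models
```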
@@ -271,13 +271,13 @@ This will scale the service up to when the QPS exceeds 2 for each replica.
 To update the service with the new config:

-```console
+```bash
 HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
 ```

 To stop the service:

-```console
+```bash
 sky serve down vllm
 ```
@@ -317,7 +317,7 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,
 1. Start the chat web UI:

-```console
+```bash
 sky launch \
   -c gui ./gui.yaml \
   --env ENDPOINT=$(sky serve status --endpoint vllm)
@@ -15,13 +15,13 @@ It can be quickly integrated with vLLM as a backend API server, enabling powerfu
 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve qwen/Qwen1.5-0.5B-Chat
 ```

 - Install streamlit and openai:

-```console
+```bash
 pip install streamlit openai
 ```
@@ -29,7 +29,7 @@ pip install streamlit openai
 - Start the streamlit web UI and start chatting:

-```console
+```bash
 streamlit run streamlit_openai_chatbot_webserver.py

 # or specify the VLLM_API_BASE or VLLM_API_KEY
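The trailing comment refers to environment variables read by the example script; a hedged invocation, assuming the vLLM server from the first step is on localhost:8000 and accepts any API key, might look like:

```bash
# VLLM_API_BASE and VLLM_API_KEY are the variables named in the
# snippet above; the values here are assumptions for a local setup
VLLM_API_BASE="http://localhost:8000/v1" \
VLLM_API_KEY="EMPTY" \
streamlit run streamlit_openai_chatbot_webserver.py
```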