Make distinct code and console admonitions so readers are less likely to miss them (#20585)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-08 03:55:28 +01:00
parent 31c5d0a1b7
commit af107d5a0e
52 changed files with 192 additions and 162 deletions
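
The rename pattern in the hunks below could be scripted rather than applied by hand. What follows is a minimal sketch under assumptions, not the tooling used for this commit: the `ADMONITION_TYPES` mapping and the `rewrite_admonition` helper are hypothetical names, and the mapping is inferred only from the titles visible in this diff. `Config` is deliberately left out, because the commit maps it to `code` in some files and `console` in others depending on the block content.

```python
import re

# Hypothetical mapping inferred from the hunks in this diff. "Config" is
# omitted on purpose: the commit assigns it "code" or "console" case by case.
ADMONITION_TYPES = {
    "Code": "code",
    "Yaml": "code",
    "Command": "console",
    "Commands": "console",
    "Response": "console",
    "Output": "console",
}

# Matches an old-style collapsible header such as "??? Code" (no type, no quotes).
_HEADER = re.compile(r"^(?P<indent>\s*)\?\?\? (?P<title>\w+)\s*$")


def rewrite_admonition(line: str) -> str:
    """Return the new-style header for a line without its trailing newline.

    Lines that are not old-style headers, or whose title is not in the
    mapping, are returned unchanged.
    """
    match = _HEADER.match(line)
    if match is None:
        return line
    title = match.group("title")
    kind = ADMONITION_TYPES.get(title)
    if kind is None:
        return line
    indent = match.group("indent")
    # "??? Code" becomes plain "??? code"; other titles are kept in quotes.
    if title.lower() == kind:
        return f"{indent}??? {kind}"
    return f'{indent}??? {kind} "{title}"'
```

Applied line by line, this reproduces the unambiguous changes in the diff and leaves already-converted or unknown headers alone, so ambiguous titles still need a manual decision.
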
--- a/docs/deployment/docker.md
+++ b/docs/deployment/docker.md
@@ -97,7 +97,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
    flags to speed up build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefits.
    Keep an eye on memory usage with parallel jobs as it can be substantial (see example below).

-??? Command
+??? console "Command"

    ```bash
    # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
--- a/docs/deployment/frameworks/autogen.md
+++ b/docs/deployment/frameworks/autogen.md
@@ -30,7 +30,7 @@ python -m vllm.entrypoints.openai.api_server \

 - Call it with AutoGen:

-??? Code
+??? code

    ```python
    import asyncio
--- a/docs/deployment/frameworks/cerebrium.md
+++ b/docs/deployment/frameworks/cerebrium.md
@@ -34,7 +34,7 @@ vllm = "latest"

 Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`:

-??? Code
+??? code

    ```python
    from vllm import LLM, SamplingParams
@@ -64,7 +64,7 @@ cerebrium deploy

 If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)

-??? Command
+??? console "Command"

    ```python
    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
@@ -82,7 +82,7 @@ If successful, you should be returned a CURL command that you can call inference

 You should get a response like:

-??? Response
+??? console "Response"

    ```python
    {
--- a/docs/deployment/frameworks/dstack.md
+++ b/docs/deployment/frameworks/dstack.md
@@ -26,7 +26,7 @@ dstack init

 Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:

-??? Config
+??? code "Config"

    ```yaml
    type: service
@@ -48,7 +48,7 @@ Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-

 Then, run the following CLI for provisioning:

-??? Command
+??? console "Command"

    ```console
    $ dstack run . -f serve.dstack.yml
@@ -79,7 +79,7 @@ Then, run the following CLI for provisioning:

 After the provisioning, you can interact with the model by using the OpenAI SDK:

-??? Code
+??? code

    ```python
    from openai import OpenAI
--- a/docs/deployment/frameworks/haystack.md
+++ b/docs/deployment/frameworks/haystack.md
@@ -27,7 +27,7 @@ vllm serve mistralai/Mistral-7B-Instruct-v0.1

 - Use the `OpenAIGenerator` and `OpenAIChatGenerator` components in Haystack to query the vLLM server.

-??? Code
+??? code

    ```python
    from haystack.components.generators.chat import OpenAIChatGenerator
--- a/docs/deployment/frameworks/litellm.md
+++ b/docs/deployment/frameworks/litellm.md
@@ -34,7 +34,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat

 - Call it with litellm:

-??? Code
+??? code

    ```python
    import litellm 
--- a/docs/deployment/frameworks/lws.md
+++ b/docs/deployment/frameworks/lws.md
@@ -17,7 +17,7 @@ vLLM can be deployed with [LWS](https://github.com/kubernetes-sigs/lws) on Kuber

 Deploy the following yaml file `lws.yaml`

-??? Yaml
+??? code "Yaml"

    ```yaml
    apiVersion: leaderworkerset.x-k8s.io/v1
@@ -177,7 +177,7 @@ curl http://localhost:8080/v1/completions \

 The output should be similar to the following

-??? Output
+??? console "Output"

    ```text
    {
--- a/docs/deployment/frameworks/skypilot.md
+++ b/docs/deployment/frameworks/skypilot.md
@@ -24,7 +24,7 @@ sky check

 See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml).

-??? Yaml
+??? code "Yaml"

    ```yaml
    resources:
@@ -95,7 +95,7 @@ HF_TOKEN="your-huggingface-token" \

 SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a services section to the YAML file.

-??? Yaml
+??? code "Yaml"

    ```yaml
    service:
@@ -111,7 +111,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut
      max_completion_tokens: 1
    ```

-??? Yaml
+??? code "Yaml"

    ```yaml
    service:
@@ -186,7 +186,7 @@ vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  R

 After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:

-??? Commands
+??? console "Commands"

    ```bash
    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
@@ -220,7 +220,7 @@ service:

 This will scale the service up to when the QPS exceeds 2 for each replica.

-??? Yaml
+??? code "Yaml"

    ```yaml
    service:
@@ -285,7 +285,7 @@ sky serve down vllm

 It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests send to the GUI will be load-balanced across replicas.

-??? Yaml
+??? code "Yaml"

    ```yaml
    envs:
--- a/docs/deployment/integrations/production-stack.md
+++ b/docs/deployment/integrations/production-stack.md
@@ -60,7 +60,7 @@ And then you can send out a query to the OpenAI-compatible API to check the avai
 curl -o- http://localhost:30080/models
 ```

-??? Output
+??? console "Output"

    ```json
    {
@@ -89,7 +89,7 @@ curl -X POST http://localhost:30080/completions \
  }'
 ```

-??? Output
+??? console "Output"

    ```json
    {
@@ -121,7 +121,7 @@ sudo helm uninstall vllm

 The core vLLM production stack configuration is managed with YAML. Here is the example configuration used in the installation above:

-??? Yaml
+??? code "Yaml"

    ```yaml
    servingEngineSpec:
--- a/docs/deployment/k8s.md
+++ b/docs/deployment/k8s.md
@@ -29,7 +29,7 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:

 First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model:

-??? Config
+??? console "Config"

    ```bash
    cat <<EOF |kubectl apply -f -
@@ -57,7 +57,7 @@ First, create a Kubernetes PVC and Secret for downloading and storing Hugging Fa

 Next, start the vLLM server as a Kubernetes Deployment and Service:

-??? Config
+??? console "Config"

    ```bash
    cat <<EOF |kubectl apply -f -
--- a/docs/deployment/nginx.md
+++ b/docs/deployment/nginx.md
@@ -36,7 +36,7 @@ docker build . -f Dockerfile.nginx --tag nginx-lb

 Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the below example we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.

-??? Config
+??? console "Config"

    ```console
    upstream backend {
@@ -95,7 +95,7 @@ Notes:
 - The below example assumes GPU backend used. If you are using CPU backend, remove `--gpus device=ID`, add `VLLM_CPU_KVCACHE_SPACE` and `VLLM_CPU_OMP_THREADS_BIND` environment variables to the docker run command.
 - Adjust the model name that you want to use in your vLLM servers if you don't want to use `Llama-2-7b-chat-hf`.

-??? Commands
+??? console "Commands"

    ```console
    mkdir -p ~/.cache/huggingface/hub/