Make distinct code and console admonitions so readers are less likely to miss them (#20585)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
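For context, `???` begins a collapsible admonition in Material for MkDocs: the word after `???` selects the admonition type (which controls the icon and styling), and an optional quoted string sets the title. In the old markup the intended title was being parsed as an unrecognized type, so every block rendered with the same generic note look; the new markup uses the dedicated `code`/`console` types (presumably custom types styled in vLLM's docs theme) with an explicit title. A minimal before/after sketch (the body text is illustrative, not from the diff):

```markdown
??? Commands

    Old markup: "Commands" is parsed as an (unknown) admonition type,
    so this renders as a generic collapsed block.

??? console "Commands"

    New markup: type `console` picks the distinct styling, and the
    quoted string supplies the visible title "Commands".
```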
@@ -61,7 +61,7 @@ To address the above issues, I have designed and developed a local Tensor memory
 
 # Install vLLM
 
-??? Commands
+??? console "Commands"
 
     ```shell
     # Enter the home directory or your working directory.
@@ -106,7 +106,7 @@ python3 disagg_prefill_proxy_xpyd.py &
 
 ### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
 
-??? Command
+??? console "Command"
 
     ```shell
     VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
@@ -128,7 +128,7 @@ python3 disagg_prefill_proxy_xpyd.py &
 
 ### Decode1 (e.g. 10.0.1.3 or 10.0.1.1)
 
-??? Command
+??? console "Command"
 
     ```shell
     VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
@@ -150,7 +150,7 @@ python3 disagg_prefill_proxy_xpyd.py &
 
 ### Decode2 (e.g. 10.0.1.4 or 10.0.1.1)
 
-??? Command
+??? console "Command"
 
     ```shell
     VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \
@@ -172,7 +172,7 @@ python3 disagg_prefill_proxy_xpyd.py &
 
 ### Decode3 (e.g. 10.0.1.5 or 10.0.1.1)
 
-??? Command
+??? console "Command"
 
     ```shell
     VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \
@@ -203,7 +203,7 @@ python3 disagg_prefill_proxy_xpyd.py &
 
 ### Prefill1 (e.g. 10.0.1.2 or 10.0.1.1)
 
-??? Command
+??? console "Command"
 
     ```shell
     VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=0 vllm serve {your model directory} \
@@ -225,7 +225,7 @@ python3 disagg_prefill_proxy_xpyd.py &
 
 ### Prefill2 (e.g. 10.0.1.3 or 10.0.1.1)
 
-??? Command
+??? console "Command"
 
     ```shell
     VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=1 vllm serve {your model directory} \
@@ -247,7 +247,7 @@ python3 disagg_prefill_proxy_xpyd.py &
 
 ### Prefill3 (e.g. 10.0.1.4 or 10.0.1.1)
 
-??? Command
+??? console "Command"
 
     ```shell
     VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=2 vllm serve {your model directory} \
@@ -269,7 +269,7 @@ python3 disagg_prefill_proxy_xpyd.py &
 
 ### Decode1 (e.g. 10.0.1.5 or 10.0.1.1)
 
-??? Command
+??? console "Command"
 
     ```shell
     VLLM_USE_V1=1 CUDA_VISIBLE_DEVICES=3 vllm serve {your model directory} \
@@ -304,7 +304,7 @@ curl -X POST -s http://10.0.1.1:10001/v1/completions \
 
 # Benchmark
 
-??? Command
+??? console "Command"
 
     ```shell
     python3 benchmark_serving.py \
@@ -28,7 +28,7 @@ A unique aspect of vLLM's `torch.compile` integration, is that we guarantee all
 
 In the very verbose logs, we can see:
 
-??? Logs
+??? console "Logs"
 
     ```text
     DEBUG 03-07 03:06:52 [decorators.py:203] Start compiling function <code object forward at 0x7f08acf40c90, file "xxx/vllm/model_executor/models/llama.py", line 339>
@@ -110,7 +110,7 @@ Then it will also compile a specific kernel just for batch size `1, 2, 4, 8`. At
 
 When all the shapes are known, `torch.compile` can compare different configs, and often find some better configs to run the kernel. For example, we can see the following log:
 
-??? Logs
+??? console "Logs"
 
     ```
     AUTOTUNE mm(8x2048, 2048x3072)
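One detail these hunks rely on: everything indented four spaces under the `???` line belongs to the admonition, including nested fenced blocks. A sketch of the post-change source for the logs block above (log line truncated):

````markdown
??? console "Logs"

    ```text
    DEBUG 03-07 03:06:52 [decorators.py:203] Start compiling function <code object forward ...>
    ```
````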