[Doc] Convert docs to use colon fences (#12471)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
This commit is contained in:
@@ -16,10 +16,10 @@ Currently, there are no pre-built ROCm wheels.
|
||||
However, the [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
|
||||
docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
|
||||
|
||||
```{tip}
|
||||
:::{tip}
|
||||
Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
|
||||
for instructions on how to use this prebuilt docker image.
|
||||
```
|
||||
:::
|
||||
|
||||
### Build wheel from source
|
||||
|
||||
@@ -47,9 +47,9 @@ for instructions on how to use this prebuilt docker image.
|
||||
cd ../..
|
||||
```
|
||||
|
||||
```{note}
|
||||
- If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
|
||||
```
|
||||
:::{note}
|
||||
If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
|
||||
:::
|
||||
|
||||
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)
|
||||
|
||||
@@ -67,9 +67,9 @@ for instructions on how to use this prebuilt docker image.
|
||||
cd ..
|
||||
```
|
||||
|
||||
```{note}
|
||||
- You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
|
||||
```
|
||||
:::{note}
|
||||
You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
|
||||
:::
|
||||
|
||||
3. Build vLLM. For example, vLLM on ROCM 6.2 can be built with the following steps:
|
||||
|
||||
@@ -95,17 +95,18 @@ for instructions on how to use this prebuilt docker image.
|
||||
|
||||
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
|
||||
|
||||
```{tip}
|
||||
<!--- pyml disable-num-lines 5 ul-indent-->
|
||||
:::{tip}
|
||||
- Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
|
||||
- Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
|
||||
- To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
|
||||
- The ROCm version of PyTorch, ideally, should match the ROCm driver version.
|
||||
```
|
||||
:::
|
||||
|
||||
```{tip}
|
||||
:::{tip}
|
||||
- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
|
||||
For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
|
||||
```
|
||||
:::
|
||||
|
||||
## Set up using Docker
|
||||
|
||||
|
||||
Reference in New Issue
Block a user