[Doc] Convert docs to use colon fences (#12471)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -4,19 +4,19 @@
 
 This document provides an overview of the vLLM architecture.
 
-```{contents} Table of Contents
+:::{contents} Table of Contents
 :depth: 2
 :local: true
-```
+:::
 
 ## Entrypoints
 
 vLLM provides a number of entrypoints for interacting with the system. The
 following diagram shows the relationship between them.
 
-```{image} /assets/design/arch_overview/entrypoints.excalidraw.png
+:::{image} /assets/design/arch_overview/entrypoints.excalidraw.png
 :alt: Entrypoints Diagram
-```
+:::
 
 ### LLM Class
 
@@ -84,9 +84,9 @@ More details on the API server can be found in the [OpenAI-Compatible Server](#o
 
 The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
 the vLLM system, handling model inference and asynchronous request processing.
 
-```{image} /assets/design/arch_overview/llm_engine.excalidraw.png
+:::{image} /assets/design/arch_overview/llm_engine.excalidraw.png
 :alt: LLMEngine Diagram
-```
+:::
 
 ### LLMEngine
 
@@ -144,11 +144,11 @@ configurations affect the class we ultimately get.
 
 The following figure shows the class hierarchy of vLLM:
 
-> ```{figure} /assets/design/hierarchy.png
+> :::{figure} /assets/design/hierarchy.png
 > :align: center
 > :alt: query
 > :width: 100%
-> ```
+> :::
 
 There are several important design choices behind this class hierarchy:
 
@@ -178,7 +178,7 @@ of a vision model and a language model. By making the constructor uniform, we
 can easily create a vision model and a language model and compose them into a
 vision-language model.
 
-````{note}
+:::{note}
 To support this change, all vLLM models' signatures have been updated to:
 
 ```python
@@ -215,7 +215,7 @@ else:
 ```
 
 This way, the model can work with both old and new versions of vLLM.
-````
+:::
 
 3\. **Sharding and Quantization at Initialization**: Certain features require
 changing the model weights. For example, tensor parallelism needs to shard the
@@ -139,26 +139,26 @@
 const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
 ```
 
-```{figure} ../../assets/kernel/query.png
+:::{figure} ../../assets/kernel/query.png
 :align: center
 :alt: query
 :width: 70%
 
 Query data of one token at one head
-```
+:::
 
 - Each thread defines its own `q_ptr` which points to the assigned
   query token data on global memory. For example, if `VEC_SIZE` is 4
   and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
   total of 128 elements divided into 128 / 4 = 32 vecs.
 
-```{figure} ../../assets/kernel/q_vecs.png
+:::{figure} ../../assets/kernel/q_vecs.png
 :align: center
 :alt: q_vecs
 :width: 70%
 
 `q_vecs` for one thread group
-```
+:::
 
 ```cpp
 __shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
@@ -195,13 +195,13 @@
   points to key token data based on `k_cache` at assigned block,
   assigned head and assigned token.
 
-```{figure} ../../assets/kernel/key.png
+:::{figure} ../../assets/kernel/key.png
 :align: center
 :alt: key
 :width: 70%
 
 Key data of all context tokens at one head
-```
+:::
 
 - The diagram above illustrates the memory layout for key data. It
   assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
@@ -214,13 +214,13 @@
   elements for one token) that will be processed by 2 threads (one
   thread group) separately.
 
-```{figure} ../../assets/kernel/k_vecs.png
+:::{figure} ../../assets/kernel/k_vecs.png
 :align: center
 :alt: k_vecs
 :width: 70%
 
 `k_vecs` for one thread
-```
+:::
 
 ```cpp
 K_vec k_vecs[NUM_VECS_PER_THREAD]
@@ -289,14 +289,14 @@
 should be performed across the entire thread block, encompassing
 results between the query token and all context key tokens.
 
-```{math}
+:::{math}
 :nowrap: true
 
 \begin{gather*}
 m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\
 \quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)}
 \end{gather*}
-```
+:::
 
 ### `qk_max` and `logits`
 
@@ -379,29 +379,29 @@
 
 ## Value
 
-```{figure} ../../assets/kernel/value.png
+:::{figure} ../../assets/kernel/value.png
 :align: center
 :alt: value
 :width: 70%
 
 Value data of all context tokens at one head
-```
+:::
 
-```{figure} ../../assets/kernel/logits_vec.png
+:::{figure} ../../assets/kernel/logits_vec.png
 :align: center
 :alt: logits_vec
 :width: 50%
 
 `logits_vec` for one thread
-```
+:::
 
-```{figure} ../../assets/kernel/v_vec.png
+:::{figure} ../../assets/kernel/v_vec.png
 :align: center
 :alt: v_vec
 :width: 70%
 
 List of `v_vec` for one thread
-```
+:::
 
 - Now we need to retrieve the value data and perform dot multiplication
   with `logits`. Unlike query and key, there is no thread group
@@ -7,9 +7,9 @@ page for information on known issues and how to solve them.
 
 ## Introduction
 
-```{important}
+:::{important}
 The source code references are to the state of the code at the time of writing in December, 2024.
-```
+:::
 
 The use of Python multiprocessing in vLLM is complicated by:
 
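The rewrite applied across all of these hunks is mechanical: an opening backtick directive fence such as ```{figure} becomes a colon fence of the same length, and its matching closing fence becomes `:::` (or `::::` for the four-backtick `note` directives), while plain code fences like ```python are left alone. A small script along these lines could automate it — this is a hypothetical sketch, not the tool actually used for this commit:

```python
import re

# Matches an opening directive fence such as ```{figure} path or ````{note}
FENCE_OPEN = re.compile(r"^(\s*)(`{3,})\{([\w-]+)\}(.*)$")

def convert_fences(text: str) -> str:
    """Convert MyST backtick directive fences to colon fences of the
    same length (``` -> :::, ```` -> ::::). Plain code fences such as
    ```python are left untouched."""
    out = []
    stack = []          # tick strings of directives still awaiting their close
    in_code = False     # inside a plain (non-directive) code fence
    for line in text.splitlines():
        stripped = line.strip()
        if in_code:
            out.append(line)
            if stripped == "```":
                in_code = False
            continue
        m = FENCE_OPEN.match(line)
        if m:
            indent, ticks, name, rest = m.groups()
            stack.append(ticks)
            out.append(f"{indent}{':' * len(ticks)}{{{name}}}{rest}")
        elif stack and stripped == stack[-1]:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + ":" * len(stack.pop()))
        elif re.match(r"^\s*```\w", line):  # opening a plain code fence
            out.append(line)
            in_code = True
        else:
            out.append(line)
    return "\n".join(out)
```

Tracking the tick length of each open directive on a stack is what lets a four-backtick ````{note} safely wrap a three-backtick ```python block, as in the hunks above: the python block's closing ``` cannot match the four-tick entry on the stack.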