# Llama Stack
vLLM is also available via [Llama Stack](https://github.com/llamastack/llama-stack).
To install Llama Stack, run:
```bash
pip install llama-stack -q
```
## Inference using OpenAI-Compatible API
Then start the Llama Stack server and configure it to point to your vLLM server with the following settings:
```yaml
inference:
  - provider_id: vllm0
    provider_type: remote::vllm
    config:
      url: http://127.0.0.1:8000
```
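The settings above assume a vLLM server is already listening on the configured URL. A minimal way to start one with vLLM's OpenAI-compatible server (the model name here is only an example; substitute any model you have access to):

```bash
# Launch vLLM's OpenAI-compatible server on the port used in the config above.
# meta-llama/Llama-3.1-8B-Instruct is illustrative; any supported model works.
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```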
Please refer to [this guide](https://llama-stack.readthedocs.io/en/latest/providers/inference/remote_vllm.html) for more details on the remote vLLM provider.
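Once both servers are running, you can exercise the remote provider from Python. This is a sketch assuming the `llama-stack-client` package and a Llama Stack server at a local address; the base URL, model name, and exact response attributes may differ across versions, so treat them as placeholders:

```python
# Sketch: query a Llama Stack server backed by the remote vLLM provider.
# base_url and model_id below are assumptions -- adjust to your deployment.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://127.0.0.1:8321")

response = client.inference.chat_completion(
    model_id="Llama3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.completion_message.content)
```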
## Inference using Embedded vLLM
An [inline provider](https://github.com/llamastack/llama-stack/tree/main/llama_stack/providers/inline/inference)
is also available. Here is a sample configuration using that method:
```yaml
inference:
  - provider_type: vllm
    config:
      model: Llama3.1-8B-Instruct
      tensor_parallel_size: 4
```