# AWS Neuron
[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
generative AI workloads on AWS Inferentia- and AWS Trainium-powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully independent heterogeneous compute units called NeuronCores.
This guide describes how to set up your environment to run vLLM on Neuron.
!!! warning
There are no pre-built wheels or images for this device, so you must build vLLM from source.
## Requirements
- OS: Linux
- Python: 3.9 or newer
- PyTorch 2.5/2.6
- Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
- AWS Neuron SDK 2.23
## Configure a new environment
### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies
The easiest way to launch a Trainium or Inferentia instance with pre-installed Neuron dependencies is to follow this
[quick start guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/multiframework/multi-framework-ubuntu22-neuron-dlami.html#setup-ubuntu22-multi-framework-dlami) using the Neuron Deep Learning AMI (Amazon Machine Image).

- After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance.
- Once inside your instance, activate the pre-installed virtual environment for inference by running:

```bash
source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
```
Refer to the [NxD Inference Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-setup.html)
for alternative setup instructions, including using Docker and manually installing dependencies.
!!! note
    NxD Inference is the default and recommended backend for running inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
    library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).
## Set up using Python
### Pre-built wheels
Currently, there are no pre-built Neuron wheels.
### Build wheel from source
To build and install vLLM from source, run:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -U -r requirements/neuron.txt
VLLM_TARGET_DEVICE="neuron" pip install -e .
```
AWS Neuron maintains a [GitHub fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2), which contains several features in addition to what's
available on vLLM V0. Please use the AWS fork for the following features:

- Llama-3.2 multi-modal support
- Multi-node distributed inference

Refer to the [vLLM User Guide for NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html)
for more details and usage examples.

To install the AWS Neuron fork, run:

```bash
git clone -b neuron-2.23-vllm-v0.7.2 https://github.com/aws-neuron/upstreaming-to-vllm.git
cd upstreaming-to-vllm
pip install -r requirements/neuron.txt
VLLM_TARGET_DEVICE="neuron" pip install -e .
```
Note that the AWS Neuron fork is intended only to support Neuron hardware; compatibility with other hardware is not tested.
## Set up using Docker
### Pre-built images
Currently, there are no pre-built Neuron images.
### Build image from source
See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dockerfile.
## Extra information
[](){ #feature-support-through-nxd-inference-backend }
### Feature support through NxD Inference backend
The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
to perform most of the heavy lifting, which includes PyTorch model initialization, compilation, and runtime execution. Therefore, most
[features supported on Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html) are also available via the vLLM integration.
To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
as a dictionary (or a JSON object when starting vLLM from the CLI). For example, to disable auto-bucketing, include:

```python
override_neuron_config={
    "enable_bucketing": False,
}
```
or, when launching vLLM from the CLI, pass:

```bash
--override-neuron-config "{\"enable_bucketing\":false}"
```
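When the override is generated from a script, it can be easier to build the JSON value with Python's standard `json` module than to hand-escape quotes in the shell. A small illustrative sketch (the `enable_bucketing` key is the one shown above; any other keys would follow the same pattern):

```python
import json

# Build the value for --override-neuron-config programmatically.
# json.dumps emits lowercase JSON booleans, so no manual escaping is needed.
overrides = {"enable_bucketing": False}
arg = json.dumps(overrides)
print(arg)  # {"enable_bucketing": false}
```

The resulting string can then be passed verbatim as the value of `--override-neuron-config`.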
Alternatively, you can directly call the NxDI library to trace and compile your model, then load the pre-compiled artifacts
(via the `NEURON_COMPILED_ARTIFACTS` environment variable) into vLLM to run inference workloads.

### Known limitations
- EAGLE speculative decoding: NxD Inference requires the EAGLE draft checkpoint to include the LM head weights from the target model. Refer to this
  [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility)
  for how to convert pretrained EAGLE model checkpoints to be compatible with NxDI.
- Quantization: the native quantization flow in vLLM is not well supported on NxD Inference. It is recommended to follow this
  [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
  to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM.
- Multi-LoRA serving: NxD Inference only supports loading LoRA adapters at server startup; dynamic loading of LoRA adapters at
  runtime is not currently supported. Refer to the [multi-LoRA example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py) for usage.
- Multi-modal support: multi-modal support is only available through the AWS Neuron fork. This feature has not been upstreamed
  to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support it.
- Multi-node support: distributed inference across multiple Trainium/Inferentia instances is only supported on the AWS Neuron fork. Refer
  to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node)
  for how to run it. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
- Known edge-case bug in speculative decoding: an edge-case failure may occur in speculative decoding when the sequence length approaches
  the max model length (e.g. when requesting max tokens up to the max model length and ignoring EOS). In this scenario, vLLM may attempt
  to allocate an additional block to ensure there is enough memory for the lookahead slots, but because paged attention is not well
  supported, there isn't another Neuron block for vLLM to allocate. A workaround (terminating one iteration early) is
  implemented in the AWS Neuron fork but is not upstreamed to vLLM main because it modifies core vLLM logic.
### Environment variables
- `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
  compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
  artifacts under a `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this variable is set
  but the directory does not exist or its contents are invalid, Neuron will fall back to a new compilation and store the artifacts
  under the specified path.
- `NEURON_CONTEXT_LENGTH_BUCKETS`: bucket sizes for context encoding (only applicable to the `transformers-neuronx` backend).
- `NEURON_TOKEN_GEN_BUCKETS`: bucket sizes for token generation (only applicable to the `transformers-neuronx` backend).
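The `NEURON_COMPILED_ARTIFACTS` lookup described above can be pictured with a short sketch. This is a hypothetical helper, not vLLM's actual implementation; in particular, the hash-based sub-directory naming is omitted:

```python
import os
from pathlib import Path

def resolve_artifacts_dir(model_path: str) -> Path:
    """Illustrative sketch of the documented artifact lookup."""
    env_dir = os.environ.get("NEURON_COMPILED_ARTIFACTS")
    if env_dir:
        # If the directory exists and is valid, its artifacts are reused and
        # compilation is skipped; if it is missing or invalid, Neuron
        # recompiles and stores the new artifacts under this same path.
        return Path(env_dir)
    # Unset: compile and save under the model path (the real module appends
    # a neuron-compiled-artifacts/{unique_hash}/ sub-directory).
    return Path(model_path) / "neuron-compiled-artifacts"
```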