[Hardware][CPU] Support MOE models on x86 CPU (#11831)

Signed-off-by: jiang1.li <jiang1.li@intel.com>
Author: Li, Jiang
Date: 2025-01-11 00:07:58 +08:00
Committed by: GitHub
Parent: 5959564f94
Commit: aa1e77a19c
3 changed files with 43 additions and 4 deletions


@@ -5,7 +5,7 @@
 vLLM initially supports basic model inference and serving on the x86 CPU platform, with data types FP32, FP16 and BF16. The vLLM CPU backend supports the following vLLM features:
 - Tensor Parallel
-- Model Quantization (`INT8 W8A8, AWQ`)
+- Model Quantization (`INT8 W8A8, AWQ, GPTQ`)
 - Chunked-prefill
 - Prefix-caching
 - FP8-E5M2 KV-Caching (TODO)