Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) (#12010)
Signed-off-by: Nir David <ndavid@habana.ai>
Signed-off-by: Uri Livne <ulivne@habana.ai>
Co-authored-by: Uri Livne <ulivne@habana.ai>
@@ -28,7 +28,7 @@ To verify that the Intel Gaudi software was correctly installed, run:
hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
-pip list | grep neural # verify that neural_compressor is installed
+pip list | grep neural # verify that neural_compressor_pt is installed
```

Refer to [Intel Gaudi Software Stack Verification](https://docs.habana.ai/en/latest/Installation_Guide/SW_Verification.html#platform-upgrade)
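
Beyond `pip list`, a quick import check confirms that the PyTorch flavor of INC is actually loadable. The `neural_compressor.torch` module path below is an assumption based on how the `neural_compressor_pt` package is typically laid out, not something this diff verifies:

```
# assumed module path for neural_compressor_pt; a clean exit means the
# PyTorch backend of Intel Neural Compressor is importable
python -c "import neural_compressor.torch"
```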
@@ -120,12 +120,13 @@ docker run \
- Inference with [HPU Graphs](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html) for accelerating latency and throughput at low batch sizes (see the first sketch after this list)
- Attention with Linear Biases (ALiBi)
+- INC quantization (FP8; see the second sketch after this list)
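
Two usage sketches follow. First, serving with HPU Graphs left enabled: a minimal sketch, assuming `VLLM_GRAPH_RESERVED_MEM` is the Gaudi-backend knob described in the HPU documentation; the model name and value are illustrative:

```
# HPU Graphs are used by default on Gaudi; omit --enforce-eager to keep them enabled.
export VLLM_GRAPH_RESERVED_MEM=0.1  # assumed knob: fraction of HPU memory reserved for graph capture
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct
```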
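
Second, an FP8 inference run through INC: a minimal sketch, assuming the `--quantization inc` and `--kv-cache-dtype fp8_inc` options this commit introduces, with a pre-generated INC measurement/quantization JSON pointed to by `QUANT_CONFIG` (path and model are illustrative):

```
export QUANT_CONFIG=/path/to/inc_quant_config.json  # illustrative path to an INC quantization config
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantization inc \
    --kv-cache-dtype fp8_inc
```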
### Unsupported features
- Beam search
- LoRA adapters
-- Quantization
+- AWQ quantization
- Prefill chunking (mixed-batch inferencing)
### Supported configurations