# Quantization

Quantization is a widely used model compression technique that reduces model size while also improving inference and training latency. Full-precision data is converted to a low-precision representation with little degradation in model accuracy, and the quantized model gains performance by saving memory bandwidth and accelerating computation with low-precision instructions. Intel provides several lower-precision instructions (for example, 8-bit and 16-bit multipliers), and both training and inference can benefit from them. Refer to the Intel article on [lower numerical precision inference and training in deep learning](https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html). A minimal numeric sketch of the underlying low-precision mapping is shown after the support matrix below.

## Quantization Support Matrix

Quantization methods include the following three types:
| Types | Quantization | Dataset Requirements | Framework | Backend |
|---|---|---|---|---|
| Post-Training Static Quantization (PTQ) | weights and activations | calibration | PyTorch | PyTorch Eager/PyTorch FX/IPEX |
| | | | TensorFlow | TensorFlow/Intel TensorFlow |
| | | | ONNX Runtime | QLinearops/QDQ |
| Post-Training Dynamic Quantization | weights | none | PyTorch | PyTorch Eager/PyTorch FX/IPEX |
| | | | ONNX Runtime | QIntegerops |
| Quantization-aware Training (QAT) | weights and activations | fine-tuning | PyTorch | PyTorch Eager/PyTorch FX/IPEX |
| | | | TensorFlow | TensorFlow/Intel TensorFlow |
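
To make the mapping from full precision to low precision concrete, the snippet below is a minimal NumPy sketch of the standard affine (scale/zero-point) scheme for unsigned 8-bit quantization. The helper names and tensor values are illustrative only and do not correspond to any particular framework backend from the matrix above.

```python
import numpy as np

def quantize_uint8(x: np.ndarray):
    """Affine-quantize a float32 tensor to uint8 (illustrative helper)."""
    scale = (x.max() - x.min()) / 255.0           # step size between quantized levels
    zero_point = np.round(-x.min() / scale)       # integer representing the real value 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize_uint8(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Map uint8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)      # full-precision data (example only)
q, scale, zp = quantize_uint8(x)
x_hat = dequantize_uint8(q, scale, zp)
print("max abs error:", np.abs(x - x_hat).max())  # small: accuracy degrades only slightly
```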
## Examples of Quantization
For quantization-related examples, please refer to [Quantization examples](/examples/README.md). A minimal PyTorch sketch of post-training dynamic quantization follows below.
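
As a quick illustration of the Post-Training Dynamic Quantization row in the matrix above, the snippet below applies plain PyTorch eager-mode dynamic quantization to a toy model. The model definition is made up for illustration; this is a sketch rather than a complete recipe, so see the linked examples for validated end-to-end flows.

```python
import torch
import torch.nn as nn

# Toy full-precision model (illustrative only).
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()

# Post-training dynamic quantization: weights are quantized to int8 ahead of time,
# activations are quantized on the fly at runtime, so no calibration dataset is needed.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized_model(x).shape)  # same output shape, smaller weights
```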