Calibration Algorithms in Quantization
Introduction
Quantization reduces the memory footprint and computational cost of a model. Uniform quantization maps an input value $x \in [\beta, \alpha]$ into the integer range $[-2^{b-1}, 2^{b-1} - 1]$, where $[\beta, \alpha]$ is the range of real values chosen for quantization and $b$ is the bit width of the signed integer representation. Calibration is the process of determining $\alpha$ and $\beta$ for model weights and activations. Refer to this link for more quantization fundamentals.
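As a concrete illustration, the affine mapping above can be written in a few lines of NumPy. This is a minimal sketch based only on the definition given here, not Intel® Neural Compressor's internal code; the function name and the scale/offset derivation are this document's own.

```python
import numpy as np

def quantize_uniform(x, alpha, beta, b=8):
    """Map real values x in [beta, alpha] to signed b-bit integers.

    scale is the real-valued size of one integer step; values outside
    [beta, alpha] saturate at the ends of the integer range.
    """
    scale = (alpha - beta) / (2**b - 1)
    q = np.round((x - beta) / scale) - 2**(b - 1)
    return np.clip(q, -(2**(b - 1)), 2**(b - 1) - 1).astype(np.int32)
```

With $b = 8$, $\beta$ maps to $-128$ and $\alpha$ to $127$, so the $[\beta, \alpha]$ chosen during calibration directly controls both clipping error and rounding resolution.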
Calibration Algorithms
Currently, Intel® Neural Compressor supports three popular calibration algorithms:
- **MinMax**: This method uses the minimum and maximum of the observed input values as $\beta$ and $\alpha$ [^1]. It preserves the entire range and is the simplest approach.
- **Entropy**: This method minimizes the KL divergence between the full-precision and quantized data to reduce information loss [^2]. Its primary focus is on preserving the most essential information rather than the full range.
- **Percentile**: This method considers only a specified percentage of values when computing the range, discarding the remainder, which may contain outliers [^3]. It improves resolution by excluding extreme values while retaining the bulk of the data.
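The NumPy sketch below illustrates how each algorithm might derive $[\beta, \alpha]$ from calibration data. It is a simplified teaching version, not Intel® Neural Compressor's implementation, and the function names are this document's own; the entropy routine follows the general outline of the TensorRT calibration procedure [^2].

```python
import numpy as np

def minmax_range(x):
    """MinMax: the observed extremes become the quantization range."""
    return float(x.min()), float(x.max())

def percentile_range(x, pct=99.99):
    """Percentile: clip both tails so outliers do not stretch the range."""
    return (float(np.percentile(x, 100.0 - pct)),
            float(np.percentile(x, pct)))

def entropy_range(x, num_bins=2048, num_quant_bins=128):
    """Entropy: choose the symmetric threshold whose clipped-and-requantized
    histogram has the smallest KL divergence from the original histogram."""
    hist, edges = np.histogram(np.abs(x), bins=num_bins)
    hist = hist.astype(np.float64)
    best_kl, best_t = np.inf, edges[-1]
    for i in range(num_quant_bins, num_bins + 1):
        # Reference distribution p: clip mass beyond bin i into the last bin.
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()
        # Candidate q: merge the i bins into num_quant_bins levels, then
        # spread each level's mass uniformly over its nonzero positions.
        q = np.zeros(i)
        for chunk in np.array_split(np.arange(i), num_quant_bins):
            nz = p[chunk] > 0
            if nz.any():
                q[chunk[nz]] = p[chunk].sum() / nz.sum()
        p, q = p / p.sum(), q / q.sum()
        mask = p > 0
        kl = float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], 1e-12))))
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return -best_t, best_t  # symmetric range [beta, alpha]
```

Note that `entropy_range` returns a symmetric range around zero, which matches how `kl` calibration is typically applied to activation distributions (see the support matrix below).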
Support Matrix
| Framework | Calibration algorithm (weight) | Calibration algorithm (activation) |
|---|---|---|
| PyTorch | minmax | minmax, kl |
| TensorFlow | minmax | minmax, kl |
| MXNet | minmax | minmax, kl |
| ONNX Runtime | minmax | minmax, kl, percentile |

`kl` denotes the Entropy calibration algorithm in Intel® Neural Compressor.
The calibration algorithm is one of the options tuned during Intel® Neural Compressor auto-tuning: the accuracy-aware tuning process selects an appropriate algorithm automatically. Please refer to tuning_strategies.html for more details.
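If you prefer to pin the algorithm rather than let the tuner choose, it can be constrained through the quantization configuration. The snippet below assumes the Intel® Neural Compressor 2.x Python API (`PostTrainingQuantConfig` with an `op_type_dict` tuning-space entry); treat the exact field names as illustrative and check the current API reference, and note that `model` and `calib_dataloader` are placeholders for your FP32 model and calibration data.

```python
from neural_compressor import PostTrainingQuantConfig
from neural_compressor.quantization import fit

# Constrain the tuning space so activations of Conv ops are calibrated
# with the kl (Entropy) algorithm; schema follows the INC 2.x op_type_dict
# convention (illustrative usage, not the only way to set this).
conf = PostTrainingQuantConfig(
    op_type_dict={"Conv": {"activation": {"algorithm": ["kl"]}}}
)
q_model = fit(model=model, conf=conf, calib_dataloader=calib_dataloader)
```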
References
[^1]: Vanhoucke, Vincent, Andrew Senior, and Mark Z. Mao. “Improving the speed of neural networks on CPUs.” (2011).
[^2]: Migacz, Szymon. "8-bit Inference with TensorRT." (2017).
[^3]: McKinstry, Jeffrey L., et al. “Discovering low-precision networks close to full-precision networks for efficient embedded inference.” arXiv preprint arXiv:1809.04191 (2018).