Intel® Extension for PyTorch* optimizations for quantization

The quantization functionality in Intel® Extension for PyTorch* currently only supports post-training quantization. This tutorial introduces how quantization works in Intel® Extension for PyTorch*.

We reuse PyTorch quantization components as much as possible, such as the PyTorch Observer method. To make it easy for PyTorch users, the quantization API in Intel® Extension for PyTorch* is very similar to that of PyTorch. Intel® Extension for PyTorch* quantization supports a default recipe that automatically decides which operators should be quantized, which brings a satisfying performance and accuracy tradeoff.

Static Quantization

import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

Define QConfig

Using the default qconfig (recommended):

qconfig = ipex.quantization.default_static_qconfig
# equal to
# QConfig(activation=HistogramObserver.with_args(reduce_range=False),
#         weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric)) 

or define your own qconfig as:

from torch.ao.quantization import MinMaxObserver, PerChannelMinMaxObserver, QConfig
qconfig = QConfig(activation=MinMaxObserver.with_args(qscheme=torch.per_tensor_affine, dtype=torch.quint8),
                  weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric))

Note: we fully rely on PyTorch observer methods, so you can use a different PyTorch observer method to define the QConfig. For the weight observer, only the torch.qint8 dtype is supported now.

Suggestion:

  1. For the activation observer, if using torch.per_tensor_affine as the qscheme, torch.quint8 is preferred; if using torch.per_tensor_symmetric, torch.qint8 is preferred. For the weight observer, setting the qscheme to torch.per_channel_symmetric can give better accuracy.

  2. If your CPU device doesn't support VNNI (e.g., Skylake), setting the observer's reduce_range to True can give better accuracy, as sketched below.
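
For example, a static qconfig with reduce_range enabled might look like the following. This is a minimal sketch, not a required recipe; the observer choices simply mirror the custom qconfig shown above, and the name qconfig_no_vnni is only for illustration.

import torch
from torch.ao.quantization import MinMaxObserver, PerChannelMinMaxObserver, QConfig
# reduce_range=True quantizes activations to 7 bits instead of 8, which helps avoid
# accumulation overflow on CPUs without VNNI (e.g., Skylake).
qconfig_no_vnni = QConfig(activation=MinMaxObserver.with_args(qscheme=torch.per_tensor_affine, dtype=torch.quint8, reduce_range=True),
                          weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric))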

Prepare Model and Do Calibration

# Prepare the model: fold conv+bn and initialize the model's quantization state.
user_model = ...
user_model.eval()
example_inputs = ...
prepared_model = prepare(user_model, qconfig, example_inputs=example_inputs, inplace=False)

for x in calibration_data_set:
    prepared_model(x)

# Optional: if you want to tune for performance or accuracy, you can save the qparams as a JSON
# file, which includes the quantization state, such as scales, zero points and inference dtypes.
# You can then change the settings in the JSON file and load the changed file back into the model,
# which overrides the model's original quantization settings.
#  
# prepared_model.save_qconf_summary(qconf_summary = "configure.json")
# prepared_model.load_qconf_summary(qconf_summary = "configure.json")

Convert to Static Quantized Model and Deploy

# make sure example_inputs has the same size as the real input
convert_model = convert(prepared_model)
with torch.no_grad():
    traced_model = torch.jit.trace(convert_model, example_inputs)
    traced_model = torch.jit.freeze(traced_model)
    traced_model = torch.jit.freeze(traced_model)
# for inference 
y = traced_model(x)

# or save the model to deploy

# traced_model.save("quantized_model.pt")
# quantized_model = torch.jit.load("quantized_model.pt")
# quantized_model = torch.jit.freeze(quantized_model.eval())
# ...

Dynamic Quantization

import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

Define QConfig

Using the default qconfig (recommended):

dynamic_qconfig = ipex.quantization.default_dynamic_qconfig
# equal to 
# QConfig(activation=PlaceholderObserver.with_args(dtype=torch.float, is_dynamic=True),
#         weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric))

or define your own qconfig as:

from torch.ao.quantization import MinMaxObserver, PlaceholderObserver, QConfig
dynamic_qconfig = QConfig(activation = PlaceholderObserver.with_args(dtype=torch.float, is_dynamic=True),
                          weight = MinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_tensor_symmetric))

Note: for the weight observer, only the torch.qint8 dtype is supported, and the qscheme can only be torch.per_tensor_symmetric or torch.per_channel_symmetric. For the activation observer, only the torch.float dtype is supported, and the compute_dtype can be torch.quint8 or torch.qint8.

Suggestion:

  1. For the weight observer, setting the qscheme to torch.per_channel_symmetric can give better accuracy.

  2. If your CPU device doesn't support VNNI (e.g., Skylake), setting the observer's reduce_range to True can give better accuracy, as sketched below.
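
For example, a dynamic qconfig that follows both suggestions might look like the following. This is a minimal sketch; applying reduce_range to the weight observer is shown only for illustration, and whether it helps depends on your hardware.

import torch
from torch.ao.quantization import PerChannelMinMaxObserver, PlaceholderObserver, QConfig
# Activations stay in float and are quantized dynamically at runtime.
# Weights use a per-channel symmetric observer; reduce_range=True is an optional
# setting for pre-VNNI CPUs.
dynamic_qconfig_per_channel = QConfig(activation=PlaceholderObserver.with_args(dtype=torch.float, is_dynamic=True),
                                      weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric, reduce_range=True))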

Prepare Model

prepared_model = prepare(user_model, dynamic_qconfig, example_inputs=example_inputs)

Convert to Dynamic Quantized Model and Deploy

# make sure example_inputs has the same size as the real input
convert_model = convert(prepared_model)
# Optional: convert the model to a traced model
#with torch.no_grad():
#    traced_model = torch.jit.trace(convert_model, example_inputs)
#    traced_model = torch.jit.freeze(traced_model)

# or save the model to deploy
# traced_model.save("quantized_model.pt")
# quantized_model = torch.jit.load("quantized_model.pt")
# quantized_model = torch.jit.freeze(quantized_model.eval())
# ...
# for inference 
y = convert_model(x)

Note: only the following ops support dynamic quantization:

  • torch.nn.Linear

  • torch.nn.LSTM

  • torch.nn.GRU

  • torch.nn.LSTMCell

  • torch.nn.RNNCell

  • torch.nn.GRUCell
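
As an end-to-end illustration, the following minimal sketch dynamically quantizes a toy model that only contains supported ops. The model, shapes, and names are made up for demonstration and are not part of the library.

import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.fc = torch.nn.Linear(64, 10)

    def forward(self, x):
        out, _ = self.lstm(x)          # torch.nn.LSTM supports dynamic quantization
        return self.fc(out[:, -1, :])  # torch.nn.Linear supports dynamic quantization

user_model = ToyModel().eval()
example_inputs = torch.randn(4, 16, 32)

dynamic_qconfig = ipex.quantization.default_dynamic_qconfig
prepared_model = prepare(user_model, dynamic_qconfig, example_inputs=example_inputs)
convert_model = convert(prepared_model)

# run inference with the dynamically quantized model
with torch.no_grad():
    y = convert_model(example_inputs)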