Quantize Inception V3 by Intel® Extension for Tensorflow* on Intel® Xeon®¶
Background¶
Intel® Extension for Tensorflow* provides a quantization feature by cooperating with Intel® Neural Compressor and oneDNN Graph. Together they deliver better quantization results: higher performance while keeping accuracy loss under control.
Intel® Neural Compressor executes the calibration process and outputs a QDQ quantization model, in which Quantize and Dequantize layers are inserted to carry the information needed for quantization.
When Intel® Extension for Tensorflow* is used to run inference on this model, oneDNN Graph is called to quantize and optimize it. The quantized model is then executed by Intel® Extension for Tensorflow* and accelerated by Intel® Deep Learning Boost or Intel® Advanced Matrix Extensions on Intel® Xeon® processors.
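For reference, the calibration step above could be sketched roughly as follows, assuming the Intel® Neural Compressor 2.x post-training quantization API; the dummy dataset and the model/output paths are placeholders for illustration only:

from neural_compressor import quantization
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.data import DataLoader, Datasets

# Dummy calibration data only to illustrate the flow; a real run uses representative images
dataset = Datasets("tensorflow")["dummy"](shape=(32, 299, 299, 3))
calib_dataloader = DataLoader(framework="tensorflow", dataset=dataset)

conf = PostTrainingQuantConfig()  # default post-training static quantization
q_model = quantization.fit(model="./inception_v3_fp32",  # trained FP32 SavedModel (placeholder path)
                           conf=conf,
                           calib_dataloader=calib_dataloader)
q_model.save("./inception_v3_int8")  # writes the QDQ quantization model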
Introduction¶
The example shows an end-to-end pipeline:
Train an Inception V3 model with a flower photo dataset by transfer learning (a training sketch follows this list).
Execute the calibration by Intel® Neural Compressor.
Quantize and accelerate the inference with Intel® Extension for Tensorflow* on CPU.
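The training step could look roughly like the following: a minimal transfer-learning sketch assuming a Keras workflow and a five-class flower photos dataset (the dataset input pipeline is omitted here):

import tensorflow as tf

# Load Inception V3 pre-trained on ImageNet, without the classification head
base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=(299, 299, 3))
base.trainable = False  # freeze the backbone for transfer learning

# Add a new classification head (5 classes is an assumption for the flower photos dataset)
inputs = tf.keras.Input(shape=(299, 299, 3))
x = base(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=...)  # dataset pipeline omitted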
Configuration¶
Intel® Extension for Tensorflow* Version¶
Please install Intel® Extension for Tensorflow* version 1.1.0 or newer for this feature.
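To confirm the installed version, one option (assuming the package was installed from PyPI under the name intel-extension-for-tensorflow) is:

# Print the installed Intel® Extension for Tensorflow* version (requires Python 3.8+)
from importlib.metadata import version

print(version("intel-extension-for-tensorflow"))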
Enable oneDNN Graph¶
By default, oneDNN Graph is enabled in Intel® Extension for Tensorflow* on CPU for INT8 models.
It can also be enabled explicitly:
import os
os.environ["ITEX_ONEDNN_GRAPH"] = "1"
Disable Constant Folding Function¶
We need to disable the Constant Folding function in two stages:
When Intel® Neural Compressor creates the QDQ quantization model.
When Intel® Extension for Tensorflow* executes the oneDNN Graph quantization path.
There are two methods to configure this:
a. Environment Variable
export ITEX_TF_CONSTANT_FOLDING=0
b. Python API
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

# Turn off the constant folding graph rewrite for the session
infer_config = tf.compat.v1.ConfigProto()
infer_config.graph_options.rewrite_options.constant_folding = rewriter_config_pb2.RewriterConfig.OFF
session = tf.compat.v1.Session(config=infer_config)
tf.compat.v1.keras.backend.set_session(session)
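Putting the switches together, inference on the quantized model could be set up roughly as below; the model path is a placeholder and the loading call assumes the QDQ model was saved in TensorFlow SavedModel format:

import os

# Set both switches before TensorFlow is imported so they take effect for the whole process
os.environ["ITEX_ONEDNN_GRAPH"] = "1"         # enable the oneDNN Graph INT8 path
os.environ["ITEX_TF_CONSTANT_FOLDING"] = "0"  # disable constant folding

import tensorflow as tf

# Load the QDQ model produced by Intel® Neural Compressor (path and format are placeholders)
model = tf.saved_model.load("./inception_v3_int8")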
Hardware Environment¶
CPU¶
It’s recommended to run the example on an Intel® Xeon® processor that supports Intel® Deep Learning Boost or Intel® Advanced Matrix Extensions.
Without these hardware features for AI workloads, the performance speedup over FP32 will be limited, for example only about 1.x.
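On Linux, one quick way to check for these features (an assumption; this helper is not part of the example scripts) is to look for the avx512_vnni and amx_tile flags in /proc/cpuinfo:

# Check /proc/cpuinfo for the AI acceleration flags
with open("/proc/cpuinfo") as f:
    flags = f.read()

print("Intel® Deep Learning Boost (VNNI):", "avx512_vnni" in flags)
print("Intel® Advanced Matrix Extensions (AMX):", "amx_tile" in flags)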
Intel® DevCloud¶
If you do not have a CPU that supports Intel® Deep Learning Boost or Intel® Advanced Matrix Extensions, you can register for Intel® DevCloud and try this example for free on a newer Xeon with Intel® Deep Learning Boost. To learn more about working with Intel® DevCloud, please refer to Intel® DevCloud.
Running Environment¶
Install Python 3.7–3.10, which is supported by Intel® Extension for Tensorflow*.
Create the running environment env_itex:
bash pip_set_env.sh
Activate it:
source env_itex/bin/activate
Startup Jupyter Notebook¶
Start it:
bash run_jupyter.sh
...
http://xxx.yyy.com:8888/xxxxxxxx
Open the link output by Jupyter Notebook in Chrome.
Choose and open quantize_inception_v3.ipynb in Jupyter Notebook.
Set the kernel to “env_itex”.
Execute the code following the guide in the notebook.
License¶
Code samples are licensed under the MIT license.