Accelerate AlexNet by Quantization with Intel® Extension for TensorFlow*
Background
Low-precision inference can significantly speed up model inference by converting an FP32 model to INT8 or BF16. Intel provides hardware technologies to accelerate these low-precision models on Intel CPUs and GPUs:
Intel® Deep Learning Boost: present in 2nd Generation Intel® Xeon® Scalable Processors and newer Xeon® processors, it accelerates INT8 and BF16 models in hardware.
Intel GPUs with INT8 support.
Intel® Neural Compressor simplifies the process of converting an FP32 model to INT8.
At the same time, Intel® Neural Compressor tunes the quantization method to reduce accuracy loss, which is a major blocker for low-precision inference.
Intel® Neural Compressor is released in the Intel® AI Analytics Toolkit and works with Intel® Optimization for TensorFlow*.
Please refer to the official website for detailed info and news: https://github.com/intel/neural-compressor
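For reference, here is a minimal sketch of post-training quantization with Intel® Neural Compressor's 2.x Python API. The dummy dataset, its shape, and the model path "./fp32_model.pb" are placeholders for illustration; in the actual sample, the model and calibration data come from the example's own pipeline:

from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.data import DataLoader, Datasets
from neural_compressor.quantization import fit

# Placeholder calibration data: a dummy dataset with an MNIST-like shape.
dataset = Datasets("tensorflow")["dummy"](shape=(100, 28, 28, 1))
calib_dataloader = DataLoader(framework="tensorflow", dataset=dataset)

# Quantize a saved FP32 model to INT8; the model path is a placeholder.
q_model = fit(model="./fp32_model.pb",
              conf=PostTrainingQuantConfig(),
              calib_dataloader=calib_dataloader)
q_model.save("./int8_model")  # write out the quantized model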
Introduction
With Intel® Extension for TensorFlow*, it is easy to quantize an FP32 model to INT8 and accelerate it on Intel CPUs and GPUs.
The example reuses the existing end-to-end example Intel® Neural Compressor Sample for TensorFlow, provided by Intel® Neural Compressor, to show a pipeline that builds a CNN model to recognize handwritten digits and speeds it up with quantization by Intel® Neural Compressor.
The original example is designed to run on Intel CPUs. After installing Intel® Extension for TensorFlow*, it can run on both Intel CPUs and GPUs.
All steps follow the existing example; no code changes are required.
Please read the example guide for detailed information.
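To illustrate why no code changes are needed: TensorFlow discovers and loads Intel® Extension for TensorFlow* automatically on import through the PluggableDevice mechanism, so the sample's existing scripts run as-is. A quick sanity check follows; what it prints depends on your hardware and on which package (CPU or GPU) is installed:

import tensorflow as tf  # the extension is loaded automatically on import

# The sample's existing code runs unchanged; with the GPU package installed,
# an Intel GPU additionally appears as a device of type "XPU".
print(tf.config.list_physical_devices())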
This example demonstrates how AI inference is accelerated by the following Intel AI technologies:
Intel® Deep Learning Boost on CPU
Intel GPUs with INT8 support
Intel® Neural Compressor
Intel® Extension for TensorFlow*
Hardware Environment
The example can run on Intel CPUs and GPUs via Intel® Extension for TensorFlow*.
CPU
This demo is recommended to run on 2nd Generation Intel® Xeon® Scalable Processors or newer, which include:
Intel® AVX-512 instructions to speed up training and inference of AI models.
Intel® Deep Learning Boost: Vector Neural Network Instructions (VNNI) to accelerate AI/DL inference with INT8/BF16 models.
With Intel® Deep Learning Boost, INT8 inference performance increases significantly. Without it, the speedup over FP32 may be only marginal (about 1.x times).
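To check whether a CPU supports Intel® Deep Learning Boost, you can look for the VNNI flag. Below is a small Linux-only sketch; has_vnni is a hypothetical helper written for illustration:

# Hypothetical helper (Linux only): detect the AVX-512 VNNI CPU flag.
def has_vnni(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as f:
        return "avx512_vnni" in f.read()

print("Intel® Deep Learning Boost (VNNI) available:", has_vnni())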
Intel® DevCloud
If your CPU does not support Intel® Deep Learning Boost, you can register for Intel® DevCloud and try this example for free on a newer Xeon® processor with Intel® Deep Learning Boost. To learn more about working with Intel® DevCloud, please refer to Intel® DevCloud.
GPU
Supported: Intel® Data Center GPU Flex Series.
For a local server, please install the GPU driver and oneAPI packages by referring to Intel GPU Software Installation.
For Intel® DevCloud, the GPU driver and oneAPI packages are already installed.
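Once the driver and oneAPI packages are in place, you can verify that TensorFlow sees the Intel GPU. With Intel® Extension for TensorFlow*, the GPU is registered as a device of type "XPU"; the snippet below is a minimal sketch:

import tensorflow as tf

# Intel GPUs show up as "XPU" devices when the GPU package is installed.
print(tf.config.list_physical_devices("XPU"))

# Ops can also be placed on the Intel GPU explicitly.
with tf.device("/XPU:0"):
    x = tf.random.normal([1024, 1024])
    y = tf.matmul(x, x)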
Running Environment
Set up Base Running Environment
Please refer to the example Intel® Neural Compressor Sample for TensorFlow to set up the running environment.
There are additional requirements:
Python must be version 3.9 or newer.
TensorFlow must be version 2.10.0 or newer.
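Both requirements can be verified from Python with a quick sanity check:

import sys
import tensorflow as tf

# Check the minimum versions required by this example.
assert sys.version_info >= (3, 9), "Python 3.9 or newer is required"
assert tuple(map(int, tf.__version__.split(".")[:2])) >= (2, 10), \
    "TensorFlow 2.10.0 or newer is required"
print("Python", sys.version.split()[0], "| TensorFlow", tf.__version__)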
Set up Intel® Extension for TensorFlow*
Please install Intel® Extension for TensorFlow* in the running environment:
CPU
python -m pip install --upgrade intel-extension-for-tensorflow[cpu]
GPU
python -m pip install --upgrade intel-extension-for-tensorflow[gpu]
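To confirm the installation, import the package (module name intel_extension_for_tensorflow) and print its version:

python -c "import intel_extension_for_tensorflow as itex; print(itex.__version__)"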
Execute
Please refer to the example Intel® Neural Compressor Sample for TensorFlow to execute the sample code and check the results.