TensorFlow

Introduction

neural_compressor.tensorflow provides an integrated API for applying quantization to various TensorFlow model formats, such as pb, saved_model, and keras. The range of supported models includes, but is not limited to, CV models, NLP models, and large language models.

In terms of ease of use, Neural Compressor is committed to providing flexible and scalable user interfaces. quantize_model is designed to provide a fast and straightforward quantization experience, while autotune offers an advanced option for reducing accuracy loss during quantization.

API for TensorFlow

Intel(R) Neural Compressor provides quantize_model and autotune as the main interfaces for the supported algorithms on the TensorFlow framework.

quantize_model

The design philosophy of the quantize_model interface is ease of use. It requires only a minimal set of parameters (model, quant_config, calib_dataloader, and calib_iteration) and offers a straightforward way to quantize a TF model in one shot.

def quantize_model(
    model: Union[str, tf.keras.Model, BaseModel],
    quant_config: Union[BaseConfig, list],
    calib_dataloader: Callable = None,
    calib_iteration: int = 100,
    calib_func: Callable = None,
):

model should be a string giving the model’s location, a Keras model object, or an instance of the INC TF model wrapper class.

quant_config is either a StaticQuantConfig object or a list containing a SmoothQuantConfig and a StaticQuantConfig. It indicates which algorithm should be used and which quantization rules should be applied.

calib_dataloader loads the data samples for the calibration phase. In most cases, it can be a subset of the evaluation dataset.

calib_iteration sets how many iterations the calibration process runs.

calib_func is a substitute for calib_dataloader when the built-in calibration function of INC does not work for model inference.

Here is a simple example of using the quantize_model interface with a dummy calibration dataloader and the default StaticQuantConfig:

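from neural_compressor.tensorflow import StaticQuantConfig, quantize_model
from neural_compressor.tensorflow.utils import DummyDataset

# Dummy calibration data with shape (100, 32, 32, 3), plus labels
dataset = DummyDataset(shape=(100, 32, 32, 3), label=True)
# MyDataLoader is a placeholder for a user-defined dataloader class
calib_dataloader = MyDataLoader(dataset=dataset)
# Default static quantization settings
quant_config = StaticQuantConfig()

# Quantize the fp32 model stored at the given path
qmodel = quantize_model("fp32_model.pb", quant_config, calib_dataloader)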
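
MyDataLoader in the example above is not defined there; it stands for a loader supplied by the user. A minimal sketch of such a class follows, assuming the calibration loop only needs an iterable that yields (inputs, labels) batches and exposes a batch_size attribute. The class body and the plain-list stand-in for the dataset are illustrative assumptions, not part of the neural_compressor API.

```python
class MyDataLoader:
    """Hypothetical user-defined loader over an indexable dataset of (input, label) pairs."""

    def __init__(self, dataset, batch_size=1):
        self.dataset = dataset
        self.batch_size = batch_size  # assumed to be read by the calibration loop

    def __iter__(self):
        total = len(self.dataset)
        for start in range(0, total, self.batch_size):
            stop = min(start + self.batch_size, total)
            samples = [self.dataset[i] for i in range(start, stop)]
            inputs = [sample[0] for sample in samples]
            labels = [sample[1] for sample in samples]
            yield inputs, labels


# Stand-in for a real dataset: 10 samples of 4 features each, all labelled 0.
toy_dataset = [([0.0] * 4, 0) for _ in range(10)]
loader = MyDataLoader(toy_dataset, batch_size=4)
batch_sizes = [len(inputs) for inputs, _ in loader]
print(batch_sizes)  # [4, 4, 2]
```

In practice the batches would usually be converted to arrays of the shape and dtype that the model's input tensors expect before being yielded.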

autotune

The autotune interface, on the other hand, provides greater flexibility and power. It’s particularly useful when accuracy is a critical factor. If the initial quantization does not meet the accuracy-loss tolerance, autotune iteratively tries the quantization rules defined in the tune_config.

Just like quantize_model, autotune requires model, calib_dataloader, and calib_iteration. In addition, eval_fn and eval_args are used to build the evaluation process.

def autotune(
    model: Union[str, tf.keras.Model, BaseModel],
    tune_config: TuningConfig,
    eval_fn: Callable,
    eval_args: Optional[Tuple[Any]] = None,
    calib_dataloader: Callable = None,
    calib_iteration: int = 100,
    calib_func: Callable = None,
) -> Optional[BaseModel]:

model should be a string giving the model’s location, a Keras model object, or an instance of the INC TF model wrapper class.

tune_config is a TuningConfig object that contains multiple quantization rules.

eval_fn is the evaluation function that measures the accuracy of a model.

eval_args holds the supplemental arguments required by the defined evaluation function.

calib_dataloader loads the data samples for the calibration phase. In most cases, it can be a subset of the evaluation dataset.

calib_iteration sets how many iterations the calibration process runs.

calib_func is a substitute for calib_dataloader when the built-in calibration function of INC does not work for model inference.

Here is a simple example of using the autotune interface with different quantization rules defined by a list of StaticQuantConfig objects:

from neural_compressor.common.base_tuning import TuningConfig
from neural_compressor.tensorflow import StaticQuantConfig, autotune

# MyDataLoader, Dataset, and eval_acc_fn are user-defined placeholders
calib_dataloader = MyDataLoader(dataset=Dataset())
# Two candidate rule sets: symmetric and asymmetric weights/activations
custom_tune_config = TuningConfig(
    config_set=[
        StaticQuantConfig(weight_sym=True, act_sym=True),
        StaticQuantConfig(weight_sym=False, act_sym=False),
    ]
)
# The return type is Optional[BaseModel], so check the result for None
best_model = autotune(
    model="baseline_model",
    tune_config=custom_tune_config,
    eval_fn=eval_acc_fn,
    calib_dataloader=calib_dataloader,
)
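
The try-and-evaluate behaviour described for autotune can be sketched in plain Python. This is a simplified illustration, not the library's implementation: tune_sketch, fake_quantize, the dict-based "model", and the 1% tolerance are all hypothetical. The only behaviour taken from the text is that candidate rules are tried in turn and judged against an accuracy-loss tolerance using eval_fn and eval_args.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def tune_sketch(
    model: Any,
    config_set: Sequence[Any],
    quantize: Callable[[Any, Any], Any],
    eval_fn: Callable[..., float],
    eval_args: Optional[Tuple[Any, ...]] = None,
    tolerable_loss: float = 0.01,
) -> Optional[Any]:
    """Return the first quantized model whose relative accuracy loss is within tolerance."""
    args = eval_args or ()
    baseline = eval_fn(model, *args)  # accuracy of the original model
    for config in config_set:
        candidate = quantize(model, config)
        accuracy = eval_fn(candidate, *args)
        if baseline - accuracy <= tolerable_loss * baseline:
            return candidate
    return None  # no rule met the tolerance


# Toy stand-ins: a "model" is a dict carrying a fixed accuracy.
def fake_quantize(model, config):
    drop = {"sym": 0.05, "asym": 0.005}[config]
    return {"name": config, "acc": model["acc"] - drop}


def eval_acc_fn(model):
    return model["acc"]


fp32_model = {"name": "fp32", "acc": 0.80}
best = tune_sketch(fp32_model, ["sym", "asym"], fake_quantize, eval_acc_fn)
print(best["name"])  # asym
```

The sketch stops at the first acceptable candidate; a real tuner may apply a different stopping or ranking policy.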

Support Matrix

Quantization Scheme

| Framework | Backend Library | Symmetric Quantization | Asymmetric Quantization |
|---|---|---|---|
| TensorFlow | oneDNN | Activation (int8/uint8), Weight (int8) | - |
| Keras | ITEX | Activation (int8/uint8), Weight (int8) | - |

Quantization Approaches

The supported quantization methods for TensorFlow and Keras are listed below:

| Types | Quantization | Dataset Requirements | Framework | Backend |
|---|---|---|---|---|
| Post-Training Static Quantization (PTQ) | weights and activations | calibration | Keras | ITEX |
| Post-Training Static Quantization (PTQ) | weights and activations | calibration | TensorFlow | TensorFlow/Intel TensorFlow |
| Smooth Quantization (SQ) | weights | calibration | TensorFlow | TensorFlow/Intel TensorFlow |
| Mixed Precision (MP) | weights and activations | NA | TensorFlow | TensorFlow/Intel TensorFlow |


Post Training Static Quantization

The min/max ranges of weights and activations are collected offline on a so-called calibration dataset. This dataset should represent the data distribution of the unseen inference data. The calibration process runs on the original fp32 model and dumps all the tensor distributions for the Scale and ZeroPoint calculations. Usually, 100 samples are enough for calibration.
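
As a worked illustration of the Scale and ZeroPoint calculation, the following sketch tracks min/max over a limited number of calibration batches and derives symmetric int8 parameters. The formulas are the standard ones for symmetric quantization; the function names are hypothetical, and the library's internal calibration may differ in detail.

```python
def observe_min_max(batches, calib_iteration=100):
    """Track the running min/max over at most calib_iteration batches."""
    lo, hi = float("inf"), float("-inf")
    for step, batch in enumerate(batches):
        if step >= calib_iteration:
            break
        lo, hi = min(lo, min(batch)), max(hi, max(batch))
    return lo, hi


def symmetric_int8_params(lo, hi):
    """Symmetric scheme: zero point fixed at 0, integer range [-127, 127]."""
    scale = max(abs(lo), abs(hi)) / 127.0
    return scale, 0


def quantize(x, scale, zero_point, qmin=-127, qmax=127):
    """Map a float onto the integer grid and clamp it to [qmin, qmax]."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


batches = [[-1.0, 0.5], [0.25, 2.54], [9.0, -9.0]]  # third batch is never read
lo, hi = observe_min_max(batches, calib_iteration=2)
scale, zero_point = symmetric_int8_params(lo, hi)
print(lo, hi)                             # -1.0 2.54
print(quantize(2.54, scale, zero_point))  # 127
print(quantize(-1.0, scale, zero_point))  # -50
```

Values outside the observed range, such as 9.0 here, saturate at the clamp limits, which is why the calibration data should resemble the inference data.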

Refer to the PTQ Guide for detailed information.

Smooth Quantization

Smooth Quantization (SQ) is an advanced quantization technique designed to optimize model performance while maintaining high accuracy. Unlike traditional quantization methods, which can lead to significant accuracy loss, SQ takes a more refined approach by balancing the scales of activations and weights.
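
The balancing idea can be illustrated with the per-channel smoothing factor from the SmoothQuant method, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). The sketch below is pure Python with made-up numbers; alpha = 0.5 and all names are assumptions for illustration, not the library's internals. Dividing the activations by s and multiplying the weights by s leaves the product unchanged while shrinking the activation outlier.

```python
def smoothing_factors(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel factor s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [a ** alpha / w ** (1.0 - alpha) for a, w in zip(act_absmax, weight_absmax)]


def matvec(x, w):
    """y_k = sum_j x[j] * w[j][k] for a single activation row x."""
    return [sum(x[j] * w[j][k] for j in range(len(x))) for k in range(len(w[0]))]


x = [16.0, 0.5]                   # channel 0 carries an activation outlier
w = [[0.25, -0.25], [1.0, 0.5]]   # rows are input channels

act_absmax = [abs(v) for v in x]
weight_absmax = [max(abs(v) for v in row) for row in w]
s = smoothing_factors(act_absmax, weight_absmax)

# Migrate scale from activations to weights, channel by channel.
x_smooth = [x[j] / s[j] for j in range(len(x))]
w_smooth = [[w[j][k] * s[j] for k in range(len(w[0]))] for j in range(len(w))]

print(matvec(x, w))                    # [4.5, -3.75]
print(max(abs(v) for v in x_smooth))   # 2.0
```

After smoothing, the largest activation drops from 16.0 to 2.0 while the largest weight grows from 1.0 to 2.0, so both tensors cover a similar range and are easier to quantize.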

Refer to the SQ Guide for detailed information.

Mixed Precision

Mixed Precision (MP) is enabled together with Post Training Static Quantization. If BF16 is supported on the machine, the matching operators are converted automatically.
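
For context on what a BF16 conversion means numerically: bfloat16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of a float32. The sketch below emulates that by zeroing the low 16 bits of the float32 encoding. This is a simplification (hardware typically rounds to nearest rather than truncating), and the function name is hypothetical.

```python
import struct


def to_bf16_truncate(value: float) -> float:
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


print(to_bf16_truncate(1.0))      # 1.0
print(to_bf16_truncate(3.14159))  # 3.140625
```

The dynamic range matches float32 because the exponent is untouched; only precision is reduced, to roughly two or three significant decimal digits.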

Backend and Device

Intel(R) Neural Compressor supports TF GPU with ITEX-XPU. It checks whether ITEX-XPU has been installed and, if so, automatically runs the model on GPU.

| Framework | Backend | Backend Library | Backend Value | Supported Device (cpu as default) |
|---|---|---|---|---|
| TensorFlow | TensorFlow | OneDNN | "default" | cpu |
| TensorFlow | ITEX | OneDNN | "itex" | cpu or gpu |