TensorFlow Quantization

  1. Introduction

  2. Usage
    2.1 Without Accuracy Aware Tuning
    2.2 With Accuracy Aware Tuning
    2.3 Specify Quantization Rules

  3. Examples

Introduction

neural_compressor.tensorflow supports quantizing both TensorFlow and Keras models, with or without accuracy aware tuning.

For detailed quantization fundamentals, please refer to the Quantization document.

Usage

Without Accuracy Aware Tuning

This means users can leverage Intel(R) Neural Compressor to directly generate a fully quantized model without accuracy aware tuning. It is the user's responsibility to ensure the accuracy of the quantized model meets expectations.

# main.py

import tensorflow as tf

# Original code
model = tf.keras.applications.resnet50.ResNet50(weights="imagenet")
val_dataset = ...
val_dataloader = MyDataloader(dataset=val_dataset)

# Quantization code
from neural_compressor.tensorflow import quantize_model, StaticQuantConfig

quant_config = StaticQuantConfig()
qmodel = quantize_model(
    model=model,
    quant_config=quant_config,
    calib_dataloader=val_dataloader,
)
qmodel.save("./output")
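
MyDataloader above is user-defined. A minimal sketch of one possible implementation, assuming the calibration step only needs an iterable that yields (inputs, labels) batches and exposes a batch_size attribute (both are assumptions; check the dataloader requirements of your neural_compressor version):

class MyDataloader:
    def __init__(self, dataset, batch_size=1):
        # Assumption: dataset is a tf.data.Dataset of (input, label) pairs.
        self.batch_size = batch_size
        self._batched = dataset.batch(batch_size)

    def __iter__(self):
        # Yield (inputs, labels) batches for calibration.
        for inputs, labels in self._batched:
            yield inputs, labels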

With Accuracy Aware Tuning

This means users can leverage an advanced feature of Intel(R) Neural Compressor to tune for the quantized model with the best accuracy while keeping good performance. Users should provide eval_fn and eval_args.

# main.py

import tensorflow as tf

# Original code
model = tf.keras.applications.resnet50.ResNet50(weights="imagenet")
val_dataset = ...
val_dataloader = MyDataloader(dataset=val_dataset)


def eval_acc_fn(model) -> float:
    ...
    return acc


# Quantization code
from neural_compressor.common.base_tuning import TuningConfig
from neural_compressor.tensorflow import StaticQuantConfig, autotune

# it's also supported to define custom_tune_config as:
# TuningConfig(StaticQuantConfig(weight_sym=[True, False], act_sym=[True, False]))
custom_tune_config = TuningConfig(
    config_set=[
        StaticQuantConfig(weight_sym=True, act_sym=True),
        StaticQuantConfig(weight_sym=False, act_sym=False),
    ]
)
best_model = autotune(
    model=model,
    tune_config=custom_tune_config,
    eval_fn=eval_acc_fn,
    calib_dataloader=val_dataloader,
)
best_model.save("./output")
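
The body of eval_acc_fn is elided above. A minimal sketch, assuming the candidate model passed in by the tuner is callable like a Keras model and that val_dataloader yields (inputs, labels) batches; top-1 accuracy is just one possible metric:

def eval_acc_fn(model) -> float:
    # Accumulate top-1 accuracy over the validation set.
    metric = tf.keras.metrics.SparseCategoricalAccuracy()
    for inputs, labels in val_dataloader:
        metric.update_state(labels, model(inputs))
    return float(metric.result())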

Specify Quantization Rules

Intel(R) Neural Compressor supports specifying quantization rules by operator name or operator type. Users can set local in a dict or use the set_local method of the config class to achieve this.

  1. Example of setting local from a dict

quant_config = {
    "static_quant": {
        "global": {
            "weight_dtype": "int8",
            "weight_sym": True,
            "weight_granularity": "per_tensor",
            "act_dtype": "int8",
            "act_sym": True,
            "act_granularity": "per_tensor",
        },
        "local": {
            "conv1": {
                "weight_dtype": "fp32",
                "act_dtype": "fp32",
            }
        },
    }
}
config = StaticQuantConfig.from_dict(quant_config)

  2. Example of using set_local

quant_config = StaticQuantConfig()
conv2d_config = StaticQuantConfig(
    weight_dtype="fp32",
    act_dtype="fp32",
)
quant_config.set_local("conv1", conv2d_config)
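
Either form of the config can then be passed to quantize_model just as in the first example, reusing the model and val_dataloader defined earlier:

qmodel = quantize_model(
    model=model,
    quant_config=quant_config,
    calib_dataloader=val_dataloader,
)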

Examples

Users can also refer to the examples to learn how to quantize a TensorFlow model with neural_compressor.tensorflow.