# How to Support a New Data Type, Like Int4, with a Few Line Changes
## Introduction
To enable accuracy-aware tuning with various frameworks, Intel® Neural Compressor introduced the framework YAML, which unifies the configuration format for quantization and describes the capabilities of a specific framework. Before explaining how to add a new data type, let's first introduce the overall process, from defining the operator behavior in the YAML to invoking it from the adaptor. The diagram below illustrates all the relevant steps, with additional details provided for each annotated step.
> Note: The `adaptor` is a layer that abstracts various frameworks supported by Intel® Neural Compressor.
```mermaid
sequenceDiagram
    autonumber
    Strategy ->> Adaptor: query framework capability
    Adaptor ->> Strategy: Parse the framework YAML and return capability
    Strategy ->> Strategy: Build tuning space
    loop Traverse tuning space
        Strategy ->> Adaptor: generate next tuning cfg
        Adaptor ->> Adaptor: calibrate and quantize model based on tuning config
    end
```
1. Strategy: Drives the overall tuning process and utilizes `adaptor.query_fw_capability` to query the framework's capabilities.
2. Adaptor: Parses the framework YAML, filters some corner cases, and constructs the framework capability. This includes the capabilities of each operator and other model-related information.
3. Strategy: Constructs the tuning space based on the framework capability and initiates the tuning process.
4. Strategy: Generates the tuning configuration for each operator of the model using the tuning space constructed in the previous step, specifying the desired tuning process.
5. Adaptor: Invokes the specific kernels for calibration and quantization based on the tuning configuration.
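Taken together, these steps form a simple query-then-iterate loop. The sketch below summarizes it in Python; apart from `query_fw_capability`, which appears above, the method names are illustrative stand-ins rather than Intel® Neural Compressor's actual internals.

```python
# Schematic sketch of steps 1-5; method names other than
# `query_fw_capability` are illustrative, not the library's real internals.
def tune(strategy, adaptor, fp32_model, calib_dataloader):
    capability = adaptor.query_fw_capability(fp32_model)       # steps 1-2
    tuning_space = strategy.build_tuning_space(capability)     # step 3
    for tune_cfg in strategy.next_tune_cfg(tuning_space):      # step 4
        q_model = adaptor.quantize(tune_cfg, fp32_model,       # step 5
                                   calib_dataloader)
        if strategy.meets_accuracy_goal(q_model):
            return q_model
    return None
```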
The following section provides an example of extending the PyTorch `Conv2d` operator to include support for 4-bit quantization.
## Define the Quantization Ability of the Specific Operator
The first step in adding a new data type for a specific operator to Intel® Neural Compressor is to extend the operator's capabilities and include them in the framework YAML. The capabilities should include the quantized data types and quantization schemes of the activation and the weight (if applicable). The following table describes each field in detail:
| Field name | Options | Description |
| --- | --- | --- |
| Data Type (`dtype`) | `uint4`, `int4` | The quantization data type being added. This example uses 4-bit types, where `uint4` represents an unsigned 4-bit integer and `int4` represents a signed 4-bit integer. |
| Quantization Scheme (`scheme`) | `sym`, `asym` | The quantization scheme used for the new data type. `sym` represents symmetric quantization; `asym` represents asymmetric quantization. |
| Quantization Granularity (`granularity`) | `per_channel`, `per_tensor` | The granularity at which quantization is applied. `per_channel` means quantization is applied independently per channel; `per_tensor` means quantization is applied to the entire tensor as a whole. |
| Calibration Algorithm (`algorithm`) | `minmax`, `kl` | The calibration algorithm used for the new data type. `minmax` represents the minimum-maximum algorithm; `kl` represents the Kullback-Leibler divergence algorithm. |
To add 4-bit quantization for `Conv2d` in the PyTorch backend, we can modify `neural_compressor/adaptor/pytorch_cpu.yaml` as follows:
```diff
  ...
  fp32: ['*'] # `*` means all op types.
+ int4: {
+     'static': {
+         'Conv2d': {
+             'weight': {
+                 'dtype': ['int4'],
+                 'scheme': ['sym'],
+                 'granularity': ['per_channel'],
+                 'algorithm': ['minmax']},
+             'activation': {
+                 'dtype': ['uint4'],
+                 'scheme': ['sym'],
+                 'granularity': ['per_tensor'],
+                 'algorithm': ['minmax']},
+         },
+     }
+ }
  int8: &1_11_capabilities {
      'static': &cap_s8_1_11 {
          'Conv1d': &cap_s8_1_11_Conv1d {
  ...
```
This configuration states that the PyTorch `Conv2d` operator can quantize weights to `int4` using the `torch.per_channel_symmetric` quantization scheme, with `minmax` as the supported calibration algorithm. Additionally, the operator can quantize activations to `uint4` using the `torch.per_tensor_symmetric` quantization scheme, again with `minmax` as the supported calibration algorithm.
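To make the structure concrete, the new YAML fragment parses into a nested Python dictionary that the adaptor walks to build per-operator capabilities. A quick way to inspect it, as a standalone sketch using PyYAML (the variable names are ours):

```python
import yaml

# The int4 fragment added above, as plain YAML.
snippet = """
int4:
  static:
    Conv2d:
      weight:
        dtype: ['int4']
        scheme: ['sym']
        granularity: ['per_channel']
        algorithm: ['minmax']
      activation:
        dtype: ['uint4']
        scheme: ['sym']
        granularity: ['per_tensor']
        algorithm: ['minmax']
"""
capability = yaml.safe_load(snippet)
print(capability['int4']['static']['Conv2d']['weight']['dtype'])  # ['int4']
```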
## Invoke the Operator Kernel According to the Tuning Configuration
One of the tuning configurations generated by the strategy for `Conv2d` looks like the following:
```python
tune_cfg = {
    'op': {
        ('conv', 'Conv2d'): {
            'weight': {
                'dtype': 'int4',
                'algorithm': 'minmax',
                'granularity': 'per_channel',
                'scheme': 'sym'
            },
            'activation': {
                'dtype': 'uint4',
                'quant_mode': 'static',
                'algorithm': 'kl',
                'granularity': 'per_tensor',
                'scheme': 'sym'
            }
        },
        # ... entries for other operators
    },
    # ... other tuning fields
}
```
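Before a kernel can be selected, the string fields in this configuration have to be translated into PyTorch's quantization constants. A simplified mapping, shown below for illustration only (the adaptor's actual code handles more cases):

```python
import torch

# Illustrative mapping from (scheme, granularity) strings in the tuning
# config to PyTorch quantization schemes; not the adaptor's exact code.
QSCHEME_MAP = {
    ('sym', 'per_channel'): torch.per_channel_symmetric,
    ('sym', 'per_tensor'): torch.per_tensor_symmetric,
    ('asym', 'per_channel'): torch.per_channel_affine,
    ('asym', 'per_tensor'): torch.per_tensor_affine,
}

weight_cfg = {'scheme': 'sym', 'granularity': 'per_channel'}
qscheme = QSCHEME_MAP[(weight_cfg['scheme'], weight_cfg['granularity'])]
print(qscheme)  # torch.per_channel_symmetric
```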
Now, we can invoke the specified kernel according to the above configuration in the adaptor's `quantize` function. Because PyTorch does not currently have native support for 4-bit quantization of `Conv2d`, we simulate it numerically by specifying the value range of the given data type in the observer. We implemented this with the following change:
```diff
  return observer.with_args(qscheme=qscheme,
                            dtype=torch_dtype,
                            reduce_range=(REDUCE_RANGE and scheme == 'asym'),
+                           quant_min=quant_min,
+                           quant_max=quant_max
                            )
```
> Note: For the PyTorch backend, this simulation only supports N-bit quantization, where N is an integer between 1 and 7.
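To see what this change enables end to end, the standalone sketch below simulates int4 quantization with a stock PyTorch observer by clamping the quantized range to [-8, 7]. It illustrates the technique rather than reproducing the adaptor's code, and the tensor and variable names are ours.

```python
import torch
from torch.ao.quantization.observer import MinMaxObserver

# Build an observer whose quantized range is restricted to int4 ([-8, 7]),
# even though the values are stored in an 8-bit container dtype.
observer = MinMaxObserver.with_args(
    dtype=torch.qint8,
    qscheme=torch.per_tensor_symmetric,
    quant_min=-8,   # int4 lower bound
    quant_max=7,    # int4 upper bound
)()

x = torch.randn(4, 8)
observer(x)  # collect min/max statistics (calibration)
scale, zero_point = observer.calculate_qparams()

# Fake-quantize: round onto the int4 grid, then dequantize back to float.
x_q = torch.fake_quantize_per_tensor_affine(
    x, scale.item(), int(zero_point.item()), quant_min=-8, quant_max=7
)
print((x - x_q).abs().max())  # the simulated int4 quantization error
```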
## Use the New Data Type
Once the new data type has been added to Intel® Neural Compressor, it can be used in the same way as any other data type within the framework. Below is an example of specifying that all `Conv2d` operators should utilize 4-bit quantization:
```python
from neural_compressor.config import PostTrainingQuantConfig

op_type_dict = {
    "Conv2d": {
        "weight": {
            "dtype": ["int4"],
        },
        "activation": {
            "dtype": ["uint4"],
        },
    },
}
conf = PostTrainingQuantConfig(op_type_dict=op_type_dict)
...
```
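From here, the configuration plugs into the usual post-training quantization flow. A minimal sketch, assuming the standard `fit` entry point; `model` and `calib_dataloader` are placeholders you would supply:

```python
from neural_compressor.quantization import fit

# `model` is your FP32 torch.nn.Module and `calib_dataloader` yields
# calibration batches; both are placeholders here.
q_model = fit(model=model, conf=conf, calib_dataloader=calib_dataloader)
```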
With this code, all `Conv2d` operators will be quantized to 4-bit, with weights using `int4` and activations using `uint4`.
## Summary
This document outlines the process of adding support for a new data type, such as int4, to Intel® Neural Compressor with minimal code changes. It provides instructions and code examples for defining the data type's quantization capabilities, invoking the operator kernel, and using the new data type within the framework. By following these steps, users can extend Intel® Neural Compressor's functionality to accommodate new data types and incorporate them into their quantization workflows.