Export
=====
1. [Introduction](#introduction)
2. [Supported Framework Model Matrix](#supported-framework-model-matrix)
3. [Examples](#examples)
4. [Appendix](#appendix)
# Introduction
Open Neural Network Exchange (ONNX) is an open standard format for representing machine learning models. Exporting FP32 PyTorch/TensorFlow models to ONNX has become common and straightforward. With Intel Neural Compressor, we also want to export INT8 models to the ONNX format so that they can be used across more frameworks.
Here we briefly introduce our export API for PyTorch FP32/INT8 models. The INT8 ONNX model is not exported directly from the INT8 PyTorch model. Instead, the FP32 ONNX model is first obtained with the mature `torch.onnx.export` API and then quantized. To keep most of the ONNX quantization process consistent with PyTorch, we reuse three key pieces of information from the Neural Compressor model when performing ONNX quantization:
- Quantized operations: Only operations that were quantized in PyTorch are quantized during ONNX quantization.
- Scale info: Scale information is taken from the PyTorch quantization process.
- Weights from quantization-aware training (QAT): For quantization-aware training, the updated weights are passed to the ONNX model.
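
As a rough illustration of how the first two pieces of information could steer ONNX quantization, the sketch below filters a list of ONNX nodes down to those that were quantized in PyTorch and attaches the PyTorch-collected scale to each. All names here (`OnnxNode`, `plan_onnx_quantization`, the node names, and the scale values) are hypothetical and do not reflect Neural Compressor's internal data structures.

```python
from dataclasses import dataclass


@dataclass
class OnnxNode:
    """Hypothetical stand-in for a node of the exported FP32 ONNX graph."""
    name: str
    op_type: str


def plan_onnx_quantization(onnx_nodes, quantized_in_pytorch, pytorch_scales):
    """Return {node name: scale} for the nodes that should be quantized in ONNX.

    A node is selected only if the matching operation was quantized in
    PyTorch, and its scale is reused rather than re-calibrated.
    """
    plan = {}
    for node in onnx_nodes:
        if node.name in quantized_in_pytorch:
            plan[node.name] = pytorch_scales[node.name]
    return plan


# Assumed example data: only "fc1" was quantized on the PyTorch side.
nodes = [
    OnnxNode("fc1", "MatMul"),
    OnnxNode("act", "Relu"),
    OnnxNode("fc2", "MatMul"),
]
quantized_ops = {"fc1"}
scales = {"fc1": 0.02}

plan = plan_onnx_quantization(nodes, quantized_ops, scales)
print(plan)  # {'fc1': 0.02}
```

In this sketch `fc2` stays in FP32 because it was not quantized in PyTorch, even though its operator type is the same as that of `fc1`.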
| Recipe | QDQ | QLinear |
|---|---|---|
| QDQ_OP_FP32_BIAS | QuantizeLinear → DequantizeLinear → MatMul → Add | QuantizeLinear → MatMulIntegerToFloat → Add |
| QDQ_OP_INT32_BIAS | QuantizeLinear → MatMulInteger → Add → Cast → Mul | QuantizeLinear → MatMulInteger → Add → Cast → Mul |
| QDQ_OP_FP32_BIAS_QDQ | QuantizeLinear → DequantizeLinear → MatMul → Add → QuantizeLinear → DequantizeLinear | QuantizeLinear → MatMulIntegerToFloat → Add → QuantizeLinear → DequantizeLinear |
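
To make the operator sequences in the table more concrete, here is a minimal pure-Python sketch that evaluates a single output element under two of the recipes: `QDQ_OP_FP32_BIAS` in QDQ format and `QDQ_OP_INT32_BIAS`. It assumes symmetric INT8 quantization with a zero point of 0, and it assumes that the INT32 bias is quantized with the product of the input and weight scales. The helper names and all numeric values are illustrative, and this is one reading of the table rather than the exporter's actual code.

```python
def quantize_linear(values, scale, zero_point=0, qmin=-128, qmax=127):
    """ONNX QuantizeLinear semantics: saturate(round(x / scale) + zero_point).

    Python's round() rounds halves to even, matching the ONNX operator.
    """
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize_linear(qvalues, scale, zero_point=0):
    """ONNX DequantizeLinear semantics: (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in qvalues]


def dot(a, b):
    """Dot product of two equal-length vectors (one MatMul output element)."""
    return sum(x * y for x, y in zip(a, b))


# Illustrative data: one input row, one weight column, one bias value.
x = [0.5, -1.0, 0.25]
w = [0.2, 0.4, -0.6]
bias = 0.1
x_scale, w_scale = 0.01, 0.005  # assumed scales, e.g. collected in PyTorch

x_q = quantize_linear(x, x_scale)
w_q = quantize_linear(w, w_scale)

# QDQ_OP_FP32_BIAS (QDQ): QuantizeLinear -> DequantizeLinear -> MatMul -> Add
x_dq = dequantize_linear(x_q, x_scale)
w_dq = dequantize_linear(w_q, w_scale)
y_fp32_bias = dot(x_dq, w_dq) + bias

# QDQ_OP_INT32_BIAS: QuantizeLinear -> MatMulInteger -> Add -> Cast -> Mul
acc = dot(x_q, w_q)                          # integer accumulation
bias_q = round(bias / (x_scale * w_scale))   # assumed INT32 bias quantization
y_int32_bias = float(acc + bias_q) * (x_scale * w_scale)

print(y_fp32_bias, y_int32_bias)  # both are approximately -0.35
```

With these values the two recipes agree up to floating-point error. In general they can differ slightly, because the INT32 recipe rounds the bias to an integer before the final rescaling.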