neural_compressor.adaptor.ox_utils.smooth_quant
SmoothQuant for the onnxrt adaptor.
Classes

- ORTSmoothQuant: Fake input channel quantization.

Functions

- get_quant_dequant_output: Get loss between fp32 output and QDQ output.
- make_sub_graph: Build a model with the specific node.
- quant_dequant_data: Quantize and then dequantize data.
Module Contents
- neural_compressor.adaptor.ox_utils.smooth_quant.get_quant_dequant_output(model, input_data, output_data, reduce_range, backend)[source]
Get loss between fp32 output and QDQ output.
- Parameters:
model (object) – ONNX model to run
input_data (numpy.ndarray) – fp32 input data
output_data (numpy.ndarray) – fp32 output data
reduce_range (bool) – whether to quantize with the reduced 7-bit range
backend (str) – execution provider to run the model with
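The sketch below illustrates the quantity this helper reports: the squared error between a node's fp32 output and its quantize-dequantize (QDQ) counterpart. It is a minimal numpy illustration, not the actual implementation (which builds a QDQ sub-graph via make_sub_graph and runs it through onnxruntime); the symmetric int8 rounding rule here is an assumption.

```python
import numpy as np

# Stand-in for a node's fp32 output.
fp32_output = np.random.randn(4, 8).astype(np.float32)

# Symmetric int8 fake quantization of the output.
scale = np.abs(fp32_output).max() / 127
q = np.clip(np.round(fp32_output / scale), -128, 127)
qdq_output = (q * scale).astype(np.float32)

# Loss between fp32 output and QDQ output.
loss = float(((fp32_output - qdq_output) ** 2).sum())
```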
- neural_compressor.adaptor.ox_utils.smooth_quant.make_sub_graph(node, inits, input_data, output_data, reduce_range, opset, ir_version)[source]
Build a model with the specific node.
- Parameters:
node (object) – node to build the model around
inits (list) – initializer inputs of this node
input_data (numpy.ndarray) – fp32 input data
output_data (numpy.ndarray) – fp32 output data
reduce_range (bool) – whether to quantize with the reduced 7-bit range
opset (object) – opset of the model
ir_version (object) – ir_version of the model
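For reference, a single-node ONNX model can be assembled with the public onnx.helper API as sketched below. make_sub_graph does something similar internally (plus the QDQ machinery), but this is an independent illustration, not its actual code; all names and shapes are made up.

```python
import numpy as np
import onnx
from onnx import helper, numpy_helper, TensorProto

# The node to wrap, plus its initializer input.
node = helper.make_node("MatMul", inputs=["X", "W"], outputs=["Y"])
weight = numpy_helper.from_array(np.random.randn(8, 4).astype(np.float32), name="W")

graph = helper.make_graph(
    [node],
    "single_node_graph",
    inputs=[helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 8])],
    outputs=[helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 4])],
    initializer=[weight],
)

# Pin opset and ir_version, mirroring the opset/ir_version parameters above.
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
model.ir_version = 7
onnx.checker.check_model(model)
```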
- neural_compressor.adaptor.ox_utils.smooth_quant.quant_dequant_data(data, reduce_range=False, qType=3, scheme='sym')[source]
Quantize and then dequantize data.
- Parameters:
data (numpy.ndarray) – data to quantize and dequantize
reduce_range (bool) – whether to quantize with the reduced 7-bit range
qType (int) – quantized data type (the default 3 is onnx.TensorProto.INT8)
scheme (str) – 'sym' for symmetric or 'asym' for asymmetric quantization
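A minimal numpy sketch of the quantize-then-dequantize roundtrip for int8 data follows. It mirrors the intent of quant_dequant_data but is not its implementation; the scale/zero-point formulas and the 7-bit bounds used for reduce_range are assumptions.

```python
import numpy as np

def quant_dequant_sketch(data, reduce_range=False, scheme="sym"):
    # Assumed integer ranges: 7-bit when reduce_range, else full int8.
    qmin, qmax = (-64, 63) if reduce_range else (-128, 127)
    if scheme == "sym":
        scale = np.abs(data).max() / qmax
        zero_point = 0
    else:  # 'asym': map [min, max] onto [qmin, qmax]
        scale = (data.max() - data.min()) / (qmax - qmin)
        zero_point = qmin - int(np.round(data.min() / scale))
    q = np.clip(np.round(data / scale) + zero_point, qmin, qmax)
    return ((q - zero_point) * scale).astype(np.float32)

x = np.random.randn(16).astype(np.float32)
print(np.abs(x - quant_dequant_sketch(x)).max())  # roundtrip error
```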
- class neural_compressor.adaptor.ox_utils.smooth_quant.ORTSmoothQuant(model, dataloader, reduce_range=False, backend='CPUExecutionProvider')[source]
Fake input channel quantization.
For more details please refer to:

[1] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
[2] SPIQ: Data-Free Per-Channel Static Input Quantization

Only inplace mode is supported, meaning the model weights are modified directly; call the recover function to restore the original weights if needed.
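A hedged usage sketch follows, reusing the single-node `model` built in the make_sub_graph illustration above. The constructor signature matches this page, but the `transform(alpha=...)` and `recover()` calls are assumed method names (the description only confirms that a recover function exists), and the calibration dataloader interface, an iterable of (input, label) batches, is also an assumption.

```python
import numpy as np
from neural_compressor.adaptor.ox_utils.smooth_quant import ORTSmoothQuant

# Assumed dataloader shape: iterable of (input_dict, label) calibration pairs.
calib_data = [({"X": np.random.randn(1, 8).astype(np.float32)}, None)
              for _ in range(4)]

sq = ORTSmoothQuant(model, calib_data, reduce_range=False,
                    backend="CPUExecutionProvider")
smoothed = sq.transform(alpha=0.5)  # assumed API: smooth the weights in place
# ... quantize `smoothed` with the onnxrt adaptor ...
sq.recover()  # restore the original, pre-smoothing weights if needed
```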