# Smooth Quant Recipe Tuning API (Prototype)
SmoothQuant is a popular method for improving the accuracy of INT8 quantization. Intel® Neural Compressor provides an autotune API that supports both automatic global alpha tuning and automatic layer-by-layer alpha tuning for the best INT8 accuracy.
SmoothQuant introduces a hyperparameter alpha that balances how much of the quantization difficulty is migrated from activations (inputs) to weights, thereby reducing quantization error. The SmoothQuant arguments are listed below:
Argument | Default Value | Available Values | Comments |
---|---|---|---|
alpha | 'auto' | [0-1] / 'auto' | Value to balance input and weight quantization error. |
init_alpha | 0.5 | [0-1] / 'auto' | Value used to get the baseline quantization error for auto-tuning. |
alpha_min | 0.0 | [0-1] | Minimum value of the auto-tuning alpha search space. |
alpha_max | 1.0 | [0-1] | Maximum value of the auto-tuning alpha search space. |
alpha_step | 0.1 | [0-1] | Step size of the auto-tuning alpha search space. |
shared_criterion | "mean" | ["min", "mean", "max"] | Criterion for the input LayerNorm op of a transformer block. |
enable_blockwise_loss | False | [True, False] | Whether to enable block-wise auto-tuning. |
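
As a rough sketch (not taken from this document), these knobs are typically passed through the `recipes` option of the 2.x `PostTrainingQuantConfig`; the exact nesting under `smooth_quant_args` / `auto_alpha_args` and the toy model/dataloader below are assumptions, so please verify them against the LLM example referenced below.

```python
# Minimal sketch, assuming the INC 2.x PyTorch API (PostTrainingQuantConfig + quantization.fit)
# and that the table's arguments are nested under "smooth_quant_args"/"auto_alpha_args".
import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import PostTrainingQuantConfig, quantization

# Toy stand-ins for illustration; real usage passes an LLM and a calibration dataloader.
model = torch.nn.Sequential(torch.nn.LayerNorm(64), torch.nn.Linear(64, 64))
calib_dataloader = DataLoader(TensorDataset(torch.randn(8, 64), torch.zeros(8)), batch_size=1)

recipes = {
    "smooth_quant": True,
    "smooth_quant_args": {
        "alpha": "auto",                     # fixed value in [0, 1], or "auto" to search
        "auto_alpha_args": {
            "init_alpha": 0.5,               # baseline alpha for the initial quantization error
            "alpha_min": 0.0,                # lower bound of the alpha search space
            "alpha_max": 1.0,                # upper bound of the alpha search space
            "alpha_step": 0.1,               # step size of the alpha search space
            "shared_criterion": "mean",      # how ops sharing an input LayerNorm agree on alpha
            "enable_blockwise_loss": False,  # switch on block-wise auto-tuning
        },
    },
}

conf = PostTrainingQuantConfig(recipes=recipes)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
```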
For LLM examples, please refer to the example.
Note: When defining dataloaders for calibration, please follow Intel® Neural Compressor's dataloader format.
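
As a rough illustration of that format (an assumption based on the general convention of an iterable that yields `(input, label)` pairs and exposes a `batch_size` attribute, not a definition from this document):

```python
# Sketch of a custom calibration dataloader in the style INC generally expects:
# an iterable that yields (input, label) pairs and exposes a batch_size attribute.
# The token-id tensors below are hypothetical placeholders for LLM calibration data.
import torch

class CalibDataloader:
    def __init__(self, samples, batch_size=1):
        self.samples = samples
        self.batch_size = batch_size  # read by the calibration loop

    def __iter__(self):
        for input_ids in self.samples:
            yield input_ids, None     # labels are unused during calibration

calib_dataloader = CalibDataloader(
    [torch.randint(0, 32000, (1, 512)) for _ in range(8)]
)
```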