# Smooth Quant Recipe Tuning API (Prototype)

Smooth Quantization (SmoothQuant) is a popular method for improving the accuracy of INT8 quantization. The autotune API provided by Intel® Neural Compressor supports both automatic global alpha tuning and automatic layer-by-layer alpha tuning for the best INT8 accuracy.

SmoothQuant introduces an alpha parameter that sets the ratio by which inputs and weights are rescaled, balancing the quantization error between them. The SmoothQuant arguments are as follows:

| Argument | Default Value | Available Values | Comments |
|:---------|:--------------|:-----------------|:---------|
| `alpha` | `'auto'` | [0-1] / `'auto'` | value to balance input and weight quantization error |
| `init_alpha` | 0.5 | [0-1] / `'auto'` | initial value used to get the baseline quantization error for auto-tuning |
| `alpha_min` | 0.0 | [0-1] | minimum value of the auto-tuning alpha search space |
| `alpha_max` | 1.0 | [0-1] | maximum value of the auto-tuning alpha search space |
| `alpha_step` | 0.1 | [0-1] | step size of the auto-tuning alpha search space |
| `shared_criterion` | `"mean"` | `["min", "mean", "max"]` | criterion for the input LayerNorm op of a transformer block |
| `enable_blockwise_loss` | `False` | `[True, False]` | whether to enable block-wise auto-tuning |
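For context, alpha acts as the migration strength in the per-channel scale from the SmoothQuant paper (notation below follows the paper, not this API): with $X_j$ and $W_j$ denoting the j-th input channel of the activation and weight, a larger alpha shifts more quantization difficulty from activations onto weights.

$$
s_j = \frac{\max\left(|X_j|\right)^{\alpha}}{\max\left(|W_j|\right)^{1-\alpha}}, \qquad
Y = \left(X\,\operatorname{diag}(s)^{-1}\right)\left(\operatorname{diag}(s)\,W\right)
$$

As a concrete illustration, the arguments in the table can be supplied through the `recipes` field of `PostTrainingQuantConfig`. This is a minimal sketch: the toy model and random calibration data are placeholders, and nested key names such as `"smooth_quant_args"` and `"auto_alpha_args"` follow the table above but may differ across Neural Compressor versions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from neural_compressor import PostTrainingQuantConfig, quantization

# Toy FP32 model and random calibration data, only to keep the sketch self-contained.
fp32_model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 8),
)
calib_dataset = TensorDataset(torch.randn(32, 64), torch.zeros(32, dtype=torch.long))
calib_dataloader = DataLoader(calib_dataset, batch_size=8)

# SmoothQuant arguments from the table above, passed via the `recipes` field.
# Exact key names are an assumption based on the table and may vary by version.
recipes = {
    "smooth_quant": True,
    "smooth_quant_args": {
        "alpha": "auto",  # or a fixed float in [0, 1]
        "auto_alpha_args": {
            "init_alpha": 0.5,
            "alpha_min": 0.0,
            "alpha_max": 1.0,
            "alpha_step": 0.1,
            "shared_criterion": "mean",
            "enable_blockwise_loss": False,
        },
    },
}

conf = PostTrainingQuantConfig(recipes=recipes)
q_model = quantization.fit(fp32_model, conf, calib_dataloader=calib_dataloader)
```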

For LLM examples, please refer to the example.

Note: When defining dataloaders for calibration, please follow INC’s dataloader format.
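The sketch below shows the expected shape of a custom calibration dataloader, assuming the common INC convention: an iterable with a `batch_size` attribute that yields `(input, label)` pairs, where the label is ignored during calibration. A standard `torch.utils.data.DataLoader` (as used in the earlier sketch) already satisfies this shape.

```python
import torch


class SimpleCalibDataloader:
    """Minimal custom calibration dataloader sketch: an iterable with a
    `batch_size` attribute yielding (input, label) pairs. The label is
    unused during calibration, so a dummy value is sufficient."""

    def __init__(self, samples, batch_size=1):
        self.samples = samples
        self.batch_size = batch_size  # attribute read by the calibration loop

    def __iter__(self):
        for i in range(0, len(self.samples), self.batch_size):
            batch = torch.stack(self.samples[i : i + self.batch_size])
            yield batch, None  # (input, dummy label)


calib_dataloader = SimpleCalibDataloader(
    [torch.randn(64) for _ in range(32)], batch_size=8
)
```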