neural_compressor.algorithm.smooth_quant

Build SmoothQuant algorithm class.

Module Contents

Classes

SmoothQuant

Fake input channel quantization.

class neural_compressor.algorithm.smooth_quant.SmoothQuant(alpha=0.5)[source]

Fake input channel quantization.

For more details, please refer to:

[1] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

[2] SPIQ: Data-Free Per-Channel Static Input Quantization

For the torch backend, only layers whose smooth scale can be absorbed are handled; other layers will be supported later. For the onnx backend, a MUL layer is inserted before conv/linear layers; op fusing and kernel support will be added in the future.
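As a rough illustration of the idea (not the library's actual implementation), the SmoothQuant paper computes a per-input-channel smoothing scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha), where alpha matches the class's alpha=0.5 parameter. The function name and arrays below are hypothetical, for exposition only:

```python
import numpy as np

def smooth_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-8):
    """Sketch of the per-channel smoothing scale from the SmoothQuant paper.

    act_absmax    -- per-channel max |activation|, shape (in_channels,)
    weight_absmax -- per-channel max |weight|, shape (in_channels,)
    alpha         -- migration strength; 0.5 mirrors SmoothQuant(alpha=0.5)
    """
    act = np.maximum(act_absmax, eps)   # guard against zero channels
    wgt = np.maximum(weight_absmax, eps)
    return (act ** alpha) / (wgt ** (1.0 - alpha))

# Activations are divided by s and the following layer's weights multiplied
# by s, so the matmul output is unchanged: (X / s) @ (diag(s) @ W) == X @ W.
# This migrates quantization difficulty from activation outliers to weights.
act = np.array([10.0, 2.0, 0.5])
wgt = np.array([0.5, 1.0, 2.0])
s = smooth_scales(act, wgt)
```

With alpha=0.5 the scale reduces to sqrt(max|X_j| / max|W_j|), balancing the dynamic ranges of activations and weights per channel. This is why, as noted above, the torch backend only handles layers where the scale can be absorbed into adjacent weights; when it cannot, the onnx backend instead materializes the division as an explicit MUL op before the conv/linear layer.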