neural_compressor.onnxrt.algorithms.weight_only.awq
Module Contents
Functions
- awq_quantize: Quantize the model with the Activation-aware Weight Quantization (AWQ) method.
- apply_awq_on_model: Apply Activation-aware Weight Quantization (AWQ) to an ONNX model.
- neural_compressor.onnxrt.algorithms.weight_only.awq.awq_quantize(model: onnx.ModelProto | neural_compressor.onnxrt.utils.onnx_model.ONNXModel | pathlib.Path | str, data_reader: neural_compressor.onnxrt.quantization.calibrate.CalibrationDataReader, weight_config: dict = {}, num_bits: int = 4, group_size: int = 32, scheme: str = 'asym', enable_auto_scale: bool = True, enable_mse_search: bool = True, accuracy_level: int = 0, providers: List[str] = ['CPUExecutionProvider']) → onnx.ModelProto [source]
Quantize the model with the Activation-aware Weight Quantization (AWQ) method.
- Parameters:
model (Union[onnx.ModelProto, ONNXModel, Path, str]) – onnx model.
data_reader (CalibrationDataReader) – data_reader for calibration.
weight_config (dict, optional) –
quantization config. For example:
weight_config = {
    '(fc2, "MatMul")': {
        'weight_dtype': 'int',
        'weight_bits': 4,
        'weight_group_size': 32,
        'weight_sym': True,
        'accuracy_level': 0,
    }
}
Defaults to {}.
num_bits (int, optional) – number of bits used to represent weights. Defaults to 4.
group_size (int, optional) – size of weight groups. Defaults to 32.
scheme (str, optional) – quantization scheme, "sym" or "asym", indicating whether weights are quantized symmetrically. Defaults to "asym".
enable_auto_scale (bool, optional) – whether to search for best scales based on activation distribution. Defaults to True.
enable_mse_search (bool, optional) – whether to search for the best clipping ratio over the range [0.91, 1.0] in steps of 0.01. Defaults to True.
accuracy_level (int, optional) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel). Defaults to 0.
providers (list, optional) – providers to use. Defaults to [“CPUExecutionProvider”].
- Returns:
quantized onnx model.
- Return type:
onnx.ModelProto
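A minimal sketch of how awq_quantize might be invoked. The model path ("model.onnx"), input name ("input_ids"), and batch shape are hypothetical; the data reader below is a duck-typed stand-in that only implements get_next()/rewind() — in practice you would subclass neural_compressor.onnxrt.quantization.calibrate.CalibrationDataReader. The weight_config key is assumed to be a (node_name, op_type) tuple, following the docstring's (fc2, "MatMul") example; check the library source for the exact key form.

```python
import numpy as np

class DummyDataReader:
    """Duck-typed stand-in for CalibrationDataReader (assumption: the
    reader must expose get_next() and rewind())."""

    def __init__(self, batches):
        self.batches = batches
        self.index = 0

    def get_next(self):
        # Yield one {input_name: ndarray} feed per call, then None.
        if self.index >= len(self.batches):
            return None
        feed = self.batches[self.index]
        self.index += 1
        return feed

    def rewind(self):
        self.index = 0

# Per-node config mirroring the docstring's example entry for a MatMul
# node named "fc2" (tuple key is an assumption, see lead-in).
weight_config = {
    ("fc2", "MatMul"): {
        "weight_dtype": "int",
        "weight_bits": 4,
        "weight_group_size": 32,
        "weight_sym": True,
        "accuracy_level": 0,
    }
}

if __name__ == "__main__":
    # Requires neural-compressor and a real ONNX model on disk;
    # "model.onnx" and "input_ids" are hypothetical.
    from neural_compressor.onnxrt.algorithms.weight_only.awq import awq_quantize

    reader = DummyDataReader([{"input_ids": np.ones((1, 128), dtype=np.int64)}])
    qmodel = awq_quantize(
        "model.onnx",
        data_reader=reader,
        weight_config=weight_config,
        num_bits=4,
        group_size=32,
        scheme="asym",
    )
```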
- neural_compressor.onnxrt.algorithms.weight_only.awq.apply_awq_on_model(model: onnx.ModelProto | neural_compressor.onnxrt.utils.onnx_model.ONNXModel | pathlib.Path | str, quant_config: dict, calibration_data_reader: neural_compressor.onnxrt.quantization.calibrate.CalibrationDataReader) → onnx.ModelProto [source]
Apply Activation-aware Weight Quantization (AWQ) to an ONNX model.
- Parameters:
model (Union[onnx.ModelProto, ONNXModel, Path, str]) – onnx model.
quant_config (dict) – quantization config.
calibration_data_reader (CalibrationDataReader) – data_reader for calibration.
- Returns:
quantized onnx model.
- Return type:
onnx.ModelProto
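A sketch of calling apply_awq_on_model. The exact quant_config schema is not documented on this page, so the keys below are assumptions that simply mirror awq_quantize's parameters; the model path and the calibration reader are placeholders that must be supplied by the caller.

```python
# Assumed quant_config keys, mirroring awq_quantize's parameters
# (the real schema may differ; see the library source).
quant_config = {
    "num_bits": 4,
    "group_size": 32,
    "scheme": "asym",
    "enable_auto_scale": True,
    "enable_mse_search": True,
    "accuracy_level": 0,
    "providers": ["CPUExecutionProvider"],
}

def quantize(model_path, calibration_reader):
    """Quantize the model at model_path with AWQ.

    calibration_reader must implement get_next()/rewind() like
    CalibrationDataReader; running this requires neural-compressor
    and a real ONNX model.
    """
    from neural_compressor.onnxrt.algorithms.weight_only.awq import apply_awq_on_model

    return apply_awq_on_model(model_path, quant_config, calibration_reader)
```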