neural_compressor.onnxrt.algorithms.weight_only.gptq

Module Contents

Functions

gptq_quantize(model, data_reader[, weight_config, ...])

Quantize the model with the GPTQ method.

apply_gptq_on_model(→ onnx.ModelProto)

Apply GPTQ on an ONNX model.

neural_compressor.onnxrt.algorithms.weight_only.gptq.gptq_quantize(model: onnx.ModelProto | neural_compressor.onnxrt.utils.onnx_model.ONNXModel | pathlib.Path | str, data_reader: neural_compressor.onnxrt.quantization.calibrate.CalibrationDataReader, weight_config: dict = {}, num_bits: int = 4, group_size: int = 32, scheme: str = 'asym', percdamp: float = 0.01, blocksize: int = 128, actorder: bool = False, mse: bool = False, perchannel: bool = True, accuracy_level: int = 0, providers: List[str] = ['CPUExecutionProvider'], return_modelproto: bool = True)[source]

Quantize the model with the GPTQ method.

Parameters:
  • model (Union[onnx.ModelProto, ONNXModel, Path, str]) – ONNX model.

  • data_reader (CalibrationDataReader) – data_reader for calibration.

  • weight_config (dict, optional) –

    quantization config. For example:

        weight_config = {
            '(fc2, "MatMul")': {
                'weight_dtype': 'int',
                'weight_bits': 4,
                'weight_group_size': 32,
                'weight_sym': True,
                'accuracy_level': 0,
            }
        }

    Defaults to {}.

  • num_bits (int, optional) – number of bits used to represent weights. Defaults to 4.

  • group_size (int, optional) – size of weight groups. Defaults to 32.

  • scheme (str, optional) – quantization scheme, "sym" or "asym", indicating whether weights are quantized symmetrically. Defaults to "asym".

  • percdamp (float, optional) – percentage of the average of the Hessian's diagonal values, added to the Hessian's diagonal to improve numerical stability (see the damping sketch after this parameter list). Defaults to 0.01.

  • blocksize (int, optional) – number of weight columns processed per GPTQ quantization block. Defaults to 128.

  • actorder (bool, optional) – whether to sort the Hessian's diagonal values to rearrange the channel-wise quantization order. Defaults to False.

  • mse (bool, optional) – whether to compute scale and zero point by minimizing the MSE error. Defaults to False.

  • perchannel (bool, optional) – whether to quantize weights per-channel. Defaults to True.

  • accuracy_level (int, optional) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel). Defaults to 0.

  • providers (list, optional) – execution providers to use. Defaults to ["CPUExecutionProvider"].

  • return_modelproto (bool, optional) – whether to return onnx.ModelProto. Set to False for layer-wise quantization. Defaults to True.
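For reference, the damping step that percdamp controls in standard GPTQ looks like the following sketch (numpy notation; H and the variable names are illustrative, not taken from this module):

    import numpy as np

    # H: Hessian approximation accumulated from calibration activations
    damp = percdamp * np.mean(np.diag(H))   # percdamp defaults to 0.01
    H += damp * np.eye(H.shape[0])          # stabilize the inverse used by GPTQ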

Returns:

quantized ONNX model.

Return type:

onnx.ModelProto
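A minimal end-to-end sketch of calling gptq_quantize. The model path, input name, tensor shape, and the random calibration batches are all hypothetical; only gptq_quantize and CalibrationDataReader come from the library, and rewind() is included on the assumption that the reader interface requires it:

    import numpy as np
    from neural_compressor.onnxrt.quantization.calibrate import CalibrationDataReader
    from neural_compressor.onnxrt.algorithms.weight_only.gptq import gptq_quantize

    class RandomDataReader(CalibrationDataReader):
        """Feeds a few random batches for calibration (hypothetical input name/shape)."""

        def __init__(self):
            self._iter = iter(
                [{"input_ids": np.random.randint(0, 100, (1, 32), dtype=np.int64)}
                 for _ in range(8)]
            )

        def get_next(self):
            return next(self._iter, None)

        def rewind(self):
            self.__init__()

    quantized_model = gptq_quantize(
        model="model.onnx",              # hypothetical path
        data_reader=RandomDataReader(),
        num_bits=4,
        group_size=32,
        scheme="asym",
    )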

neural_compressor.onnxrt.algorithms.weight_only.gptq.apply_gptq_on_model(model: onnx.ModelProto | neural_compressor.onnxrt.utils.onnx_model.ONNXModel | pathlib.Path | str, quant_config: dict, calibration_data_reader: neural_compressor.onnxrt.quantization.calibrate.CalibrationDataReader) onnx.ModelProto[source]

Apply GPTQ on an ONNX model.

Parameters:
  • model (Union[onnx.ModelProto, ONNXModel, Path, str]) – ONNX model.

  • quant_config (dict) – quantization config.

  • calibration_data_reader (CalibrationDataReader) – data_reader for calibration.

Returns:

quantized ONNX model.

Return type:

onnx.ModelProto
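A hedged usage sketch for apply_gptq_on_model, reusing the RandomDataReader defined above. The per-node key layout of quant_config is an assumption here, mirroring the weight_config example earlier; consult the quantization API for the exact format it produces:

    from neural_compressor.onnxrt.algorithms.weight_only.gptq import apply_gptq_on_model

    # Per-node GPTQ settings; the key layout is assumed to mirror the
    # weight_config example above.
    quant_config = {
        '(fc2, "MatMul")': {
            'weight_dtype': 'int',
            'weight_bits': 4,
            'weight_group_size': 32,
            'weight_sym': False,
            'accuracy_level': 0,
        }
    }

    quantized_model = apply_gptq_on_model(
        model="model.onnx",                           # hypothetical path
        quant_config=quant_config,
        calibration_data_reader=RandomDataReader(),   # reader defined above
    )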