neural_compressor.onnxrt.algorithms.weight_only.utility

Module Contents

Functions

make_matmul_weight_only_node(node, weight_shape, ...)

Build MatMulFpQ4/MatMulNBits node.

prepare_inputs(model, data_reader, providers)

Prepare inputs for weight only quantization.

pad_tensor(weight, group_size, k_blocks)

Pad the tensor's rows so that the row count is divisible by group_size.

quant_tensor(data[, num_bits, group_size, scheme, ...])

Quantize tensor per group.

qdq_tensor(data[, num_bits, group_size, scheme, ...])

Quant dequant tensor per group.

neural_compressor.onnxrt.algorithms.weight_only.utility.make_matmul_weight_only_node(node: onnx.NodeProto, weight_shape: tuple, num_bits: int, group_size: int, k_blocks: int, q_weight: numpy.array, scale: numpy.array, zero_point: numpy.array, accuracy_level: int = 0)[source]

Build MatMulFpQ4/MatMulNBits node.

Parameters:
  • node (onnx.NodeProto) – original matmul node

  • weight_shape (tuple) – original weight shape

  • num_bits (int) – number of bits used to represent weights.

  • group_size (int) – how many elements share one scale/zp

  • k_blocks (int) – number of blocks

  • q_weight (np.array) – quantized weight

  • scale (np.array) – scale

  • zero_point (np.array) – zero point

  • accuracy_level (int, optional) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel). Defaults to 0.

Returns:

matmul_weight_only_node: MatMulFpQ4 or MatMulNBits node
new_inits: initializers of the new node

Return type:

matmul_weight_only_node
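
A hedged usage sketch follows: the weight shape, the layout expected for q_weight, and the return order of quant_tensor are assumptions made for illustration only, not guarantees of the library API.

import numpy as np
from onnx import helper
from neural_compressor.onnxrt.algorithms.weight_only.utility import (
    make_matmul_weight_only_node,
    pad_tensor,
    quant_tensor,
)

# Assumed shapes/layouts for illustration; adapt to the real model.
k, n = 128, 64
group_size = 32
k_blocks = (k + group_size - 1) // group_size

weight = np.random.randn(k, n).astype(np.float32)
weight = pad_tensor(weight, group_size, k_blocks)
# Assumes quant_tensor returns (quantized weight, scale, zero point).
q_weight, scale, zero_point = quant_tensor(weight, num_bits=4, group_size=group_size)

matmul = helper.make_node("MatMul", ["input", "weight"], ["output"], name="matmul_0")
new_node, new_inits = make_matmul_weight_only_node(
    node=matmul,
    weight_shape=(k, n),
    num_bits=4,
    group_size=group_size,
    k_blocks=k_blocks,
    q_weight=q_weight,
    scale=scale,
    zero_point=zero_point,
    accuracy_level=0,
)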

neural_compressor.onnxrt.algorithms.weight_only.utility.prepare_inputs(model, data_reader, providers)[source]

Prepare inputs for weight only quantization.

Parameters:
  • model (ModelProto or ONNXModel) – onnx model.

  • data_reader (CalibrationDataReader) – a calibration data reader.

  • providers (list) – providers to use.

Returns:

inputs: prepared inputs
so: session options

Return type:

inputs
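
A minimal usage sketch, assuming the (inputs, session options) pair described above; the model path, input name, and input shape are placeholders.

import numpy as np
import onnx
from onnxruntime.quantization import CalibrationDataReader
from neural_compressor.onnxrt.algorithms.weight_only.utility import prepare_inputs

class DummyDataReader(CalibrationDataReader):
    # Feeds a single random calibration sample; name/shape are placeholders.
    def __init__(self):
        self._iter = iter([{"input": np.random.randn(1, 128).astype(np.float32)}])

    def get_next(self):
        return next(self._iter, None)

model = onnx.load("model.onnx")  # placeholder path
inputs, so = prepare_inputs(model, DummyDataReader(), providers=["CPUExecutionProvider"])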

neural_compressor.onnxrt.algorithms.weight_only.utility.pad_tensor(weight, group_size, k_blocks)[source]

Pad the tensor's rows so that the row count is divisible by group_size.

Parameters:
  • weight (array) – weight

  • group_size (int) – how many elements share one scale/zp

  • k_blocks (int) – number of blocks

Returns:

padded weight

Return type:

weight
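
An illustrative numpy sketch of the padding this function performs (not the library implementation): the rows (K dimension) are zero-padded so that k_blocks whole blocks of group_size elements are filled.

import numpy as np

def pad_rows(weight: np.ndarray, group_size: int, k_blocks: int) -> np.ndarray:
    # Zero-pad the row (K) dimension up to k_blocks * group_size rows.
    pad_len = k_blocks * group_size - weight.shape[0]
    if pad_len > 0:
        weight = np.pad(weight, ((0, pad_len), (0, 0)), "constant")
    return weight

w = np.ones((70, 8), dtype=np.float32)
padded = pad_rows(w, group_size=32, k_blocks=3)  # -> shape (96, 8)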

neural_compressor.onnxrt.algorithms.weight_only.utility.quant_tensor(data: numpy.array, num_bits: int = 4, group_size: int = 32, scheme: str = 'asym', dtype: str = 'int', ratio: float = 1.0)[source]

Quantize tensor per group.

Parameters:
  • data (np.array) – input weight

  • num_bits (int, optional) – number of bits used to represent weights. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.

  • scheme (str, optional) – quantization scheme. Defaults to “asym”.

  • dtype (str, optional) – data type. Defaults to “int”.

  • ratio (float, optional) – percentile of clip. Defaults to 1.0.

Returns:

output: quantized weight
scale: scale
zero_point: zero point

Return type:

output
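
A minimal numpy sketch of per-group asymmetric integer quantization with the semantics described above (one scale/zero-point per group of group_size elements). It illustrates the math only; it is not the library's implementation or exact return layout.

import numpy as np

def quant_per_group(data, num_bits=4, group_size=32, ratio=1.0):
    # Assumes the trailing dimension is a multiple of group_size (see
    # pad_tensor above); each run of group_size consecutive elements in a
    # row forms one group with its own scale and zero point.
    org_shape = data.shape
    groups = data.reshape(-1, group_size)
    maxq = 2 ** num_bits - 1
    rmin = np.minimum(groups.min(axis=1, keepdims=True) * ratio, 0)
    rmax = np.maximum(groups.max(axis=1, keepdims=True) * ratio, 0)
    scale = np.where(rmax == rmin, 1.0, (rmax - rmin) / maxq)
    zero_point = np.round(-rmin / scale)
    q = np.clip(np.round(groups / scale) + zero_point, 0, maxq)
    return q.reshape(org_shape), scale, zero_point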

neural_compressor.onnxrt.algorithms.weight_only.utility.qdq_tensor(data: numpy.array, num_bits: int = 4, group_size: int = 32, scheme: str = 'asym', dtype: str = 'int', ratio: float = 1.0)[source]

Quant dequant tensor per group.

Parameters:
  • data (np.array) – input weight

  • num_bits (int, optional) – number of bits used to represent weights. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.

  • scheme (str, optional) – quantization scheme. Defaults to “asym”.

  • dtype (str, optional) – data type. Defaults to “int”.

  • ratio (float, optional) – percentile of clip. Defaults to 1.0.

Returns:

quant-dequant weight

Return type:

output
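
Continuing the quant_per_group sketch above, the quant-dequant round trip maps the integers back to float with the per-group scale/zero-point, yielding the quant-dequant weight described here; again a sketch of the math, not the library implementation.

def qdq_per_group(data, num_bits=4, group_size=32, ratio=1.0):
    # Quantize per group (sketch above), then dequantize back to float.
    q, scale, zero_point = quant_per_group(data, num_bits, group_size, ratio)
    groups = q.reshape(-1, group_size)
    dq = (groups - zero_point) * scale
    return dq.reshape(data.shape)

w = np.random.randn(64, 64).astype(np.float32)
err = np.abs(w - qdq_per_group(w)).mean()  # mean absolute quantization error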