neural_compressor.onnxrt.algorithms.weight_only.utility
Module Contents
Functions
- make_matmul_weight_only_node – Build MatMulFpQ4/MatMulNBits node.
- prepare_inputs – Prepare inputs for weight-only quantization.
- pad_tensor – Pad tensor rows so that the row count is divisible by group_size.
- quant_tensor – Quantize tensor per group.
- qdq_tensor – Quantize-dequantize tensor per group.
- neural_compressor.onnxrt.algorithms.weight_only.utility.make_matmul_weight_only_node(node: onnx.NodeProto, weight_shape: tuple, num_bits: int, group_size: int, k_blocks: int, q_weight: numpy.array, scale: numpy.array, zero_point: numpy.array, accuracy_level: int = 0)[source]
Build MatMulFpQ4/MatMulNBits node.
- Parameters:
node (onnx.NodeProto) – original matmul node
weight_shape (tuple) – original weight shape
num_bits (int) – number of bits used to represent weights.
group_size (int) – how many elements share one scale/zp
k_blocks (int) – block number
q_weight (np.array) – quantized weight
scale (np.array) – scale
zero_point (np.array) – zero point
accuracy_level (int, optional) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel). Defaults to 0.
- Returns:
matmul_weight_only_node – MatMulFpQ4 or MatMulNBits node
new_inits – initializers of the new node
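A minimal usage sketch, assuming q_weight, scale, and zero_point were already produced by quant_tensor below; the replace_matmul wrapper, its hard-coded 4-bit/group-size-32 settings, and the surrounding graph surgery are illustrative assumptions, not the library's exact flow:

```python
import onnx
from neural_compressor.onnxrt.algorithms.weight_only.utility import (
    make_matmul_weight_only_node,
)

def replace_matmul(model: onnx.ModelProto, node: onnx.NodeProto,
                   weight_shape, q_weight, scale, zero_point):
    """Hypothetical helper: swap an FP32 MatMul for its weight-only twin."""
    group_size = 32
    k_blocks = (weight_shape[0] + group_size - 1) // group_size
    new_node, new_inits = make_matmul_weight_only_node(
        node=node,
        weight_shape=weight_shape,
        num_bits=4,
        group_size=group_size,
        k_blocks=k_blocks,
        q_weight=q_weight,
        scale=scale,
        zero_point=zero_point,
        accuracy_level=0,  # 0 = unset; see the levels listed above
    )
    model.graph.initializer.extend(new_inits)   # register the new weight/scale/zp tensors
    model.graph.node.remove(node)               # drop the FP32 MatMul
    model.graph.node.append(new_node)           # insert MatMulFpQ4/MatMulNBits
    return model
```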
- neural_compressor.onnxrt.algorithms.weight_only.utility.prepare_inputs(model, data_reader, providers)[source]
Prepare inputs for weight-only quantization.
- Parameters:
model (ModelProto or ONNXModel) – onnx model.
data_reader (CalibrationDataReader) – a calibration data reader.
providers (list) – providers to use.
- Returns:
inputs – prepared inputs
so – session options
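A hedged sketch of driving this helper with a calibration reader; RandomDataReader, the input name, its shape, and the model path are illustrative stand-ins for a real calibration setup:

```python
import numpy as np
import onnx
from onnxruntime.quantization import CalibrationDataReader
from neural_compressor.onnxrt.algorithms.weight_only.utility import prepare_inputs

class RandomDataReader(CalibrationDataReader):
    """Feeds a few random batches; stand-in for a real calibration dataset."""
    def __init__(self, name="input", shape=(1, 32), n_batches=4):
        self._batches = iter(
            [{name: np.random.rand(*shape).astype(np.float32)} for _ in range(n_batches)]
        )

    def get_next(self):
        return next(self._batches, None)

model = onnx.load("model.onnx")  # placeholder path
inputs, so = prepare_inputs(model, RandomDataReader(), ["CPUExecutionProvider"])
# `inputs` feeds activation collection; `so` holds the session options to use
# when building the onnxruntime InferenceSession.
```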
- neural_compressor.onnxrt.algorithms.weight_only.utility.pad_tensor(weight, group_size, k_blocks)[source]
Pad tensor rows so that the row count is divisible by group_size.
- Parameters:
weight (array) – weight
group_size (int) – how many elements share one scale/zp
k_blocks (int) – the number of blocks
- Returns:
weight – padded weight
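A short sanity check, assuming the row count after padding equals k_blocks * group_size (zero-padded at the bottom):

```python
import numpy as np
from neural_compressor.onnxrt.algorithms.weight_only.utility import pad_tensor

group_size = 32
w = np.ones((30, 8), dtype=np.float32)                  # 30 rows, not a multiple of 32
k_blocks = (w.shape[0] + group_size - 1) // group_size  # 1 block
padded = pad_tensor(w, group_size, k_blocks)
assert padded.shape == (32, 8)                          # rows padded up to k_blocks * group_size
```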
- neural_compressor.onnxrt.algorithms.weight_only.utility.quant_tensor(data: numpy.array, num_bits: int = 4, group_size: int = 32, scheme: str = 'asym', dtype: str = 'int', ratio: float = 1.0)[source]
Quantize tensor per group.
- Parameters:
data (np.array) – input weight
num_bits (int, optional) – number of bits used to represent weights. Defaults to 4.
group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.
scheme (str, optional) – quantization scheme. Defaults to “asym”.
dtype (str, optional) – data type. Defaults to “int”.
ratio (float, optional) – percentile of clip. Defaults to 1.0.
- Returns:
output – quantized weight
scale – scale
zero_point – zero point
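A hedged example of group-wise 4-bit asymmetric quantization; the 64x128 shape is arbitrary but keeps the element count divisible by group_size:

```python
import numpy as np
from neural_compressor.onnxrt.algorithms.weight_only.utility import quant_tensor

weight = np.random.randn(64, 128).astype(np.float32)
q_weight, scale, zero_point = quant_tensor(
    weight, num_bits=4, group_size=32, scheme="asym", dtype="int", ratio=1.0
)
# Every group of 32 consecutive elements shares one (scale, zero_point) pair.
```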
- neural_compressor.onnxrt.algorithms.weight_only.utility.qdq_tensor(data: numpy.array, num_bits: int = 4, group_size: int = 32, scheme: str = 'asym', dtype: str = 'int', ratio: float = 1.0)[source]
Quantize-dequantize tensor per group.
- Parameters:
data (np.array) – input weight
num_bits (int, optional) – number of bits used to represent weights. Defaults to 4.
group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.
scheme (str, optional) – quantization scheme. Defaults to “asym”.
dtype (str, optional) – data type. Defaults to “int”.
ratio (float, optional) – percentile of clip. Defaults to 1.0.
- Returns:
output – quant-dequant weight
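A hedged example that uses the quantize-dequantize round trip to estimate the error introduced by group-wise 4-bit quantization, assuming the output keeps the input's shape:

```python
import numpy as np
from neural_compressor.onnxrt.algorithms.weight_only.utility import qdq_tensor

weight = np.random.randn(64, 128).astype(np.float32)
recovered = qdq_tensor(weight, num_bits=4, group_size=32, scheme="asym")
mse = float(np.mean((weight - recovered) ** 2))  # per-element reconstruction error
print(f"4-bit group-wise quantization MSE: {mse:.6f}")
```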