neural_compressor.onnxrt.algorithms.weight_only.utility

Module Contents

Functions

make_matmul_weight_only_node(node, weight_shape, ...)

Build MatMulFpQ4/MatMulNBits node.

prepare_inputs(model, data_reader, providers)

Prepare inputs for weight only quantization.

pad_tensor(weight, group_size, k_blocks)

Pad the tensor's rows so that the row count is divisible by group_size.

quant_tensor(data[, num_bits, group_size, scheme, ...])

Quantize tensor per group.

qdq_tensor(data[, num_bits, group_size, scheme, ...])

Quant dequant tensor per group.

neural_compressor.onnxrt.algorithms.weight_only.utility.make_matmul_weight_only_node(node: onnx.NodeProto, weight_shape: tuple, num_bits: int, group_size: int, k_blocks: int, q_weight: numpy.array, scale: numpy.array, zero_point: numpy.array, accuracy_level: int = 0)[source]

Build MatMulFpQ4/MatMulNBits node.

Parameters:
  • node (onnx.NodeProto) – original matmul node

  • weight_shape (tuple) – original weight shape

  • num_bits (int) – number of bits used to represent weights.

  • group_size (int) – how many elements share one scale/zp

  • k_blocks (int) – number of blocks

  • q_weight (np.array) – quantized weight

  • scale (np.array) – scale

  • zero_point (np.array) – zero point

  • accuracy_level (int, optional) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel). Defaults to 0.

Returns:

matmul_weight_only_node: MatMulFpQ4 or MatMulNBits node
new_inits: initializers of the new node

Return type:

matmul_weight_only_node
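
A hedged usage sketch follows: the weight shape, the layout expected for q_weight, and the return order of quant_tensor are assumptions made for illustration only, not guarantees of the library API.

import numpy as np
from onnx import helper
from neural_compressor.onnxrt.algorithms.weight_only.utility import (
    make_matmul_weight_only_node,
    pad_tensor,
    quant_tensor,
)

# Assumed shapes/layouts for illustration; adapt to the real model.
k, n = 128, 64
group_size = 32
k_blocks = (k + group_size - 1) // group_size

weight = np.random.randn(k, n).astype(np.float32)
weight = pad_tensor(weight, group_size, k_blocks)
# Assumes quant_tensor returns (quantized weight, scale, zero point).
q_weight, scale, zero_point = quant_tensor(weight, num_bits=4, group_size=group_size)

matmul = helper.make_node("MatMul", ["input", "weight"], ["output"], name="matmul_0")
new_node, new_inits = make_matmul_weight_only_node(
    node=matmul,
    weight_shape=(k, n),
    num_bits=4,
    group_size=group_size,
    k_blocks=k_blocks,
    q_weight=q_weight,
    scale=scale,
    zero_point=zero_point,
    accuracy_level=0,
)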

neural_compressor.onnxrt.algorithms.weight_only.utility.prepare_inputs(model, data_reader, providers)[source]

Prepare inputs for weight only quantization.

Parameters:
  • model (ModelProto or ONNXModel) – onnx model.

  • data_reader (CalibrationDataReader) – a calibration data reader.

  • providers (list) – providers to use.

Returns:

inputs: prepared inputs
so: session options

Return type:

inputs
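
A minimal usage sketch, assuming the (inputs, session options) pair described above; the model path, input name, and input shape are placeholders.

import numpy as np
import onnx
from onnxruntime.quantization import CalibrationDataReader
from neural_compressor.onnxrt.algorithms.weight_only.utility import prepare_inputs

class DummyDataReader(CalibrationDataReader):
    # Feeds a single random calibration sample; name/shape are placeholders.
    def __init__(self):
        self._iter = iter([{"input": np.random.randn(1, 128).astype(np.float32)}])

    def get_next(self):
        return next(self._iter, None)

model = onnx.load("model.onnx")  # placeholder path
inputs, so = prepare_inputs(model, DummyDataReader(), providers=["CPUExecutionProvider"])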

neural_compressor.onnxrt.algorithms.weight_only.utility.pad_tensor(weight, group_size, k_blocks)[source]

Pad the tensor's rows so that the row count is divisible by group_size.

Parameters:
  • weight (array) – weight

  • group_size (int) – how many elements share one scale/zp

  • k_blocks (int) – number of blocks

Returns:

padded weight

Return type:

weight
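
An illustrative numpy sketch of the padding this function performs (not the library implementation): the rows (K dimension) are zero-padded so that k_blocks whole blocks of group_size elements are filled.

import numpy as np

def pad_rows(weight: np.ndarray, group_size: int, k_blocks: int) -> np.ndarray:
    # Zero-pad the row (K) dimension up to k_blocks * group_size rows.
    pad_len = k_blocks * group_size - weight.shape[0]
    if pad_len > 0:
        weight = np.pad(weight, ((0, pad_len), (0, 0)), "constant")
    return weight

w = np.ones((70, 8), dtype=np.float32)
padded = pad_rows(w, group_size=32, k_blocks=3)  # -> shape (96, 8)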

neural_compressor.onnxrt.algorithms.weight_only.utility.quant_tensor(data: numpy.array, num_bits: int = 4, group_size: int = 32, scheme: str = 'asym', dtype: str = 'int', ratio: float = 1.0)[source]

Quantize tensor per group.

Parameters:
  • data (np.array) – input weight

  • num_bits (int, optional) – number of bits used to represent weights. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.

  • scheme (str, optional) – quantization scheme. Defaults to “asym”.

  • dtype (str, optional) – data type. Defaults to “int”.

  • ratio (float, optional) – percentile of clip. Defaults to 1.0.

Returns:

output: quantized weight
scale: scale
zero_point: zero point

Return type:

output
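
A minimal numpy sketch of per-group asymmetric integer quantization with the semantics described above (one scale/zero-point per group of group_size elements). It illustrates the math only; it is not the library's implementation or exact return layout.

import numpy as np

def quant_per_group(data, num_bits=4, group_size=32, ratio=1.0):
    # Assumes the trailing dimension is a multiple of group_size (see
    # pad_tensor above); each run of group_size consecutive elements in a
    # row forms one group with its own scale and zero point.
    org_shape = data.shape
    groups = data.reshape(-1, group_size)
    maxq = 2 ** num_bits - 1
    rmin = np.minimum(groups.min(axis=1, keepdims=True) * ratio, 0)
    rmax = np.maximum(groups.max(axis=1, keepdims=True) * ratio, 0)
    scale = np.where(rmax == rmin, 1.0, (rmax - rmin) / maxq)
    zero_point = np.round(-rmin / scale)
    q = np.clip(np.round(groups / scale) + zero_point, 0, maxq)
    return q.reshape(org_shape), scale, zero_point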

neural_compressor.onnxrt.algorithms.weight_only.utility.qdq_tensor(data: numpy.array, num_bits: int = 4, group_size: int = 32, scheme: str = 'asym', dtype: str = 'int', ratio: float = 1.0)[source]

Quant dequant tensor per group.

Parameters:
  • data (np.array) – input weight

  • num_bits (int, optional) – number of bits used to represent weights. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.

  • scheme (str, optional) – quantization scheme. Defaults to “asym”.

  • dtype (str, optional) – data type. Defaults to “int”.

  • ratio (float, optional) – percentile of clip. Defaults to 1.0.

Returns:

quant-dequant weight

Return type:

output
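
Continuing the quant_per_group sketch above, the quant-dequant round trip maps the integers back to float with the per-group scale/zero-point, yielding the quant-dequant weight described here; again a sketch of the math, not the library implementation.

def qdq_per_group(data, num_bits=4, group_size=32, ratio=1.0):
    # Quantize per group (sketch above), then dequantize back to float.
    q, scale, zero_point = quant_per_group(data, num_bits, group_size, ratio)
    groups = q.reshape(-1, group_size)
    dq = (groups - zero_point) * scale
    return dq.reshape(data.shape)

w = np.random.randn(64, 64).astype(np.float32)
err = np.abs(w - qdq_per_group(w)).mean()  # mean absolute quantization error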