neural_compressor.adaptor.ox_utils.weight_only

WeightOnly for onnxrt adaptor.

Module Contents

Functions

get_blob_size(group_size, has_zp)

Get blob_size.

make_matmul_weight_only_node(node, weight_shape, ...)

Build MatMulFpQ4 node.

quant_tensor(data[, num_bits, group_size, scheme, ...])

Quantize tensor per group.

qdq_tensor(data[, num_bits, group_size, scheme, ...])

Quantize and dequantize tensor per group.

pad_tensor(weight, group_size, k_blocks)

Pad tensor rows so that the number of rows is divisible by group_size.

rtn_quantize(model[, weight_config, num_bits, ...])

Quantize the model with the round-to-nearest (RTN) method.

get_weight_scale(weight, group_size)

Get the scale of weight.

apply_awq_scale(model, weight_config, absorb_pairs, ...)

Apply scaling to salient weights.

apply_awq_clip(model, weight_config, absorb_pairs, ...)

Apply clipping to weights by checking MSE.

prepare_inputs(model, n_samples, dataloader, providers)

Prepare inputs for weight only quantization.

awq_quantize(model, dataloader[, weight_config, ...])

Quantize the model with the Activation-aware Weight Quantization (AWQ) method.

gptq(W, H[, num_bits, group_size, scheme, blocksize, ...])

Quantize the weight with the GPTQ method.

gptq_quantize(model, dataloader[, weight_config, ...])

Quantize the model with the GPTQ method.

neural_compressor.adaptor.ox_utils.weight_only.get_blob_size(group_size, has_zp)[source]

Get blob_size.

Parameters:
  • group_size (int) – how many elements share one scale/zp

  • has_zp (bool) – whether a zero point is present
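
As a minimal usage sketch (assuming a 4-bit setup; the exact packed layout depends on the installed onnxruntime version, so the returned byte count should be treated as opaque):

    from neural_compressor.adaptor.ox_utils.weight_only import get_blob_size

    # Bytes occupied by one packed group of 32 quantized elements; whether
    # scale/zero-point bytes are folded in depends on the onnxruntime version.
    blob_size = get_blob_size(group_size=32, has_zp=True)
    print(blob_size)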

neural_compressor.adaptor.ox_utils.weight_only.make_matmul_weight_only_node(node, weight_shape, num_bits, group_size, k_blocks, q_weight, scale, zero_point, accuracy_level=0)[source]

Build MatMulFpQ4 node.

Parameters:
  • node – original matmul node

  • weight_shape – original weight shape

  • num_bits (int) – number of bits

  • group_size (int) – how many elements share one scale/zp

  • k_blocks (int) – number of blocks

  • q_weight (array) – quantized weight

  • scale (array) – scale

  • zero_point (array) – zero point

  • accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel).

Returns:

matmul_weight_only_node: MatMulFpQ4 or MatMulNBits node
new_inits: initializers of the new node

Return type:

matmul_weight_only_node
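
A hedged sketch of assembling such a node from a freshly quantized weight; the MatMul node, the (K, N) weight, the transpose before quant_tensor, and the "uint" dtype are assumptions made for illustration, not a verbatim excerpt from the library:

    import numpy as np
    import onnx
    from neural_compressor.adaptor.ox_utils.weight_only import (
        make_matmul_weight_only_node, pad_tensor, quant_tensor,
    )

    # Hypothetical MatMul node and a random (K, N) weight, purely for illustration.
    node = onnx.helper.make_node("MatMul", inputs=["x", "fc2_weight"], outputs=["y"], name="fc2")
    weight = np.random.randn(64, 128).astype(np.float32)

    group_size = 32
    k_blocks = (weight.shape[0] + group_size - 1) // group_size   # row blocks of 32 elements
    padded = pad_tensor(weight, group_size, k_blocks)

    # Per-group quantization; "uint" packing is assumed to match the 4-bit node format.
    q_weight, scale, zp = quant_tensor(padded.T, num_bits=4, group_size=group_size,
                                       scheme="asym", dtype="uint")

    new_node, new_inits = make_matmul_weight_only_node(
        node=node, weight_shape=weight.shape, num_bits=4, group_size=group_size,
        k_blocks=k_blocks, q_weight=q_weight.astype("uint8"),
        scale=scale.astype(np.float32), zero_point=zp, accuracy_level=0,
    )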

neural_compressor.adaptor.ox_utils.weight_only.quant_tensor(data, num_bits=4, group_size=32, scheme='asym', dtype='int', ratio=1.0)[source]

Quantize tensor per group.

Parameters:
  • data – input weight

  • num_bits (int, optional) – number of bits. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.

  • scheme (str, optional) – quantization scheme. Defaults to “asym”.

  • dtype (str, optional) – data type. Defaults to “int”.

  • ratio (float, optional) – percentile of clip. Defaults to 1.0.

Returns:

output: quantized weight
scale: scale
zero_point: zero point

Return type:

output
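
A minimal sketch; the (64, 32) shape is chosen so the flattened tensor divides evenly into groups of 32 (otherwise pad_tensor below can be applied first):

    import numpy as np
    from neural_compressor.adaptor.ox_utils.weight_only import quant_tensor

    data = np.random.randn(64, 32).astype(np.float32)
    q_weight, scale, zero_point = quant_tensor(data, num_bits=4, group_size=32, scheme="asym")
    # Every group of 32 elements shares one scale and one zero point.
    print(q_weight.shape, scale.shape, zero_point.shape)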

neural_compressor.adaptor.ox_utils.weight_only.qdq_tensor(data, num_bits=4, group_size=32, scheme='asym', dtype='int', ratio=1.0)[source]

Quantize and dequantize tensor per group.

Parameters:
  • data – input weight

  • num_bits (int, optional) – number of bits. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.

  • scheme (str, optional) – quantization scheme. Defaults to “asym”.

  • dtype (str, optional) – data type. Defaults to “int”.

  • ratio (float, optional) – percentile of clip. Defaults to 1.0.

Returns:

quant-dequant weight

Return type:

output
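
A sketch of a fake-quantization round trip; the reshape at the end is defensive, in case the helper returns the grouped (-1, group_size) layout rather than the original shape:

    import numpy as np
    from neural_compressor.adaptor.ox_utils.weight_only import qdq_tensor

    data = np.random.randn(64, 32).astype(np.float32)
    qdq = qdq_tensor(data, num_bits=4, group_size=32, scheme="asym")
    # Values are preserved up to the 4-bit per-group quantization error.
    print(np.abs(qdq.reshape(data.shape) - data).max())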

neural_compressor.adaptor.ox_utils.weight_only.pad_tensor(weight, group_size, k_blocks)[source]

Pad tensor rows so that the number of rows is divisible by group_size.

Parameters:
  • weight (array) – weight

  • group_size (int) – how many elements share one scale/zp

  • k_blocks (int) – the number of blocks

Returns:

padded weight

Return type:

weight
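
For example, a (70, 128) weight with group_size 32 needs 3 row blocks and gets zero-padded to 96 rows; computing k_blocks as ceil(rows / group_size) is an assumption about the intended usage, not part of the documented signature:

    import math
    import numpy as np
    from neural_compressor.adaptor.ox_utils.weight_only import pad_tensor

    weight = np.random.randn(70, 128).astype(np.float32)
    group_size = 32
    k_blocks = math.ceil(weight.shape[0] / group_size)   # 3 blocks of 32 rows
    padded = pad_tensor(weight, group_size, k_blocks)
    print(padded.shape)   # expected (96, 128): rows padded with zeros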

neural_compressor.adaptor.ox_utils.weight_only.rtn_quantize(model, weight_config={}, num_bits=4, group_size=32, scheme='asym', ratios={}, accuracy_level=0, providers=['CPUExecutionProvider'])[source]

Quantize the model with the round-to-nearest (RTN) method.

Parameters:
  • model (ModelProto or ONNXModel) – onnx model

  • weight_config (dict) –

    quantization config. For example:

        weight_config = {
            'fc2': {
                'bits': 4,
                'group_size': 32,
                'scheme': 'sym',
                'algorithm': 'RTN'
            }
        }

  • num_bits (int, optional) – number of bits. Default is 4.

  • group_size (int, optional) – how many elements share one scale/zp. Default is 32.

  • scheme (str, optional) – sym or asym. Defaults to “asym”.

  • ratios (dict, optional) – percentile of clip. Defaults to {}.

  • accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel).

  • providers (list) – providers to use

Returns:

fake quantized ONNXModel

Return type:

model
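
A hedged end-to-end sketch; the model path and the node name "fc2" are placeholders:

    import onnx
    from neural_compressor.adaptor.ox_utils.weight_only import rtn_quantize

    model = onnx.load("model.onnx")   # hypothetical path
    weight_config = {
        "fc2": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "RTN"},
    }
    q_model = rtn_quantize(model, weight_config=weight_config, num_bits=4,
                           group_size=32, scheme="asym",
                           providers=["CPUExecutionProvider"])
    # q_model is the fake-quantized ONNXModel described above.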

neural_compressor.adaptor.ox_utils.weight_only.get_weight_scale(weight, group_size)[source]

Get the scale of weight.

neural_compressor.adaptor.ox_utils.weight_only.apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits, group_size, scheme)[source]

Apply scaling to salient weights.

neural_compressor.adaptor.ox_utils.weight_only.apply_awq_clip(model, weight_config, absorb_pairs, output_dicts, num_bits, group_size, scheme)[source]

Apply clipping to weights by checking MSE.

neural_compressor.adaptor.ox_utils.weight_only.prepare_inputs(model, n_samples, dataloader, providers)[source]

Prepare inputs for weight only quantization.

Parameters:
  • model (ModelProto or ONNXModel) – onnx model

  • n_samples (int, optional) – calibration sample number. -1 means all samples.

  • dataloader (object) – dataloader for calibration.

  • providers (list) – providers to use

Returns:

inputs: prepared inputs
so: session options

Return type:

inputs
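
A hedged sketch; the model path is a placeholder, the ONNXModel wrapper import is an assumption about the package layout, and the toy dataloader (yielding (input, label) batches and exposing batch_size) only illustrates the calibration-dataloader contract this module appears to expect:

    import numpy as np
    from neural_compressor.model.onnx_model import ONNXModel  # assumed import path
    from neural_compressor.adaptor.ox_utils.weight_only import prepare_inputs

    class ToyDataloader:                       # hypothetical stand-in calibration dataloader
        batch_size = 1
        def __iter__(self):
            for _ in range(8):
                # (input, label) pairs; the input must match the model's single input shape
                yield np.random.randn(1, 3, 224, 224).astype(np.float32), None

    model = ONNXModel("model.onnx")            # hypothetical single-input model
    inputs, so = prepare_inputs(model, n_samples=8, dataloader=ToyDataloader(),
                                providers=["CPUExecutionProvider"])
    # `inputs` holds the prepared calibration inputs and `so` the session options.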

neural_compressor.adaptor.ox_utils.weight_only.awq_quantize(model, dataloader, weight_config={}, num_bits=4, group_size=32, scheme='asym', n_samples=128, enable_auto_scale=True, enable_mse_search=True, accuracy_level=0, providers=['CPUExecutionProvider'])[source]

Quantize the model with the Activation-aware Weight Quantization (AWQ) method.

Parameters:
  • model (ModelProto or ONNXModel) – onnx model

  • dataloader (object) – dataloader for calibration.

  • weight_config (dict) –

    quantization config. For example:

        weight_config = {
            'fc2': {
                'bits': 4,
                'group_size': 32,
                'scheme': 'sym',
                'algorithm': 'AWQ'
            }
        }

  • num_bits (int, optional) – number of bits. Default is 4.

  • group_size (int, optional) – how many elements share one scale/zp. Default is 32.

  • scheme (str, optional) – sym or asym. Defaults to “asym”.

  • n_samples (int, optional) – calibration sample number. Defaults to 128.

  • enable_auto_scale (bool, optional) – whether to enable scaling for salient weights. Defaults to True.

  • enable_mse_search (bool, optional) – whether to enable weight clipping by checking MSE. Defaults to True.

  • accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel).

  • providers (list) – providers to use

Returns:

fake quantized ONNXModel

Return type:

model
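
A hedged usage sketch parallel to rtn_quantize above; the model path, the node name, and calib_dataloader (a calibration dataloader of the kind sketched for prepare_inputs) are placeholders:

    import onnx
    from neural_compressor.adaptor.ox_utils.weight_only import awq_quantize

    model = onnx.load("model.onnx")            # hypothetical path
    weight_config = {
        "fc2": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "AWQ"},
    }
    # calib_dataloader is a placeholder; see the toy dataloader sketched for prepare_inputs.
    q_model = awq_quantize(model, calib_dataloader, weight_config=weight_config,
                           n_samples=128, enable_auto_scale=True, enable_mse_search=True)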

neural_compressor.adaptor.ox_utils.weight_only.gptq(W, H, num_bits=4, group_size=32, scheme='asym', blocksize=128, percdamp=0.01, actorder=False, mse=False, perchannel=True)[source]

Quantize the weight with the GPTQ method.

Parameters:
  • W (array) – weight.

  • H (array) – Hessian matrix.

  • num_bits (int, optional) – number of bits. Default is 4.

  • group_size (int, optional) – how many elements share one scale/zp. Default is 32.

  • scheme (str, optional) – sym or asym. Defaults to “asym”.

  • blocksize (int, optional) – block size used when quantizing the weight. Defaults to 128.

  • percdamp (float, optional) – percent of the average Hessian diagonal used for dampening. Defaults to 0.01.

  • actorder (bool, optional) – whether to reorder the Hessian matrix by its diagonal values. Defaults to False.

  • mse (bool, optional) – whether to select scale and zero point using MSE error. Defaults to False.

  • perchannel (bool, optional) – whether to quantize the weight per channel. Defaults to True.

Returns:

fake quantized weight

Return type:

Q
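
A sketch on synthetic data; forming the Hessian as 2 / n_samples * X^T X over calibration activations is the usual GPTQ construction, and the (k, k) layout for a (k, n) weight is an assumption made here, not something this page specifies:

    import numpy as np
    from neural_compressor.adaptor.ox_utils.weight_only import gptq

    k, n, n_samples = 128, 64, 256
    W = np.random.randn(k, n).astype(np.float32)          # weight to quantize
    X = np.random.randn(n_samples, k).astype(np.float32)  # calibration activations
    H = 2.0 / n_samples * (X.T @ X)                       # assumed (k, k) Hessian layout
    Q = gptq(W, H, num_bits=4, group_size=32, scheme="asym",
             blocksize=128, percdamp=0.01, actorder=False, mse=False, perchannel=True)
    print(Q.shape)                                        # fake-quantized weight, same shape as W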

neural_compressor.adaptor.ox_utils.weight_only.gptq_quantize(model, dataloader, weight_config={}, num_bits=4, group_size=32, scheme='asym', n_samples=128, percdamp=0.01, blocksize=128, actorder=False, mse=False, perchannel=True, accuracy_level=0, providers=['CPUExecutionProvider'])[source]

Quantize the model with the GPTQ method.

Parameters:
  • model (ModelProto or ONNXModel) – onnx model

  • dataloader (object) – dataloader for calibration.

  • weight_config (dict) –

    quantization config. For example:

        weight_config = {
            'fc2': {
                'bits': 4,
                'group_size': 32,
                'scheme': 'sym',
                'algorithm': 'GPTQ'
            }
        }

  • num_bits (int, optional) – number of bits. Default is 4.

  • group_size (int, optional) – how many elements share one scale/zp. Default is 32.

  • scheme (str, optional) – sym or asym. Defaults to “asym”.

  • n_samples (int, optional) – calibration sample number. Defaults to 128.

  • percdamp (float, optional) – percent of the average Hessian diagonal used for dampening. Defaults to 0.01.

  • blocksize (int, optional) – block size used when quantizing the weight. Defaults to 128.

  • actorder (bool, optional) – whether to reorder the Hessian matrix by its diagonal values. Defaults to False.

  • mse (bool, optional) – whether to select scale and zero point using MSE error. Defaults to False.

  • perchannel (bool, optional) – whether to quantize the weight per channel. Defaults to True.

  • accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel).

  • providers (list) – providers to use

Returns:

fake quantized ONNXModel

Return type:

model
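
A hedged usage sketch parallel to awq_quantize above; the model path, the node name, and calib_dataloader are placeholders:

    import onnx
    from neural_compressor.adaptor.ox_utils.weight_only import gptq_quantize

    model = onnx.load("model.onnx")            # hypothetical path
    weight_config = {
        "fc2": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "GPTQ"},
    }
    # calib_dataloader is a placeholder; see the toy dataloader sketched for prepare_inputs.
    q_model = gptq_quantize(model, calib_dataloader, weight_config=weight_config,
                            n_samples=128, percdamp=0.01, blocksize=128,
                            actorder=False, mse=False, perchannel=True)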