neural_compressor.adaptor.ox_utils.weight_only
WeightOnly for onnxrt adaptor.
Functions
- get_blob_size – Get blob_size.
- make_matmul_weight_only_node – Build MatMulFpQ4 node.
- quant_tensor – Quantize tensor per group.
- qdq_tensor – Quantize and de-quantize tensor per group.
- pad_tensor – Pad tensor rows so that they are divisible by group_size.
- rtn_quantize – Quantize the model with the round-to-nearest (RTN) method.
- get_weight_scale – Get the scale of weight.
- apply_awq_scale – Apply scale for salient weight.
- apply_awq_clip – Apply clip for weight by checking MSE.
- prepare_inputs – Prepare inputs for weight-only quantization.
- awq_quantize – Quantize the model with the Activation-aware Weight Quantization (AWQ) method.
- gptq – Quantize the weight with the GPTQ method.
- gptq_quantize – Quantize the model with the GPTQ method.
Module Contents
- neural_compressor.adaptor.ox_utils.weight_only.get_blob_size(group_size, has_zp)[source]
Get blob_size.
- Parameters:
group_size (int) – how many elements share one scale/zp
has_zp (bool) – whether a zero_point is used
- neural_compressor.adaptor.ox_utils.weight_only.make_matmul_weight_only_node(node, weight_shape, num_bits, group_size, k_blocks, q_weight, scale, zero_point, accuracy_level=0)[source]
Build MatMulFpQ4 node.
- Parameters:
node – original matmul node
weight_shape – original weight shape
num_bits (int) – number of bits
group_size (int) – how many elements share one scale/zp
k_blocks (int) – block number
q_weight (array) – quantized weight
scale (array) – scale
zero_point (array) – zero point
accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel)
- Returns:
matmul_weight_only_node: MatMulFpQ4 or MatMulNBits node; new_inits: initializers of the new node
- Return type:
matmul_weight_only_node
- neural_compressor.adaptor.ox_utils.weight_only.quant_tensor(data, num_bits=4, group_size=32, scheme='asym', dtype='int', ratio=1.0)[source]
Quantize tensor per group.
- Parameters:
data – input weight
num_bits (int, optional) – number of bits. Defaults to 4.
group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.
scheme (str, optional) – quantization scheme. Defaults to “asym”.
dtype (str, optional) – data type. Defaults to “int”.
ratio (float, optional) – percentile of clip. Defaults to 1.0.
- Returns:
quantized weight; scale: scale; zero_point: zero point
- Return type:
output
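To make the per-group scheme concrete, here is a minimal NumPy sketch of the same idea (illustrative only, not the library's implementation; it assumes the flattened weight length is already a multiple of group_size, which pad_tensor below takes care of):

import numpy as np

def quant_per_group_sketch(data, num_bits=4, group_size=32, scheme="asym"):
    # Illustrative only: each row of the reshaped array is one group that
    # shares a single scale/zero_point pair.
    groups = data.reshape(-1, group_size)
    if scheme == "asym":
        minq, maxq = 0, 2**num_bits - 1
        rmin = groups.min(axis=1, keepdims=True)
        rmax = groups.max(axis=1, keepdims=True)
        scale = np.where(rmax > rmin, (rmax - rmin) / maxq, 1.0)
        zero_point = np.round(-rmin / scale)
    else:  # "sym"
        minq, maxq = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
        amax = np.abs(groups).max(axis=1, keepdims=True)
        scale = np.where(amax > 0, amax / maxq, 1.0)
        zero_point = np.zeros_like(scale)
    q_weight = np.clip(np.round(groups / scale) + zero_point, minq, maxq)
    return q_weight.reshape(data.shape), scale, zero_point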
- neural_compressor.adaptor.ox_utils.weight_only.qdq_tensor(data, num_bits=4, group_size=32, scheme='asym', dtype='int', ratio=1.0)[source]
Quantize and de-quantize tensor per group.
- Parameters:
data – input weight
num_bits (int, optional) – number of bits. Defaults to 4.
group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.
scheme (str, optional) – quantization scheme. Defaults to “asym”.
dtype (str, optional) – data type. Defaults to “int”.
ratio (float, optional) – percentile of clip. Defaults to 1.0.
- Returns:
quant-dequant weight
- Return type:
output
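A corresponding fake-quantization sketch, reusing quant_per_group_sketch from the example above (again illustrative rather than the library's code):

def qdq_per_group_sketch(data, num_bits=4, group_size=32, scheme="asym"):
    # Quantize, then map back to float: the result keeps the original dtype
    # but carries the quantization error.
    q, scale, zero_point = quant_per_group_sketch(data, num_bits, group_size, scheme)
    groups = q.reshape(-1, group_size)
    return ((groups - zero_point) * scale).reshape(data.shape)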
- neural_compressor.adaptor.ox_utils.weight_only.pad_tensor(weight, group_size, k_blocks)[source]
Pad tensor rows so that they are divisible by group_size.
- Parameters:
weight (array) – weight
group_size (int) – how many elements share one scale/zp
k_blocks (int) – the number of blocks
- Returns:
padded weight
- Return type:
weight
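A minimal sketch of the padding idea, assuming zero padding along the row axis up to k_blocks * group_size rows (the library's actual padding may differ in detail):

import numpy as np

def pad_rows_sketch(weight, group_size, k_blocks):
    # Zero-pad rows up to k_blocks * group_size so every column splits
    # evenly into k_blocks groups of group_size elements.
    pad_len = k_blocks * group_size - weight.shape[0]
    if pad_len > 0:
        weight = np.pad(weight, ((0, pad_len), (0, 0)), "constant")
    return weight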
- neural_compressor.adaptor.ox_utils.weight_only.rtn_quantize(model, weight_config={}, num_bits=4, group_size=32, scheme='asym', ratios={}, accuracy_level=0, providers=['CPUExecutionProvider'])[source]
Quantize the model with the round-to-nearest (RTN) method.
- Parameters:
model (ModelProto or ONNXModel) – onnx model
weight_config (dict) – quantization config. For example:
weight_config = {'fc2': {'bits': 4, 'group_size': 32, 'scheme': 'sym', 'algorithm': 'RTN'}}
num_bits (int, optional) – number of bits. Default is 4.
group_size (int, optional) – how many elements share one scale/zp. Default is 32.
scheme (str, optional) – sym or asym. Defaults to "asym".
ratios (dict, optional) – percentile of clip. Defaults to {}.
accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel)
providers (list) – providers to use
- Returns:
fake quantized ONNXModel
- Return type:
model
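A typical call might look like the following sketch; the model path and the layer name 'fc2' are placeholders:

import onnx
from neural_compressor.adaptor.ox_utils.weight_only import rtn_quantize

model = onnx.load("model.onnx")  # placeholder path
weight_config = {"fc2": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "RTN"}}
q_model = rtn_quantize(model, weight_config=weight_config)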
- neural_compressor.adaptor.ox_utils.weight_only.get_weight_scale(weight, group_size)[source]
Get the scale of weight.
- neural_compressor.adaptor.ox_utils.weight_only.apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits, group_size, scheme)[source]
Apply scale for salient weight.
- neural_compressor.adaptor.ox_utils.weight_only.apply_awq_clip(model, weight_config, absorb_pairs, output_dicts, num_bits, group_size, scheme)[source]
Apply clip for weight by checking MSE.
- neural_compressor.adaptor.ox_utils.weight_only.prepare_inputs(model, n_samples, dataloader, providers)[source]
Prepare inputs for weight only quantization.
- Parameters:
model (ModelProto or ONNXModel) – onnx model
n_samples (int, optional) – calibration sample number. -1 means all samples.
dataloader (object) – dataloader for calibration.
providers (list) – providers to use
- Returns:
prepared inputs; so: session options
- Return type:
inputs
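A hedged usage sketch; calib_dataloader is an assumed calibration dataloader object and model is the loaded ONNX model:

from neural_compressor.adaptor.ox_utils.weight_only import prepare_inputs

# calib_dataloader: assumed calibration dataloader (placeholder).
inputs, so = prepare_inputs(model, n_samples=128, dataloader=calib_dataloader,
                            providers=["CPUExecutionProvider"])
# "inputs" holds the prepared calibration inputs; "so" is the returned session options.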
- neural_compressor.adaptor.ox_utils.weight_only.awq_quantize(model, dataloader, weight_config={}, num_bits=4, group_size=32, scheme='asym', n_samples=128, enable_auto_scale=True, enable_mse_search=True, accuracy_level=0, providers=['CPUExecutionProvider'])[source]
Quantize the model with the Activation-aware Weight Quantization (AWQ) method.
- Parameters:
model (ModelProto or ONNXModel) – onnx model
dataloader (object) – dataloader for calibration.
weight_config (dict) – quantization config. For example:
weight_config = {'fc2': {'bits': 4, 'group_size': 32, 'scheme': 'sym', 'algorithm': 'AWQ'}}
num_bits (int, optional) – number of bits. Default is 4.
group_size (int, optional) – how many elements share one scale/zp. Default is 32.
scheme (str, optional) – sym or asym. Defaults to "asym".
n_samples (int, optional) – calibration sample number.
enable_auto_scale (bool, optional) – whether to enable scale for salient weight. Defaults to True.
enable_mse_search (bool, optional) – whether to enable clip for weight by checking MSE. Defaults to True.
accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel)
providers (list) – providers to use
- Returns:
fake quantized ONNXModel
- Return type:
model
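A usage sketch along the same lines as rtn_quantize above; calib_dataloader, the model path, and the layer name 'fc2' are placeholders:

import onnx
from neural_compressor.adaptor.ox_utils.weight_only import awq_quantize

model = onnx.load("model.onnx")  # placeholder path
weight_config = {"fc2": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "AWQ"}}
q_model = awq_quantize(model, calib_dataloader, weight_config=weight_config, n_samples=128)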
- neural_compressor.adaptor.ox_utils.weight_only.gptq(W, H, num_bits=4, group_size=32, scheme='asym', blocksize=128, percdamp=0.01, actorder=False, mse=False, perchannel=True)[source]
Quantize the weight with the GPTQ method.
- Parameters:
W (array) – weight.
H (array) – Hessian matrix.
num_bits (int, optional) – number of bits. Default is 4.
group_size (int, optional) – how many elements share one scale/zp. Default is 32.
scheme (str, optional) – sym or asym. Defaults to "asym".
blocksize (int, optional) – block size used to quantize the weight.
percdamp (float, optional) – percent of the average Hessian diagonal to use for dampening.
actorder (bool, optional) – whether to rearrange the Hessian matrix considering the diagonal's values.
mse (bool, optional) – whether to get scale and zero point with MSE error.
perchannel (bool, optional) – whether to quantize the weight per-channel.
- Returns:
fake quantized weight
- Return type:
Q
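To illustrate the idea, here is a simplified Hessian-aware, column-by-column quantization sketch in the spirit of GPTQ, using symmetric per-column quantization and no act-order, grouping, or blocking; it is not the library's exact implementation:

import numpy as np

def gptq_sketch(W, H, num_bits=4, percdamp=0.01):
    # W: [out_features, in_features] weight; H: [in_features, in_features] Hessian.
    W = np.array(W, dtype=np.float64, copy=True)
    columns = W.shape[1]
    Q = np.zeros_like(W)
    # Dampen the Hessian diagonal for numerical stability.
    H = H + percdamp * np.mean(np.diag(H)) * np.eye(columns)
    # Upper-triangular Cholesky factor of the inverse Hessian.
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T
    maxq = 2 ** (num_bits - 1) - 1
    for i in range(columns):
        w = W[:, i]
        amax = np.abs(w).max()
        scale = amax / maxq if amax > 0 else 1.0
        q = np.clip(np.round(w / scale), -maxq - 1, maxq) * scale
        Q[:, i] = q
        # Spread this column's quantization error over the remaining columns,
        # weighted by the corresponding row of the inverse-Hessian factor.
        err = (w - q) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q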
- neural_compressor.adaptor.ox_utils.weight_only.gptq_quantize(model, dataloader, weight_config={}, num_bits=4, group_size=32, scheme='asym', n_samples=128, percdamp=0.01, blocksize=128, actorder=False, mse=False, perchannel=True, accuracy_level=0, providers=['CPUExecutionProvider'])[source]
Quantize the model with the GPTQ method.
- Parameters:
model (ModelProto or ONNXModel) – onnx model
dataloader (object) – dataloader for calibration.
weight_config (dict) – quantization config. For example:
weight_config = {'fc2': {'bits': 4, 'group_size': 32, 'scheme': 'sym', 'algorithm': 'GPTQ'}}
num_bits (int, optional) – number of bits. Default is 4.
group_size (int, optional) – how many elements share one scale/zp. Default is 32.
scheme (str, optional) – sym or asym. Defaults to "asym".
n_samples (int, optional) – calibration sample number.
percdamp (float, optional) – percent of the average Hessian diagonal to use for dampening.
blocksize (int, optional) – block size used to quantize the weight.
actorder (bool, optional) – whether to rearrange the Hessian matrix considering the diagonal's values.
mse (bool, optional) – whether to get scale and zero point with MSE error.
perchannel (bool, optional) – whether to quantize the weight per-channel.
accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel)
providers (list) – providers to use
- Returns:
fake quantized ONNXModel
- Return type:
model
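A usage sketch analogous to awq_quantize above; calib_dataloader, the model path, and the layer name 'fc2' are placeholders:

import onnx
from neural_compressor.adaptor.ox_utils.weight_only import gptq_quantize

model = onnx.load("model.onnx")  # placeholder path
weight_config = {"fc2": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "GPTQ"}}
q_model = gptq_quantize(model, calib_dataloader, weight_config=weight_config, n_samples=128)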