neural_compressor.adaptor.ox_utils.weight_only

WeightOnly for onnxrt adaptor.

Module Contents

Functions

get_blob_size(group_size, has_zp)

Get blob_size.

make_matmul_weight_only_node(node, weight_shape, ...)

Build MatMulFpQ4 node.

quant_tensor(data[, num_bits, group_size, scheme, ...])

Quantize tensor per group.

qdq_tensor(data[, num_bits, group_size, scheme, ...])

Quantize and dequantize tensor per group.

pad_tensor(weight, group_size, k_blocks)

Pad tensor rows so that the number of rows is divisible by group_size.

rtn_quantize(model[, weight_config, num_bits, ...])

Quantize the model with the round-to-nearest (RTN) method.

get_weight_scale(weight, group_size)

Get the scale of weight.

apply_awq_scale(model, weight_config, absorb_pairs, ...)

Apply scaling to salient weights.

apply_awq_clip(model, weight_config, absorb_pairs, ...)

Apply clipping to weights by checking MSE.

prepare_inputs(model, n_samples, dataloader, providers)

Prepare inputs for weight only quantization.

awq_quantize(model, dataloader[, weight_config, ...])

Quantize the model with the Activation-aware Weight Quantization (AWQ) method.

gptq(W, H[, num_bits, group_size, scheme, blocksize, ...])

Quantize the weight with the GPTQ method.

gptq_quantize(model, dataloader[, weight_config, ...])

Quantize the model with the GPTQ method.

neural_compressor.adaptor.ox_utils.weight_only.get_blob_size(group_size, has_zp)[source]

Get blob_size.

Parameters:
  • group_size (int) – how many elements share one scale/zp

  • has_zp (bool) – whether a zero point is present
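
As a minimal usage sketch (assuming a 4-bit setup; the exact packed layout depends on the installed onnxruntime version, so the returned byte count should be treated as opaque):

    from neural_compressor.adaptor.ox_utils.weight_only import get_blob_size

    # Bytes occupied by one packed group of 32 quantized elements; whether
    # scale/zero-point bytes are folded in depends on the onnxruntime version.
    blob_size = get_blob_size(group_size=32, has_zp=True)
    print(blob_size)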

neural_compressor.adaptor.ox_utils.weight_only.make_matmul_weight_only_node(node, weight_shape, num_bits, group_size, k_blocks, q_weight, scale, zero_point, accuracy_level=0)[source]

Build MatMulFpQ4 node.

Parameters:
  • node – original matmul node

  • weight_shape – original weight shape

  • num_bits (int) – number of bits

  • group_size (int) – how many elements share one scale/zp

  • k_blocks (int) – number of blocks

  • q_weight (array) – quantized weight

  • scale (array) – scale

  • zero_point (array) – zero point

  • accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel).

Returns:

matmul_weight_only_node: MatMulFpQ4 or MatMulNBits node
new_inits: initializers of the new node

Return type:

matmul_weight_only_node
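
A hedged sketch of assembling such a node from a freshly quantized weight; the MatMul node, the (K, N) weight, the transpose before quant_tensor, and the "uint" dtype are assumptions made for illustration, not a verbatim excerpt from the library:

    import numpy as np
    import onnx
    from neural_compressor.adaptor.ox_utils.weight_only import (
        make_matmul_weight_only_node, pad_tensor, quant_tensor,
    )

    # Hypothetical MatMul node and a random (K, N) weight, purely for illustration.
    node = onnx.helper.make_node("MatMul", inputs=["x", "fc2_weight"], outputs=["y"], name="fc2")
    weight = np.random.randn(64, 128).astype(np.float32)

    group_size = 32
    k_blocks = (weight.shape[0] + group_size - 1) // group_size   # row blocks of 32 elements
    padded = pad_tensor(weight, group_size, k_blocks)

    # Per-group quantization; "uint" packing is assumed to match the 4-bit node format.
    q_weight, scale, zp = quant_tensor(padded.T, num_bits=4, group_size=group_size,
                                       scheme="asym", dtype="uint")

    new_node, new_inits = make_matmul_weight_only_node(
        node=node, weight_shape=weight.shape, num_bits=4, group_size=group_size,
        k_blocks=k_blocks, q_weight=q_weight.astype("uint8"),
        scale=scale.astype(np.float32), zero_point=zp, accuracy_level=0,
    )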

neural_compressor.adaptor.ox_utils.weight_only.quant_tensor(data, num_bits=4, group_size=32, scheme='asym', dtype='int', ratio=1.0)[source]

Quantize tensor per group.

Parameters:
  • data – input weight

  • num_bits (int, optional) – number of bits. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.

  • scheme (str, optional) – quantization scheme. Defaults to “asym”.

  • dtype (str, optional) – data type. Defaults to “int”.

  • ratio (float, optional) – percentile of clip. Defaults to 1.0.

Returns:

output: quantized weight
scale: scale
zero_point: zero point

Return type:

output
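
A minimal sketch; the (64, 32) shape is chosen so the flattened tensor divides evenly into groups of 32 (otherwise pad_tensor below can be applied first):

    import numpy as np
    from neural_compressor.adaptor.ox_utils.weight_only import quant_tensor

    data = np.random.randn(64, 32).astype(np.float32)
    q_weight, scale, zero_point = quant_tensor(data, num_bits=4, group_size=32, scheme="asym")
    # Every group of 32 elements shares one scale and one zero point.
    print(q_weight.shape, scale.shape, zero_point.shape)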

neural_compressor.adaptor.ox_utils.weight_only.qdq_tensor(data, num_bits=4, group_size=32, scheme='asym', dtype='int', ratio=1.0)[source]

Quantize and dequantize tensor per group.

Parameters:
  • data – input weight

  • num_bits (int, optional) – number of bits. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.

  • scheme (str, optional) – quantization scheme. Defaults to “asym”.

  • dtype (str, optional) – data type. Defaults to “int”.

  • ratio (float, optional) – percentile of clip. Defaults to 1.0.

Returns:

quant-dequant weight

Return type:

output
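
A sketch of a fake-quantization round trip; the reshape at the end is defensive, in case the helper returns the grouped (-1, group_size) layout rather than the original shape:

    import numpy as np
    from neural_compressor.adaptor.ox_utils.weight_only import qdq_tensor

    data = np.random.randn(64, 32).astype(np.float32)
    qdq = qdq_tensor(data, num_bits=4, group_size=32, scheme="asym")
    # Values are preserved up to the 4-bit per-group quantization error.
    print(np.abs(qdq.reshape(data.shape) - data).max())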

neural_compressor.adaptor.ox_utils.weight_only.pad_tensor(weight, group_size, k_blocks)[source]

Pad tensor rows so that the number of rows is divisible by group_size.

Parameters:
  • weight (array) – weight

  • group_size (int) – how many elements share one scale/zp

  • k_blocks (int) – the number of blocks

Returns:

padded weight

Return type:

weight
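
For example, a (70, 128) weight with group_size 32 needs 3 row blocks and gets zero-padded to 96 rows; computing k_blocks as ceil(rows / group_size) is an assumption about the intended usage, not part of the documented signature:

    import math
    import numpy as np
    from neural_compressor.adaptor.ox_utils.weight_only import pad_tensor

    weight = np.random.randn(70, 128).astype(np.float32)
    group_size = 32
    k_blocks = math.ceil(weight.shape[0] / group_size)   # 3 blocks of 32 rows
    padded = pad_tensor(weight, group_size, k_blocks)
    print(padded.shape)   # expected (96, 128): rows padded with zeros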

neural_compressor.adaptor.ox_utils.weight_only.rtn_quantize(model, weight_config={}, num_bits=4, group_size=32, scheme='asym', ratios={}, accuracy_level=0, providers=['CPUExecutionProvider'])[source]

Quantize the model with the round-to-nearest (RTN) method.

Parameters:
  • model (ModelProto or ONNXModel) – onnx model

  • weight_config (dict) –

    quantization config. For example:

        weight_config = {
            'fc2': {
                'bits': 4,
                'group_size': 32,
                'scheme': 'sym',
                'algorithm': 'RTN'
            }
        }

  • num_bits (int, optional) – number of bits. Default is 4.

  • group_size (int, optional) – how many elements share one scale/zp. Default is 32.

  • scheme (str, optional) – sym or asym. Defaults to “asym”.

  • ratios (dict, optional) – percentile of clip. Defaults to {}.

  • accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel).

  • providers (list) – providers to use

Returns:

fake quantized ONNXModel

Return type:

model
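
A hedged end-to-end sketch; the model path and the node name "fc2" are placeholders:

    import onnx
    from neural_compressor.adaptor.ox_utils.weight_only import rtn_quantize

    model = onnx.load("model.onnx")   # hypothetical path
    weight_config = {
        "fc2": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "RTN"},
    }
    q_model = rtn_quantize(model, weight_config=weight_config, num_bits=4,
                           group_size=32, scheme="asym",
                           providers=["CPUExecutionProvider"])
    # q_model is the fake-quantized ONNXModel described above.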

neural_compressor.adaptor.ox_utils.weight_only.get_weight_scale(weight, group_size)[source]

Get the scale of weight.

neural_compressor.adaptor.ox_utils.weight_only.apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits, group_size, scheme)[source]

Apply scaling to salient weights.

neural_compressor.adaptor.ox_utils.weight_only.apply_awq_clip(model, weight_config, absorb_pairs, output_dicts, num_bits, group_size, scheme)[source]

Apply clipping to weights by checking MSE.

neural_compressor.adaptor.ox_utils.weight_only.prepare_inputs(model, n_samples, dataloader, providers)[source]

Prepare inputs for weight only quantization.

Parameters:
  • model (ModelProto or ONNXModel) – onnx model

  • n_samples (int, optional) – calibration sample number. -1 means all samples.

  • dataloader (object) – dataloader for calibration.

  • providers (list) – providers to use

Returns:

inputs: prepared inputs
so: session options

Return type:

inputs
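
A hedged sketch; the model path is a placeholder, the ONNXModel wrapper import is an assumption about the package layout, and the toy dataloader (yielding (input, label) batches and exposing batch_size) only illustrates the calibration-dataloader contract this module appears to expect:

    import numpy as np
    from neural_compressor.model.onnx_model import ONNXModel  # assumed import path
    from neural_compressor.adaptor.ox_utils.weight_only import prepare_inputs

    class ToyDataloader:                       # hypothetical stand-in calibration dataloader
        batch_size = 1
        def __iter__(self):
            for _ in range(8):
                # (input, label) pairs; the input must match the model's single input shape
                yield np.random.randn(1, 3, 224, 224).astype(np.float32), None

    model = ONNXModel("model.onnx")            # hypothetical single-input model
    inputs, so = prepare_inputs(model, n_samples=8, dataloader=ToyDataloader(),
                                providers=["CPUExecutionProvider"])
    # `inputs` holds the prepared calibration inputs and `so` the session options.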

neural_compressor.adaptor.ox_utils.weight_only.awq_quantize(model, dataloader, weight_config={}, num_bits=4, group_size=32, scheme='asym', n_samples=128, enable_auto_scale=True, enable_mse_search=True, accuracy_level=0, providers=['CPUExecutionProvider'])[source]

Quantize the model with the Activation-aware Weight Quantization (AWQ) method.

Parameters:
  • model (ModelProto or ONNXModel) – onnx model

  • dataloader (object) – dataloader for calibration.

  • weight_config (dict) –

    quantization config. For example:

        weight_config = {
            'fc2': {
                'bits': 4,
                'group_size': 32,
                'scheme': 'sym',
                'algorithm': 'AWQ'
            }
        }

  • num_bits (int, optional) – number of bits. Default is 4.

  • group_size (int, optional) – how many elements share one scale/zp. Default is 32.

  • scheme (str, optional) – sym or asym. Defaults to “asym”.

  • n_samples (int, optional) – calibration sample number. Defaults to 128.

  • enable_auto_scale (bool, optional) – whether to enable scaling for salient weights. Defaults to True.

  • enable_mse_search (bool, optional) – whether to enable weight clipping by checking MSE. Defaults to True.

  • accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel).

  • providers (list) – providers to use

Returns:

fake quantized ONNXModel

Return type:

model
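
A hedged usage sketch parallel to rtn_quantize above; the model path, the node name, and calib_dataloader (a calibration dataloader of the kind sketched for prepare_inputs) are placeholders:

    import onnx
    from neural_compressor.adaptor.ox_utils.weight_only import awq_quantize

    model = onnx.load("model.onnx")            # hypothetical path
    weight_config = {
        "fc2": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "AWQ"},
    }
    # calib_dataloader is a placeholder; see the toy dataloader sketched for prepare_inputs.
    q_model = awq_quantize(model, calib_dataloader, weight_config=weight_config,
                           n_samples=128, enable_auto_scale=True, enable_mse_search=True)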

neural_compressor.adaptor.ox_utils.weight_only.gptq(W, H, num_bits=4, group_size=32, scheme='asym', blocksize=128, percdamp=0.01, actorder=False, mse=False, perchannel=True)[source]

Quantize the weight with the GPTQ method.

Parameters:
  • W (array) – weight.

  • H (array) – Hessian matrix.

  • num_bits (int, optional) – number of bits. Default is 4.

  • group_size (int, optional) – how many elements share one scale/zp. Default is 32.

  • scheme (str, optional) – sym or asym. Defaults to “asym”.

  • blocksize (int, optional) – block size used when quantizing the weight. Defaults to 128.

  • percdamp (float, optional) – percent of the average Hessian diagonal used for dampening. Defaults to 0.01.

  • actorder (bool, optional) – whether to reorder the Hessian matrix by its diagonal values. Defaults to False.

  • mse (bool, optional) – whether to select scale and zero point using MSE error. Defaults to False.

  • perchannel (bool, optional) – whether to quantize the weight per channel. Defaults to True.

Returns:

fake quantized weight

Return type:

Q
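
A sketch on synthetic data; forming the Hessian as 2 / n_samples * X^T X over calibration activations is the usual GPTQ construction, and the (k, k) layout for a (k, n) weight is an assumption made here, not something this page specifies:

    import numpy as np
    from neural_compressor.adaptor.ox_utils.weight_only import gptq

    k, n, n_samples = 128, 64, 256
    W = np.random.randn(k, n).astype(np.float32)          # weight to quantize
    X = np.random.randn(n_samples, k).astype(np.float32)  # calibration activations
    H = 2.0 / n_samples * (X.T @ X)                       # assumed (k, k) Hessian layout
    Q = gptq(W, H, num_bits=4, group_size=32, scheme="asym",
             blocksize=128, percdamp=0.01, actorder=False, mse=False, perchannel=True)
    print(Q.shape)                                        # fake-quantized weight, same shape as W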

neural_compressor.adaptor.ox_utils.weight_only.gptq_quantize(model, dataloader, weight_config={}, num_bits=4, group_size=32, scheme='asym', n_samples=128, percdamp=0.01, blocksize=128, actorder=False, mse=False, perchannel=True, accuracy_level=0, providers=['CPUExecutionProvider'])[source]

Quantize the model with the GPTQ method.

Parameters:
  • model (ModelProto or ONNXModel) – onnx model

  • dataloader (object) – dataloader for calibration.

  • weight_config (dict) –

    quantization config. For example:

        weight_config = {
            'fc2': {
                'bits': 4,
                'group_size': 32,
                'scheme': 'sym',
                'algorithm': 'GPTQ'
            }
        }

  • num_bits (int, optional) – number of bits. Default is 4.

  • group_size (int, optional) – how many elements share one scale/zp. Default is 32.

  • scheme (str, optional) – sym or asym. Defaults to “asym”.

  • n_samples (int, optional) – calibration sample number. Defaults to 128.

  • percdamp (float, optional) – percent of the average Hessian diagonal used for dampening. Defaults to 0.01.

  • blocksize (int, optional) – block size used when quantizing the weight. Defaults to 128.

  • actorder (bool, optional) – whether to reorder the Hessian matrix by its diagonal values. Defaults to False.

  • mse (bool, optional) – whether to select scale and zero point using MSE error. Defaults to False.

  • perchannel (bool, optional) – whether to quantize the weight per channel. Defaults to True.

  • accuracy_level (int) – accuracy level. Supported values: 0 (unset), 1 (fp32 compute type of jblas kernel), 2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel), 4 (int8 compute type of jblas kernel).

  • providers (list) – providers to use

Returns:

fake quantized ONNXModel

Return type:

model
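
A hedged usage sketch parallel to awq_quantize above; the model path, the node name, and calib_dataloader are placeholders:

    import onnx
    from neural_compressor.adaptor.ox_utils.weight_only import gptq_quantize

    model = onnx.load("model.onnx")            # hypothetical path
    weight_config = {
        "fc2": {"bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "GPTQ"},
    }
    # calib_dataloader is a placeholder; see the toy dataloader sketched for prepare_inputs.
    q_model = gptq_quantize(model, calib_dataloader, weight_config=weight_config,
                            n_samples=128, percdamp=0.01, blocksize=128,
                            actorder=False, mse=False, perchannel=True)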