:py:mod:`neural_compressor.adaptor.ox_utils.weight_only`
=========================================================

.. py:module:: neural_compressor.adaptor.ox_utils.weight_only

.. autoapi-nested-parse::

   WeightOnly for onnxrt adaptor.


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   neural_compressor.adaptor.ox_utils.weight_only.get_blob_size
   neural_compressor.adaptor.ox_utils.weight_only.make_matmul_weight_only_node
   neural_compressor.adaptor.ox_utils.weight_only.quant_tensor
   neural_compressor.adaptor.ox_utils.weight_only.qdq_tensor
   neural_compressor.adaptor.ox_utils.weight_only.pad_tensor
   neural_compressor.adaptor.ox_utils.weight_only.rtn_quantize
   neural_compressor.adaptor.ox_utils.weight_only.get_weight_scale
   neural_compressor.adaptor.ox_utils.weight_only.apply_awq_scale
   neural_compressor.adaptor.ox_utils.weight_only.apply_awq_clip
   neural_compressor.adaptor.ox_utils.weight_only.prepare_inputs
   neural_compressor.adaptor.ox_utils.weight_only.awq_quantize
   neural_compressor.adaptor.ox_utils.weight_only.gptq
   neural_compressor.adaptor.ox_utils.weight_only.gptq_quantize


.. py:function:: get_blob_size(group_size, has_zp)

   Get blob_size.

   :param group_size: how many elements share one scale/zp
   :type group_size: int
   :param has_zp: whether a zero_point is used
   :type has_zp: bool


.. py:function:: make_matmul_weight_only_node(node, weight_shape, num_bits, group_size, k_blocks, q_weight, scale, zero_point, accuracy_level=0)

   Build MatMulFpQ4 node.

   :param node: original matmul node
   :param weight_shape: original weight shape
   :param num_bits: number of bits
   :type num_bits: int
   :param group_size: how many elements share one scale/zp
   :type group_size: int
   :param k_blocks: block number
   :type k_blocks: int
   :param q_weight: quantized weight
   :type q_weight: array
   :param scale: scale
   :type scale: array
   :param zero_point: zero point
   :type zero_point: array
   :param accuracy_level: accuracy level. Support 0 (unset), 1 (fp32 compute type of jblas kernel),
                          2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel),
                          4 (int8 compute type of jblas kernel)
   :type accuracy_level: int

   :returns: MatMulFpQ4 or MatMulNBits node
             new_inits: initializers of the new node
   :rtype: matmul_weight_only_node


.. py:function:: quant_tensor(data, num_bits=4, group_size=32, scheme='asym', dtype='int', ratio=1.0)

   Quantize tensor per group.

   :param data: input weight
   :param num_bits: number of bits. Defaults to 4.
   :type num_bits: int, optional
   :param group_size: how many elements share one scale/zp. Defaults to 32.
   :type group_size: int, optional
   :param scheme: quantization scheme. Defaults to "asym".
   :type scheme: str, optional
   :param dtype: data type. Defaults to "int".
   :type dtype: str, optional
   :param ratio: percentile used for clipping. Defaults to 1.0.
   :type ratio: float, optional

   :returns: quantized weight
             scale: scale
             zero_point: zero point
   :rtype: output


.. py:function:: qdq_tensor(data, num_bits=4, group_size=32, scheme='asym', dtype='int', ratio=1.0)

   Quantize then de-quantize tensor per group.

   :param data: input weight
   :param num_bits: number of bits. Defaults to 4.
   :type num_bits: int, optional
   :param group_size: how many elements share one scale/zp. Defaults to 32.
   :type group_size: int, optional
   :param scheme: quantization scheme. Defaults to "asym".
   :type scheme: str, optional
   :param dtype: data type. Defaults to "int".
   :type dtype: str, optional
   :param ratio: percentile used for clipping. Defaults to 1.0.
   :type ratio: float, optional

   :returns: quant-dequant weight
   :rtype: output
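
As a quick orientation, the sketch below shows how ``quant_tensor`` and ``qdq_tensor`` might be called on a single weight array. It assumes the array's size is a multiple of ``group_size`` (otherwise ``pad_tensor``, documented below, would be applied first) and that ``quant_tensor`` returns the ``(q_weight, scale, zero_point)`` triple listed in its return fields; treat it as an illustrative sketch rather than a verbatim recipe.

.. code-block:: python

   import numpy as np

   from neural_compressor.adaptor.ox_utils.weight_only import qdq_tensor, quant_tensor

   # Toy FP32 weight whose size is a multiple of group_size=32.
   weight = np.random.randn(64, 128).astype(np.float32)

   # Group-wise 4-bit quantization; the return triple follows the docstring above.
   q_weight, scale, zero_point = quant_tensor(weight.copy(), num_bits=4, group_size=32, scheme="asym")

   # Fake quantization (quantize + de-quantize) to inspect the rounding error.
   dq_weight = qdq_tensor(weight.copy(), num_bits=4, group_size=32, scheme="asym")
   err = np.abs(weight - dq_weight.reshape(weight.shape)).mean()
   print(f"mean absolute quant-dequant error: {err:.6f}")
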
.. py:function:: pad_tensor(weight, group_size, k_blocks)

   Pad the tensor rows so that the row count is divisible by group_size.

   :param weight: weight
   :type weight: array
   :param group_size: how many elements share one scale/zp
   :type group_size: int
   :param k_blocks: the number of blocks
   :type k_blocks: int

   :returns: padded weight
   :rtype: weight


.. py:function:: rtn_quantize(model, weight_config={}, num_bits=4, group_size=32, scheme='asym', ratios={}, accuracy_level=0, providers=['CPUExecutionProvider'])

   Quantize the model with the round-to-nearest (RTN) method.

   :param model: onnx model
   :type model: ModelProto or ONNXModel
   :param weight_config: quantization config
                         For example,
                         weight_config = {
                             'fc2':
                                 {
                                     'bits': 4,
                                     'group_size': 32,
                                     'scheme': 'sym',
                                     'algorithm': 'RTN'
                                 }
                         }
   :type weight_config: dict
   :param num_bits: number of bits. Default is 4.
   :type num_bits: int, optional
   :param group_size: how many elements share one scale/zp. Default is 32.
   :type group_size: int, optional
   :param scheme: sym or asym. Defaults to "asym".
   :type scheme: str, optional
   :param ratios: percentile used for clipping. Defaults to {}.
   :type ratios: dict, optional
   :param accuracy_level: accuracy level. Support 0 (unset), 1 (fp32 compute type of jblas kernel),
                          2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel),
                          4 (int8 compute type of jblas kernel)
   :type accuracy_level: int
   :param providers: providers to use
   :type providers: list

   :returns: fake quantized ONNXModel
   :rtype: model
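
A minimal usage sketch for ``rtn_quantize`` follows. It assumes an FP32 ONNX model on disk whose MatMul node names are known; ``model_fp32.onnx`` and the node name ``'fc2'`` are placeholders, and persisting the result relies on the ``save()`` helper of the returned ``ONNXModel`` wrapper.

.. code-block:: python

   import onnx

   from neural_compressor.adaptor.ox_utils.weight_only import rtn_quantize

   # Placeholder path to an FP32 ONNX model that contains MatMul nodes.
   fp32_model = onnx.load("model_fp32.onnx")

   # Per-node config keyed by MatMul node name ('fc2' is a placeholder),
   # following the weight_config example in the docstring above.
   weight_config = {
       "fc2": {
           "bits": 4,
           "group_size": 32,
           "scheme": "sym",
           "algorithm": "RTN",
       }
   }

   q_model = rtn_quantize(fp32_model, weight_config=weight_config, num_bits=4, group_size=32)
   q_model.save("model_woq_rtn.onnx")  # assumes the ONNXModel wrapper's save() helper
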
.. py:function:: get_weight_scale(weight, group_size)

   Get the scale of weight.


.. py:function:: apply_awq_scale(model, weight_config, absorb_pairs, output_dicts, num_bits, group_size, scheme)

   Apply scale for salient weight.


.. py:function:: apply_awq_clip(model, weight_config, absorb_pairs, output_dicts, num_bits, group_size, scheme)

   Apply clip for weight by checking mse.


.. py:function:: prepare_inputs(model, n_samples, dataloader, providers)

   Prepare inputs for weight only quantization.

   :param model: onnx model
   :type model: ModelProto or ONNXModel
   :param n_samples: calibration sample number. -1 means all samples.
   :type n_samples: int, optional
   :param dataloader: dataloader for calibration.
   :type dataloader: object
   :param providers: providers to use
   :type providers: list

   :returns: prepared inputs.
             so: session options
   :rtype: inputs


.. py:function:: awq_quantize(model, dataloader, weight_config={}, num_bits=4, group_size=32, scheme='asym', n_samples=128, enable_auto_scale=True, enable_mse_search=True, accuracy_level=0, providers=['CPUExecutionProvider'])

   Quantize the model with the Activation-aware Weight Quantization (AWQ) method.

   :param model: onnx model
   :type model: ModelProto or ONNXModel
   :param dataloader: dataloader for calibration.
   :type dataloader: object
   :param weight_config: quantization config
                         For example,
                         weight_config = {
                             'fc2':
                                 {
                                     'bits': 4,
                                     'group_size': 32,
                                     'scheme': 'sym',
                                     'algorithm': 'AWQ'
                                 }
                         }
   :type weight_config: dict
   :param num_bits: number of bits. Default is 4.
   :type num_bits: int, optional
   :param group_size: how many elements share one scale/zp. Default is 32.
   :type group_size: int, optional
   :param scheme: sym or asym. Defaults to "asym".
   :type scheme: str, optional
   :param n_samples: calibration sample number.
   :type n_samples: int, optional
   :param enable_auto_scale: whether to apply scaling to salient weights. Defaults to True.
   :type enable_auto_scale: bool, optional
   :param enable_mse_search: whether to clip weights by checking mse. Defaults to True.
   :type enable_mse_search: bool, optional
   :param accuracy_level: accuracy level. Support 0 (unset), 1 (fp32 compute type of jblas kernel),
                          2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel),
                          4 (int8 compute type of jblas kernel)
   :type accuracy_level: int
   :param providers: providers to use
   :type providers: list

   :returns: fake quantized ONNXModel
   :rtype: model


.. py:function:: gptq(W, H, num_bits=4, group_size=32, scheme='asym', blocksize=128, percdamp=0.01, actorder=False, mse=False, perchannel=True)

   Quantize the weight with the GPTQ method.

   :param W: weight.
   :type W: array
   :param H: Hessian matrix.
   :type H: array
   :param num_bits: number of bits. Default is 4.
   :type num_bits: int, optional
   :param group_size: how many elements share one scale/zp. Default is 32.
   :type group_size: int, optional
   :param scheme: sym or asym. Defaults to "asym".
   :type scheme: str, optional
   :param blocksize: blocksize to quantize weight.
   :type blocksize: int, optional
   :param percdamp: percent of the average Hessian diagonal to use for dampening.
   :type percdamp: float, optional
   :param actorder: whether to rearrange the Hessian matrix by its diagonal values.
   :type actorder: bool, optional
   :param mse: whether to pick scale and zero point by mse error.
   :type mse: bool, optional
   :param perchannel: whether to quantize weight per-channel.
   :type perchannel: bool, optional

   :returns: fake quantized weight
   :rtype: Q


.. py:function:: gptq_quantize(model, dataloader, weight_config={}, num_bits=4, group_size=32, scheme='asym', n_samples=128, percdamp=0.01, blocksize=128, actorder=False, mse=False, perchannel=True, accuracy_level=0, providers=['CPUExecutionProvider'])

   Quantize the model with the GPTQ method.

   :param model: onnx model
   :type model: ModelProto or ONNXModel
   :param dataloader: dataloader for calibration.
   :type dataloader: object
   :param weight_config: quantization config
                         For example,
                         weight_config = {
                             'fc2':
                                 {
                                     'bits': 4,
                                     'group_size': 32,
                                     'scheme': 'sym',
                                     'algorithm': 'GPTQ'
                                 }
                         }
   :type weight_config: dict
   :param num_bits: number of bits. Default is 4.
   :type num_bits: int, optional
   :param group_size: how many elements share one scale/zp. Default is 32.
   :type group_size: int, optional
   :param scheme: sym or asym. Defaults to "asym".
   :type scheme: str, optional
   :param n_samples: calibration sample number.
   :type n_samples: int, optional
   :param percdamp: percent of the average Hessian diagonal to use for dampening.
   :type percdamp: float, optional
   :param blocksize: blocksize to quantize weight.
   :type blocksize: int, optional
   :param actorder: whether to rearrange the Hessian matrix by its diagonal values.
   :type actorder: bool, optional
   :param mse: whether to pick scale and zero point by mse error.
   :type mse: bool, optional
   :param perchannel: whether to quantize weight per-channel.
   :type perchannel: bool, optional
   :param accuracy_level: accuracy level. Support 0 (unset), 1 (fp32 compute type of jblas kernel),
                          2 (fp16 compute type of jblas kernel), 3 (bf16 compute type of jblas kernel),
                          4 (int8 compute type of jblas kernel)
   :type accuracy_level: int
   :param providers: providers to use
   :type providers: list

   :returns: fake quantized ONNXModel
   :rtype: model
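
To make the ``W``/``H`` relationship in the ``gptq`` routine above concrete, here is a self-contained sketch on synthetic data. It assumes ``W`` is the ``(in_features, out_features)`` weight of a MatMul node and ``H`` is the ``(in_features, in_features)`` Hessian approximation accumulated from calibration activations; the shapes and the plain fake-quantized return ``Q`` are inferred from the parameter descriptions, so treat this as illustrative rather than canonical.

.. code-block:: python

   import numpy as np

   from neural_compressor.adaptor.ox_utils.weight_only import gptq

   rng = np.random.default_rng(0)
   in_features, out_features, n_calib = 256, 512, 64

   # Synthetic FP32 weight of a MatMul node and synthetic calibration activations.
   W = rng.standard_normal((in_features, out_features)).astype(np.float32)
   X = rng.standard_normal((n_calib, in_features)).astype(np.float32)

   # Hessian approximation H = 2/n * X^T X, matching W's input dimension (assumption).
   H = (2.0 / n_calib) * (X.T @ X)

   Q = gptq(W.copy(), H, num_bits=4, group_size=32, scheme="asym", percdamp=0.01)
   print("mean absolute fake-quantization error:", np.abs(W - Q).mean())
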