neural_compressor.torch.algorithms.weight_only.utility

Module Contents

Classes

GraphTrace

Functions

quantize_4bit(tensor[, quantile, dtype, return_int])

Quantize a tensor to the NF4/FP4 data type.

qdq_weight_asym(weight[, bits, quantile, return_int])

Quantize and dequantize a tensor with the asymmetric scheme.

qdq_weight_sym(weight[, bits, quantile, return_int, ...])

Quantize and dequantize a tensor with the symmetric scheme.

qdq_weight_actor(weight, bits, scheme[, quantile, ...])

Quantize and dequantize a tensor per channel. This is an in-place op.

quant_tensor(weight[, bits, group_size, scheme, ...])

Quantize and dequantize a tensor with a given group size. This is an in-place function.

search_clip(m[, bits, group_size, scheme, dtype, ...])

Search for the best clip range of each linear layer in the current block. This is not an in-place function.

quant_weight_w_scale(weight, scale[, zp, group_size, ...])

Quantize a tensor group-wise with a given scale and zero point. This is an in-place function.

model_forward(model, dataloader, iters, device)

forward_wrapper(model, input[, device])

move_input_to_device(input[, device])

set_module(model, key, new_module)

Set new module into model by key name.

fetch_module(model, op_name)

Get module with a given op name.

get_absorb_layers(model, example_inputs[, ...])

Get absorb_to_layer and no_absorb_layer.

get_parent(node[, all_parents])

get_module(model, key)

Get module from model by key name.

get_block_prefix(model)

Get prefix and number of blocks.

get_example_input(dataloader[, i])

Get the example input.

replace_forward(model)

Replace the model's forward to capture the input args and kwargs of the first block for the AWQ algorithm.

recover_forward(model)

Recover model and block forward for AWQ algorithm.

get_module_input_output(model[, module_hook_config, ...])

A helper function to get the input and output tensors of the modules specified in module_hook_config.

Attributes

NF4

FP4_BNB

FP4_E2M1

NF4_BIT

FP4_BNB_BIT

FP4_E2M1_BIT

FLOAT_MAPPING

INT_MAPPING

neural_compressor.torch.algorithms.weight_only.utility.quantize_4bit(tensor, quantile=1.0, dtype='nf4', return_int=False, **kwargs)[source]

Quantize a tensor to the NF4/FP4 data type.

Parameters:
  • tensor – input tensor.

  • quantile (float, optional) – clip percentile. Defaults to 1.0.

  • dtype (str, optional) – data type, e.g. 'nf4' or 'fp4'. Defaults to 'nf4'.

  • return_int (bool, optional) – whether to return int data. Defaults to False.

Returns:

fake quantized tensor

Return type:

q_tensor
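
A minimal usage sketch based on the signature above; with return_int=False (the default) the call returns a fake-quantized floating-point tensor:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import quantize_4bit

    weight = torch.randn(64, 64)
    # Fake-quantize to NF4: values are mapped to the nearest NF4 level
    # and returned in floating point.
    q_weight = quantize_4bit(weight, quantile=1.0, dtype="nf4", return_int=False)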

neural_compressor.torch.algorithms.weight_only.utility.qdq_weight_asym(weight, bits=4, quantile=1.0, return_int=False, **kwargs)[source]

Quantize and dequantize a tensor with the asymmetric scheme.

Parameters:
  • weight – input weight.

  • bits (int, optional) – number of bits. Defaults to 4.

  • quantile (float, optional) – clip percentile. Defaults to 1.0.

  • return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.

Returns:

qdq weight

Return type:

output
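
A usage sketch grounded in the documented defaults; the return is the fake-quantized fp32 weight unless return_int=True:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import qdq_weight_asym

    weight = torch.randn(128, 128)
    # Asymmetric quant-dequant round trip at 4 bits, no clipping (quantile=1.0).
    qdq = qdq_weight_asym(weight, bits=4, quantile=1.0)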

neural_compressor.torch.algorithms.weight_only.utility.qdq_weight_sym(weight, bits=4, quantile=1.0, return_int=False, full_range=False, **kwargs)[source]

Quantize and dequantize a tensor with the symmetric scheme.

Parameters:
  • weight – input weight.

  • bits (int, optional) – number of bits. Defaults to 4.

  • quantile (float, optional) – clip percentile. Defaults to 1.0.

  • return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.

  • full_range (bool, optional) – whether the symmetric range uses -2**(bits-1). For example, at 4 bits: scale = amax / 8 if full_range else amax / 7. If True, scale = -scale if abs(min) > abs(max) else scale. Defaults to False.

Returns:

qdq weight

Return type:

output
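
To make full_range concrete, here is a sketch of the scale computation the parameter description above spells out (an illustration of the documented formula, not the library's internal code):

    import torch

    def sym_scale(weight, bits=4, full_range=False):
        # Per the doc: at 4 bits, scale = amax / 8 if full_range else amax / 7,
        # i.e. the divisor is 2**(bits - 1) with the full range,
        # otherwise 2**(bits - 1) - 1.
        amax = weight.abs().max()
        divisor = 2 ** (bits - 1) if full_range else 2 ** (bits - 1) - 1
        scale = amax / divisor
        # With full_range=True, the sign flips when |min| exceeds |max|.
        if full_range and weight.min().abs() > weight.max().abs():
            scale = -scale
        return scale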

neural_compressor.torch.algorithms.weight_only.utility.qdq_weight_actor(weight, bits, scheme, quantile=1.0, dtype='int', return_int=False, full_range=False, **kwargs)[source]

Quantize and dequantize a tensor per channel. This is an in-place op.

Parameters:
  • weight – input weight.

  • bits (int) – number of bits.

  • scheme (str) – quantization scheme, 'sym' or 'asym'.

  • quantile (float, optional) – clip percentile. Defaults to 1.0.

  • dtype (str, optional) – select from 'int', 'nf4', 'fp4'. Defaults to 'int'.

  • return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.

  • full_range (bool, optional) – whether the symmetric range uses -2**(bits-1). Defaults to False.

Returns:

qdq weight

Return type:

output
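
Since this op is documented as in-place, a sketch that clones the weight first when the original is still needed:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import qdq_weight_actor

    weight = torch.randn(128, 128)
    # bits and scheme are required here; clone because the op
    # modifies its input in place.
    qdq = qdq_weight_actor(weight.clone(), bits=4, scheme="sym", dtype="int")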

neural_compressor.torch.algorithms.weight_only.utility.quant_tensor(weight, bits=4, group_size=-1, scheme='asym', quantile=1.0, dtype='int', return_int=False, full_range=False, **kwargs)[source]

Quantize and dequantize a tensor with a given group size. This is an in-place function.

Parameters:
  • weight – input weight.

  • bits (int, optional) – number of bits. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zero point. Defaults to -1.

  • scheme (str, optional) – 'sym' or 'asym'. Defaults to 'asym'.

  • quantile (float, optional) – clip percentile. Defaults to 1.0.

  • dtype (str, optional) – select from 'int', 'nf4', 'fp4'. Defaults to 'int'.

  • return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.

  • full_range (bool, optional) – whether the symmetric range uses -2**(bits-1). Defaults to False.

Returns:

qdq weight.

Return type:

output
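
A group-wise sketch using the documented parameters; group_size=32 means each group of 32 weight elements shares one scale/zero-point pair, while -1 would use one pair per channel:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import quant_tensor

    weight = torch.randn(128, 128)
    # In-place group-wise quant-dequant: clone to keep the original weight.
    qdq = quant_tensor(weight.clone(), bits=4, group_size=32, scheme="asym")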

neural_compressor.torch.algorithms.weight_only.utility.search_clip(m, bits=4, group_size=32, scheme='asym', dtype='int', enable_full_range=False)[source]

Search for the best clip range of each linear layer in the current block. This is not an in-place function.

Parameters:
  • m (torch.nn.Module) – torch module.

  • bits (int, optional) – number of bits. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zero point. Defaults to 32.

  • scheme (str, optional) – 'sym' or 'asym'. Defaults to 'asym'.

  • dtype (str, optional) – select from 'int', 'nf4', 'fp4'. Defaults to 'int'.

  • enable_full_range (bool, optional) – whether the symmetric range uses -2**(bits-1). Defaults to False.

Returns:

the best clip percentile

Return type:

best_clip_ratio (float)
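
A sketch of searching the clip ratio for a single linear layer, assuming a bare torch.nn.Linear is an acceptable input module:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import search_clip

    linear = torch.nn.Linear(256, 256)
    # Returns the best clip percentile; the module itself is left unmodified.
    best_clip_ratio = search_clip(linear, bits=4, group_size=32, scheme="asym")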

neural_compressor.torch.algorithms.weight_only.utility.quant_weight_w_scale(weight, scale, zp=None, group_size=-1, dtype='int')[source]

Quantize a tensor group-wise with a given scale and zero point. This is an in-place function.

Parameters:
  • weight – input weight.

  • scale – quantization scale.

  • zp – zero point. Defaults to None.

  • group_size (int, optional) – how many elements share one scale/zero point. Defaults to -1.

  • dtype (str, optional) – data type, used for 'nf4'/'fp4'. Defaults to 'int'.

Returns:

int weight.

Return type:

output
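
A sketch with a precomputed per-group scale; the scale shape used here (one value per output channel per group) is an assumption for illustration:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import quant_weight_w_scale

    out_features, in_features, group_size = 64, 128, 32
    weight = torch.randn(out_features, in_features)
    # Assumed layout: one scale per (output channel, group) pair.
    scale = torch.rand(out_features, in_features // group_size) + 0.1
    # zp=None implies symmetric quantization; weight is modified in place.
    q_weight = quant_weight_w_scale(weight, scale, zp=None, group_size=group_size)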

neural_compressor.torch.algorithms.weight_only.utility.set_module(model, key, new_module)[source]

Set new module into model by key name.

Parameters:
  • model (torch.nn.Module) – original model

  • key (str) – module name to be replaced

  • new_module (torch.nn.Module) – new module to be inserted

neural_compressor.torch.algorithms.weight_only.utility.fetch_module(model, op_name)[source]

Get module with a given op name.

Parameters:
  • model (object) – the input model.

  • op_name (str) – name of op.

Returns:

module (object).
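
A sketch of the fetch-then-replace pattern that fetch_module and set_module support, on a toy module (the submodule name fc1 is illustrative):

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import fetch_module, set_module

    class Toy(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = torch.nn.Linear(8, 8)

    model = Toy()
    # Fetch a submodule by its dotted op name, then write a replacement
    # back under the same key.
    old_fc = fetch_module(model, "fc1")
    set_module(model, "fc1", torch.nn.Linear(8, 8))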

neural_compressor.torch.algorithms.weight_only.utility.get_absorb_layers(model, example_inputs, supported_layers=['Linear'], folding=False)[source]

Get absorb_to_layer and no_absorb_layer.

Parameters:
  • model (torch.nn.Module) – input model

  • example_inputs – example_inputs

  • supported_layers (list, optional) – supported layer types. Defaults to ['Linear'].

  • folding (bool, optional) – whether to allow self-absorption. Defaults to False.

Returns:

absorb_to_layer: dict mapping each absorb layer to the layers it absorbs, e.g. {absorb_layer: [absorbed_1, ...]}. no_absorb_layers: list of layers that cannot be absorbed.

Return type:

(absorb_to_layer, no_absorb_layers)

neural_compressor.torch.algorithms.weight_only.utility.get_module(model, key)[source]

Get module from model by key name.

Parameters:
  • model (torch.nn.Module) – original model

  • key (str) – name of the module to fetch.

neural_compressor.torch.algorithms.weight_only.utility.get_block_prefix(model)[source]

Get prefix and number of blocks.

Parameters:

model (torch.nn.Module) – input model

Returns:

block_prefix (str): name of the block list in the model. block_num (int): number of blocks in the block list.

Return type:

(block_prefix, block_num)
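
A sketch assuming a Hugging Face causal LM is available (the checkpoint name is illustrative); for a decoder-only model the prefix points at the list of decoder blocks:

    from transformers import AutoModelForCausalLM
    from neural_compressor.torch.algorithms.weight_only.utility import get_block_prefix

    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
    # Returns the dotted name of the block list and how many blocks it holds.
    block_prefix, block_num = get_block_prefix(model)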

neural_compressor.torch.algorithms.weight_only.utility.get_example_input(dataloader, i=1)[source]

Get the example input.

Parameters:
  • dataloader (object) – calibration dataset.

  • i (int, optional) – index of the example input to fetch. Defaults to 1.

Returns:

example_inp (object).
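
A sketch with a toy calibration dataloader; per the signature, i selects which batch to return:

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from neural_compressor.torch.algorithms.weight_only.utility import get_example_input

    dataloader = DataLoader(TensorDataset(torch.randn(16, 8)), batch_size=4)
    # Fetch the batch at index 1 (the default) to use as an example input.
    example_inp = get_example_input(dataloader, i=1)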

neural_compressor.torch.algorithms.weight_only.utility.replace_forward(model)[source]

Replace the model's forward to capture the input args and kwargs of the first block for the AWQ algorithm.

Parameters:

model (torch.nn.Module) – input model.

Raises:

ValueError – raised deliberately to stop inference, avoiding execution of the rest of the model.

Returns:

model with replaced forward.

Return type:

torch.nn.Module

neural_compressor.torch.algorithms.weight_only.utility.recover_forward(model)[source]

Recover model and block forward for AWQ algorithm.

Parameters:

model (torch.nn.Module) – input model.

Returns:

model with recovered forward.

Return type:

torch.nn.Module
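
A sketch of the capture pattern replace_forward and recover_forward enable for AWQ calibration, assuming a block-structured Hugging Face model (the checkpoint name is illustrative); the ValueError is raised by design and must be caught:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from neural_compressor.torch.algorithms.weight_only.utility import (
        recover_forward,
        replace_forward,
    )

    name = "facebook/opt-125m"
    model = AutoModelForCausalLM.from_pretrained(name)
    inputs = AutoTokenizer.from_pretrained(name)("hello", return_tensors="pt")

    model = replace_forward(model)  # forward now captures the first block's args
    try:
        model(**inputs)             # raises ValueError by design (see above)
    except ValueError:
        pass
    model = recover_forward(model)  # restore the original forward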

neural_compressor.torch.algorithms.weight_only.utility.get_module_input_output(model, module_hook_config={}, dataloader=None, iters=-1, calib_func=None, input_func=None, output_func=None)[source]

A helper function to get the input and output tensors of the modules specified in module_hook_config.

Parameters:
  • model – torch model.

  • module_hook_config (dict, optional) –

    module names for which to record 'input' and/or 'output' tensors. Defaults to {}. For example:

    module_hook_config = {
        'fc1': ['output'],
        'fc2': ['input', 'output'],
    }

  • dataloader – dataloader for model input.

  • iters – iterations for inference.

  • calib_func – a custom inference function to replace dataloader and iters.

  • input_func – function to preprocess recorded inputs, to reduce memory usage.

  • output_func – function to preprocess recorded outputs, to reduce memory usage.

Returns:

recorded input_values and output_values, for example:

{
    'fc1': {'input': [], 'output': []},
}

Return type:

total_values
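
A sketch on a toy model; the assumption here is that calib_func receives the model and runs inference, as the dataloader/iters alternative would:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import get_module_input_output

    class Toy(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = torch.nn.Linear(8, 8)
            self.fc2 = torch.nn.Linear(8, 8)

        def forward(self, x):
            return self.fc2(self.fc1(x))

    model = Toy()

    # Record fc1 outputs plus fc2 inputs and outputs during one calibration pass.
    total_values = get_module_input_output(
        model,
        module_hook_config={"fc1": ["output"], "fc2": ["input", "output"]},
        calib_func=lambda m: m(torch.randn(2, 8)),
    )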