neural_compressor.torch.algorithms.weight_only.utility

Module Contents

Classes

GraphTrace

Functions

quantize_4bit(tensor[, quantile, dtype, return_int])

Quantize a tensor to the NF4/FP4 data type.

qdq_weight_asym(weight[, bits, quantile, return_int])

Quantize and dequantize a tensor with the asymmetric scheme.

qdq_weight_sym(weight[, bits, quantile, return_int, ...])

Quantize and dequantize a tensor with the symmetric scheme.

qdq_weight_actor(weight, bits, scheme[, quantile, ...])

Quantize and dequantize a tensor per channel. This is an in-place op.

quant_tensor(weight[, bits, group_size, scheme, ...])

Quantize and dequantize a tensor with a given group size. This is an in-place function.

search_clip(m[, bits, group_size, scheme, dtype, ...])

Search for the best clip range of each linear layer in the current block. This is not an in-place function.

quant_weight_w_scale(weight, scale[, zp, group_size, ...])

Quantize a tensor group-wise with a given scale and zero point. This is an in-place function.

model_forward(model, dataloader, iters, device)

forward_wrapper(model, input[, device])

move_input_to_device(input[, device])

set_module(model, key, new_module)

Set new module into model by key name.

fetch_module(model, op_name)

Get module with a given op name.

get_absorb_layers(model, example_inputs[, ...])

Get absorb_to_layer and no_absorb_layer.

get_parent(node[, all_parents])

get_module(model, key)

Get module from model by key name.

get_block_prefix(model)

Get prefix and number of blocks.

get_example_input(dataloader[, i])

Get the example input.

replace_forward(model)

Replace the model's forward to capture the input args and kwargs of the first block for the AWQ algorithm.

recover_forward(model)

Recover model and block forward for AWQ algorithm.

get_module_input_output(model[, module_hook_config, ...])

A helper function to get the input and output tensors of the modules specified in module_hook_config.

Attributes

NF4

FP4_BNB

FP4_E2M1

NF4_BIT

FP4_BNB_BIT

FP4_E2M1_BIT

FLOAT_MAPPING

INT_MAPPING

neural_compressor.torch.algorithms.weight_only.utility.quantize_4bit(tensor, quantile=1.0, dtype='nf4', return_int=False, **kwargs)[source]

Quantize a tensor to the NF4/FP4 data type.

Parameters:
  • tensor – input tensor.

  • quantile (float, optional) – clip percentile. Defaults to 1.0.

  • dtype (str, optional) – data type, e.g. 'nf4' or 'fp4'. Defaults to 'nf4'.

  • return_int (bool, optional) – whether to return int data. Defaults to False.

Returns:

fake quantized tensor

Return type:

q_tensor
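
A minimal usage sketch based on the signature above; with return_int=False (the default) the call returns a fake-quantized floating-point tensor:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import quantize_4bit

    weight = torch.randn(64, 64)
    # Fake-quantize to NF4: values are mapped to the nearest NF4 level
    # and returned in floating point.
    q_weight = quantize_4bit(weight, quantile=1.0, dtype="nf4", return_int=False)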

neural_compressor.torch.algorithms.weight_only.utility.qdq_weight_asym(weight, bits=4, quantile=1.0, return_int=False, **kwargs)[source]

Quantize and dequantize a tensor with the asymmetric scheme.

Parameters:
  • weight – input weight.

  • bits (int, optional) – number of bits. Defaults to 4.

  • quantile (float, optional) – clip percentile. Defaults to 1.0.

  • return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.

Returns:

qdq weight

Return type:

output
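
A usage sketch grounded in the documented defaults; the return is the fake-quantized fp32 weight unless return_int=True:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import qdq_weight_asym

    weight = torch.randn(128, 128)
    # Asymmetric quant-dequant round trip at 4 bits, no clipping (quantile=1.0).
    qdq = qdq_weight_asym(weight, bits=4, quantile=1.0)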

neural_compressor.torch.algorithms.weight_only.utility.qdq_weight_sym(weight, bits=4, quantile=1.0, return_int=False, full_range=False, **kwargs)[source]

Quantize and dequantize a tensor with the symmetric scheme.

Parameters:
  • weight – input weight.

  • bits (int, optional) – number of bits. Defaults to 4.

  • quantile (float, optional) – clip percentile. Defaults to 1.0.

  • return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.

  • full_range (bool, optional) – whether the symmetric range uses -2**(bits-1). For example, at 4 bits: scale = amax / 8 if full_range else amax / 7. If True, scale = -scale if abs(min) > abs(max) else scale. Defaults to False.

Returns:

qdq weight

Return type:

output
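
To make full_range concrete, here is a sketch of the scale computation the parameter description above spells out (an illustration of the documented formula, not the library's internal code):

    import torch

    def sym_scale(weight, bits=4, full_range=False):
        # Per the doc: at 4 bits, scale = amax / 8 if full_range else amax / 7,
        # i.e. the divisor is 2**(bits - 1) with the full range,
        # otherwise 2**(bits - 1) - 1.
        amax = weight.abs().max()
        divisor = 2 ** (bits - 1) if full_range else 2 ** (bits - 1) - 1
        scale = amax / divisor
        # With full_range=True, the sign flips when |min| exceeds |max|.
        if full_range and weight.min().abs() > weight.max().abs():
            scale = -scale
        return scale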

neural_compressor.torch.algorithms.weight_only.utility.qdq_weight_actor(weight, bits, scheme, quantile=1.0, dtype='int', return_int=False, full_range=False, **kwargs)[source]

Quantize and dequantize a tensor per channel. This is an in-place op.

Parameters:
  • weight – input weight.

  • bits (int) – number of bits.

  • scheme (str) – quantization scheme, 'sym' or 'asym'.

  • quantile (float, optional) – clip percentile. Defaults to 1.0.

  • dtype (str, optional) – select from 'int', 'nf4', 'fp4'. Defaults to 'int'.

  • return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.

  • full_range (bool, optional) – whether the symmetric range uses -2**(bits-1). Defaults to False.

Returns:

qdq weight

Return type:

output
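
Since this op is documented as in-place, a sketch that clones the weight first when the original is still needed:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import qdq_weight_actor

    weight = torch.randn(128, 128)
    # bits and scheme are required here; clone because the op
    # modifies its input in place.
    qdq = qdq_weight_actor(weight.clone(), bits=4, scheme="sym", dtype="int")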

neural_compressor.torch.algorithms.weight_only.utility.quant_tensor(weight, bits=4, group_size=-1, scheme='asym', quantile=1.0, dtype='int', return_int=False, full_range=False, **kwargs)[source]

Quantize and dequantize a tensor with a given group size. This is an in-place function.

Parameters:
  • weight – input weight.

  • bits (int, optional) – number of bits. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zero point. Defaults to -1.

  • scheme (str, optional) – 'sym' or 'asym'. Defaults to 'asym'.

  • quantile (float, optional) – clip percentile. Defaults to 1.0.

  • dtype (str, optional) – select from 'int', 'nf4', 'fp4'. Defaults to 'int'.

  • return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.

  • full_range (bool, optional) – whether the symmetric range uses -2**(bits-1). Defaults to False.

Returns:

qdq weight.

Return type:

output
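
A group-wise sketch using the documented parameters; group_size=32 means each group of 32 weight elements shares one scale/zero-point pair, while -1 would use one pair per channel:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import quant_tensor

    weight = torch.randn(128, 128)
    # In-place group-wise quant-dequant: clone to keep the original weight.
    qdq = quant_tensor(weight.clone(), bits=4, group_size=32, scheme="asym")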

neural_compressor.torch.algorithms.weight_only.utility.search_clip(m, bits=4, group_size=32, scheme='asym', dtype='int', enable_full_range=False)[source]

Search for the best clip range of each linear layer in the current block. This is not an in-place function.

Parameters:
  • m (torch.nn.Module) – torch module.

  • bits (int, optional) – number of bits. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zero point. Defaults to 32.

  • scheme (str, optional) – 'sym' or 'asym'. Defaults to 'asym'.

  • dtype (str, optional) – select from 'int', 'nf4', 'fp4'. Defaults to 'int'.

  • enable_full_range (bool, optional) – whether the symmetric range uses -2**(bits-1). Defaults to False.

Returns:

the best clip percentile

Return type:

best_clip_ratio (float)
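
A sketch of searching the clip ratio for a single linear layer, assuming a bare torch.nn.Linear is an acceptable input module:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import search_clip

    linear = torch.nn.Linear(256, 256)
    # Returns the best clip percentile; the module itself is left unmodified.
    best_clip_ratio = search_clip(linear, bits=4, group_size=32, scheme="asym")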

neural_compressor.torch.algorithms.weight_only.utility.quant_weight_w_scale(weight, scale, zp=None, group_size=-1, dtype='int')[source]

Quantize a tensor group-wise with a given scale and zero point. This is an in-place function.

Parameters:
  • weight – input weight.

  • scale – quantization scale.

  • zp – zero point. Defaults to None.

  • group_size (int, optional) – how many elements share one scale/zero point. Defaults to -1.

  • dtype (str, optional) – data type, used for 'nf4'/'fp4'. Defaults to 'int'.

Returns:

int weight.

Return type:

output
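
A sketch with a precomputed per-group scale; the scale shape used here (one value per output channel per group) is an assumption for illustration:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import quant_weight_w_scale

    out_features, in_features, group_size = 64, 128, 32
    weight = torch.randn(out_features, in_features)
    # Assumed layout: one scale per (output channel, group) pair.
    scale = torch.rand(out_features, in_features // group_size) + 0.1
    # zp=None implies symmetric quantization; weight is modified in place.
    q_weight = quant_weight_w_scale(weight, scale, zp=None, group_size=group_size)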

neural_compressor.torch.algorithms.weight_only.utility.set_module(model, key, new_module)[source]

Set new module into model by key name.

Parameters:
  • model (torch.nn.Module) – original model

  • key (str) – module name to be replaced

  • new_module (torch.nn.Module) – new module to be inserted

neural_compressor.torch.algorithms.weight_only.utility.fetch_module(model, op_name)[source]

Get module with a given op name.

Parameters:
  • model (object) – the input model.

  • op_name (str) – name of op.

Returns:

module (object).
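
A sketch of the fetch-then-replace pattern that fetch_module and set_module support, on a toy module (the submodule name fc1 is illustrative):

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import fetch_module, set_module

    class Toy(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = torch.nn.Linear(8, 8)

    model = Toy()
    # Fetch a submodule by its dotted op name, then write a replacement
    # back under the same key.
    old_fc = fetch_module(model, "fc1")
    set_module(model, "fc1", torch.nn.Linear(8, 8))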

neural_compressor.torch.algorithms.weight_only.utility.get_absorb_layers(model, example_inputs, supported_layers=['Linear'], folding=False)[source]

Get absorb_to_layer and no_absorb_layer.

Parameters:
  • model (torch.nn.Module) – input model

  • example_inputs – example_inputs

  • supported_layers (list, optional) – supported layer types. Defaults to ['Linear'].

  • folding (bool, optional) – whether to allow self-absorption. Defaults to False.

Returns:

absorb_to_layer: dict mapping each absorb layer to the layers it absorbs, e.g. {absorb_layer: [absorbed_1, ...]}. no_absorb_layers: list of layers that cannot be absorbed.

Return type:

(absorb_to_layer, no_absorb_layers)

neural_compressor.torch.algorithms.weight_only.utility.get_module(model, key)[source]

Get module from model by key name.

Parameters:
  • model (torch.nn.Module) – original model

  • key (str) – name of the module to fetch.

neural_compressor.torch.algorithms.weight_only.utility.get_block_prefix(model)[source]

Get prefix and number of blocks.

Parameters:

model (torch.nn.Module) – input model

Returns:

block_prefix (str): name of the block list in the model. block_num (int): number of blocks in the block list.

Return type:

(block_prefix, block_num)
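
A sketch assuming a Hugging Face causal LM is available (the checkpoint name is illustrative); for a decoder-only model the prefix points at the list of decoder blocks:

    from transformers import AutoModelForCausalLM
    from neural_compressor.torch.algorithms.weight_only.utility import get_block_prefix

    model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
    # Returns the dotted name of the block list and how many blocks it holds.
    block_prefix, block_num = get_block_prefix(model)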

neural_compressor.torch.algorithms.weight_only.utility.get_example_input(dataloader, i=1)[source]

Get the example input.

Parameters:
  • dataloader (object) – calibration dataset.

  • i (int, optional) – index of the example input to fetch. Defaults to 1.

Returns:

example_inp (object).
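
A sketch with a toy calibration dataloader; per the signature, i selects which batch to return:

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from neural_compressor.torch.algorithms.weight_only.utility import get_example_input

    dataloader = DataLoader(TensorDataset(torch.randn(16, 8)), batch_size=4)
    # Fetch the batch at index 1 (the default) to use as an example input.
    example_inp = get_example_input(dataloader, i=1)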

neural_compressor.torch.algorithms.weight_only.utility.replace_forward(model)[source]

Replace the model's forward to capture the input args and kwargs of the first block for the AWQ algorithm.

Parameters:

model (torch.nn.Module) – input model.

Raises:

ValueError – raised deliberately to stop inference, avoiding execution of the rest of the model.

Returns:

model with replaced forward.

Return type:

torch.nn.Module

neural_compressor.torch.algorithms.weight_only.utility.recover_forward(model)[source]

Recover model and block forward for AWQ algorithm.

Parameters:

model (torch.nn.Module) – input model.

Returns:

model with recovered forward.

Return type:

torch.nn.Module
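
A sketch of the capture pattern replace_forward and recover_forward enable for AWQ calibration, assuming a block-structured Hugging Face model (the checkpoint name is illustrative); the ValueError is raised by design and must be caught:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from neural_compressor.torch.algorithms.weight_only.utility import (
        recover_forward,
        replace_forward,
    )

    name = "facebook/opt-125m"
    model = AutoModelForCausalLM.from_pretrained(name)
    inputs = AutoTokenizer.from_pretrained(name)("hello", return_tensors="pt")

    model = replace_forward(model)  # forward now captures the first block's args
    try:
        model(**inputs)             # raises ValueError by design (see above)
    except ValueError:
        pass
    model = recover_forward(model)  # restore the original forward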

neural_compressor.torch.algorithms.weight_only.utility.get_module_input_output(model, module_hook_config={}, dataloader=None, iters=-1, calib_func=None, input_func=None, output_func=None)[source]

A helper function to get the input and output tensors of the modules specified in module_hook_config.

Parameters:
  • model – torch model.

  • module_hook_config (dict, optional) –

    module names for which to record 'input' and/or 'output' tensors. Defaults to {}. For example:

    module_hook_config = {
        'fc1': ['output'],
        'fc2': ['input', 'output'],
    }

  • dataloader – dataloader for model input.

  • iters – iterations for inference.

  • calib_func – a custom inference function to replace dataloader and iters.

  • input_func – function to preprocess recorded inputs, to reduce memory usage.

  • output_func – function to preprocess recorded outputs, to reduce memory usage.

Returns:

recorded input_values and output_values, for example:

{
    'fc1': {'input': [], 'output': []},
}

Return type:

total_values
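
A sketch on a toy model; the assumption here is that calib_func receives the model and runs inference, as the dataloader/iters alternative would:

    import torch
    from neural_compressor.torch.algorithms.weight_only.utility import get_module_input_output

    class Toy(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = torch.nn.Linear(8, 8)
            self.fc2 = torch.nn.Linear(8, 8)

        def forward(self, x):
            return self.fc2(self.fc1(x))

    model = Toy()

    # Record fc1 outputs plus fc2 inputs and outputs during one calibration pass.
    total_values = get_module_input_output(
        model,
        module_hook_config={"fc1": ["output"], "fc2": ["input", "output"]},
        calib_func=lambda m: m(torch.randn(2, 8)),
    )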