neural_compressor.torch.algorithms.weight_only.utility

Module Contents

Functions

- quantize_4bit — Quantize tensor to NF4/FP4 data type.
- qdq_weight_asym — Quant and dequant tensor with asym scheme.
- qdq_weight_sym — Quant and dequant tensor with sym scheme.
- qdq_weight_actor — Quant and dequant tensor per channel. It is an in-place op.
- quant_tensor — Quant and dequant tensor with group size. It is an in-place function.
- search_clip — Search the best clip range of each linear in the current block. It is not an in-place function.
- quant_weight_w_scale — Quant and dequant tensor with group size using the given scale and zero point. It is an in-place function.
- set_module — Set new module into model by key name.
- fetch_module — Get module with a given op name.
- get_absorb_layers — Get absorb_to_layer and no_absorb_layer.
- get_module — Get module from model by key name.
- get_block_prefix — Get prefix and number of blocks.
- get_example_input — Get the example input.
- replace_forward — Replace forward to get the input args and kwargs of the first block for the AWQ algorithm.
- recover_forward — Recover model and block forward for the AWQ algorithm.
- get_module_input_output — A helper function to get the input and output tensors of the modules listed in module_hook_config.
- neural_compressor.torch.algorithms.weight_only.utility.quantize_4bit(tensor, quantile=1.0, dtype='nf4', return_int=False, **kwargs)[source]
Quantize tensor to NF4/FP4 data type.
- Parameters:
tensor – input tensor
quantile (float, optional) – percentile of clip. Defaults to 1.0.
dtype (str, optional) – data type. Defaults to ‘nf4’.
return_int (bool, optional) – whether to return int data. Defaults to False.
- Returns:
fake quantized tensor
- Return type:
q_tensor
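The idea behind the NF4 path can be sketched in plain Python: scale the tensor into [-1, 1] by its (optionally clipped) absolute maximum, snap each element to the nearest of 16 fixed levels, then rescale. The level table below is the NormalFloat4 table popularized by QLoRA (values rounded to 10 decimal places); `nf4_qdq` is an illustrative name, not the library API, which operates on torch tensors and also supports FP4.

```python
# The 16 NF4 levels (NormalFloat4, from the QLoRA paper), rounded here.
NF4_LEVELS = [
    -1.0, -0.6961928010, -0.5250730515, -0.3949174881,
    -0.2844413817, -0.1847734302, -0.0910500363, 0.0,
    0.0795802996, 0.1609302014, 0.2461123019, 0.3379152417,
    0.4407098293, 0.5626170039, 0.7229568362, 1.0,
]

def nf4_qdq(values, quantile=1.0):
    """Fake-quantize a list of floats to NF4: scale by the (optionally
    clipped) absolute maximum, snap each value to the nearest NF4 level,
    then rescale back."""
    scale = max(abs(v) for v in values) * quantile
    if scale == 0:
        return list(values)
    out = []
    for v in values:
        level = min(NF4_LEVELS, key=lambda l: abs(l - v / scale))
        out.append(level * scale)
    return out
```

Values whose magnitude exceeds the clipped maximum simply snap to the outermost levels, which is how the `quantile` clip takes effect.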
- neural_compressor.torch.algorithms.weight_only.utility.qdq_weight_asym(weight, bits=4, quantile=1.0, return_int=False, **kwargs)[source]
Quant and dequant tensor with asym scheme.
- Parameters:
weight – input weight
bits (int, optional) – bits. Defaults to 4.
quantile (float, optional) – percentile of clip. Defaults to 1.0.
return_int (bool, optional) – whether to return fp32 or int8/uint8 data. Defaults to False.
- Returns:
qdq weight
- Return type:
output
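The asym scheme can be sketched as follows: map [min, max] onto the unsigned range [0, 2**bits - 1] with a zero point, round, then map back. `qdq_asym` below is a plain-Python illustration of that math under simplified assumptions (per-tensor rather than per-channel, floats in a list rather than a torch tensor), not the library implementation.

```python
def qdq_asym(weight, bits=4, quantile=1.0):
    """Asymmetric fake quantization: quantize to unsigned ints in
    [0, 2**bits - 1] with a zero point, then dequantize back to float."""
    wmin = min(min(weight) * quantile, 0.0)  # keep zero representable
    wmax = max(max(weight) * quantile, 0.0)
    scale = (wmax - wmin) / (2 ** bits - 1) or 1.0
    zp = round(-wmin / scale)
    out = []
    for w in weight:
        q = max(0, min(2 ** bits - 1, round(w / scale) + zp))  # quantize + clamp
        out.append((q - zp) * scale)                           # dequantize
    return out
```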
- neural_compressor.torch.algorithms.weight_only.utility.qdq_weight_sym(weight, bits=4, quantile=1.0, return_int=False, full_range=False, **kwargs)[source]
Quant and dequant tensor with sym scheme.
- Parameters:
weight – input weight
bits (int, optional) – bits. Defaults to 4.
quantile (float, optional) – percentile of clip. Defaults to 1.0.
return_int (bool, optional) – whether to return fp32 or int8/uint8 data. Defaults to False.
full_range (bool, optional) – whether the symmetric integer range includes -2**(bits-1). For example, with 4 bits: scale = amax / 8 if full_range else amax / 7. When True, the sign of the scale is flipped where abs(max) > abs(min), so the largest-magnitude value can map onto -2**(bits-1). Defaults to False.
- Returns:
qdq weight
- Return type:
output
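The effect of `full_range` can be sketched in plain Python. Without it, the symmetric range is [-(2**(bits-1) - 1), 2**(bits-1) - 1]; with it, the extra slot -2**(bits-1) is used, and the scale sign is flipped when the largest-magnitude value is positive so that value lands on that slot. `qdq_sym` is an illustrative per-tensor sketch, not the library's per-row tensor implementation.

```python
def qdq_sym(weight, bits=4, full_range=False):
    """Symmetric fake quantization of a list of floats."""
    maxq = 2 ** (bits - 1) - 1
    minq = -(2 ** (bits - 1)) if full_range else -maxq
    amax = max(abs(w) for w in weight)
    if amax == 0:
        return list(weight)
    if full_range:
        scale = amax / 2 ** (bits - 1)
        if abs(max(weight)) > abs(min(weight)):
            scale = -scale  # flip so the positive extreme maps onto minq
    else:
        scale = amax / maxq
    # quantize with clamping, then dequantize
    return [max(minq, min(maxq, round(w / scale))) * scale for w in weight]
```

With 4 bits and `full_range=True`, a weight row whose extreme is +8.0 gets scale -1.0, so +8.0 quantizes to -8 and dequantizes back exactly; without the flip it would clamp to 7 and lose range.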
- neural_compressor.torch.algorithms.weight_only.utility.qdq_weight_actor(weight, bits, scheme, quantile=1.0, dtype='int', return_int=False, full_range=False, **kwargs)[source]
Quant and dequant tensor per channel. It is an in-place op.
- Parameters:
weight – input weight
bits (int) – number of bits
scheme (str) – sym or asym
quantile (float, optional) – percentile of clip. Defaults to 1.0.
dtype (str, optional) – select from int, nf4, fp4. Defaults to int.
return_int (bool, optional) – whether to return fp32 or int8/uint8 data. Defaults to False.
full_range (bool, optional) – whether the symmetric range uses -2**(bits-1). Defaults to False.
- Returns:
qdq weight
- Return type:
output
- neural_compressor.torch.algorithms.weight_only.utility.quant_tensor(weight, bits=4, group_size=-1, scheme='asym', quantile=1.0, dtype='int', return_int=False, full_range=False, **kwargs)[source]
Quant and dequant tensor with group size. It’s an in-place function.
- Parameters:
weight – input weight
bits (int, optional) – bits. Defaults to 4.
group_size (int, optional) – how many elements share one scale/zp. Defaults to -1.
scheme (str, optional) – sym or asym. Defaults to “asym”.
quantile (float, optional) – percentile of clip. Defaults to 1.0.
dtype (str, optional) – select from int, nf4, fp4. Defaults to int.
return_int (bool, optional) – whether to return fp32 or int8/uint8 data. Defaults to False.
full_range (bool, optional) – whether the symmetric range uses -2**(bits-1). Defaults to False.
- Returns:
qdq weight.
- Return type:
output
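Group-wise quantization splits each weight row into chunks of `group_size` elements and gives each chunk its own scale, so an outlier in one group no longer destroys the precision of the others. A plain-Python sketch (illustrative names, symmetric 4-bit quantization standing in for the configurable scheme):

```python
def qdq_group(group, bits=4):
    """Per-group symmetric fake quantization."""
    maxq = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in group) / maxq or 1.0
    return [max(-maxq, min(maxq, round(v / scale))) * scale for v in group]

def quant_tensor_groups(row, group_size=-1, bits=4):
    """Sketch of group-wise fake quantization for one weight row: quantize
    each chunk of group_size elements with its own scale; group_size=-1
    means a single group spanning the whole row."""
    if group_size == -1:
        group_size = len(row)
    out = []
    for start in range(0, len(row), group_size):
        out.extend(qdq_group(row[start:start + group_size], bits))
    return out
```

With `group_size=2`, the small values in the first group keep their own scale and survive quantization exactly; with one scale for the whole row they are rounded away to zero.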
- neural_compressor.torch.algorithms.weight_only.utility.search_clip(m, bits=4, group_size=32, scheme='asym', dtype='int', enable_full_range=False)[source]
Search the best clip range of each linear in the current block. It's not an in-place function.
- Parameters:
m (torch.nn.Module) – torch module.
bits (int, optional) – num bits.
group_size (int, optional) – how many elements share one scale/zp.
scheme (str, optional) – sym or asym.
dtype (str, optional) – select from int, nf4, fp4. Defaults to int.
enable_full_range (bool, optional) – whether the symmetric range uses -2**(bits-1). Defaults to False.
- Returns:
best percentile of clip
- Return type:
best_clip_ratio (float)
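The search idea can be sketched in plain Python: try a handful of clip ratios (quantiles) and keep the one whose fake-quantized weights are closest to the originals in squared error. The real search_clip applies this per Linear layer in the block; here a simple symmetric 4-bit quantizer stands in, and all names are illustrative.

```python
def qdq(weight, quantile=1.0, bits=4):
    """Symmetric fake quantization with the absolute max clipped by quantile."""
    maxq = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in weight) * quantile
    if amax == 0:
        return list(weight)
    scale = amax / maxq
    return [max(-maxq, min(maxq, round(w / scale))) * scale for w in weight]

def search_clip_ratio(weight, ratios=(1.0, 0.95, 0.9, 0.85, 0.8)):
    """Return the clip ratio with the lowest squared quantization error."""
    best_ratio, best_err = 1.0, float("inf")
    for ratio in ratios:
        err = sum((w - q) ** 2 for w, q in zip(weight, qdq(weight, ratio)))
        if err < best_err:
            best_ratio, best_err = ratio, err
    return best_ratio
```

A mild clip can win because shrinking the range shrinks the step size: the small error added at the clipped extreme is outweighed by finer rounding for everything else.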
- neural_compressor.torch.algorithms.weight_only.utility.quant_weight_w_scale(weight, scale, zp=None, group_size=-1, dtype='int')[source]
Quant and dequant tensor with group size using the given scale and zero point. It's an in-place function.
- Parameters:
weight – input weight
scale – scale
zp – zero point
group_size (int, optional) – how many elements share one scale/zp. Defaults to -1.
dtype (str, optional) – data type; use 'nf4' or 'fp4' for those formats. Defaults to 'int'.
- Returns:
int weight.
- Return type:
output
- neural_compressor.torch.algorithms.weight_only.utility.set_module(model, key, new_module)[source]
Set new module into model by key name.
- Parameters:
model (torch.nn.Module) – original model
key (str) – module name to be replaced
new_module (torch.nn.Module) – new module to be inserted
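The replacement pattern here is dotted-name traversal: walk the model along `key` and swap out the final attribute. A plain-Python sketch (the real set_module operates on torch.nn.Module, where submodules and ModuleList indices are also reachable via attribute access):

```python
def set_module(model, key, new_module):
    """Walk `model` along the dot-separated `key` (e.g. 'layer.fc') and
    replace the final attribute with `new_module`."""
    parts = key.split(".")
    parent = model
    for name in parts[:-1]:
        parent = getattr(parent, name)  # descend to the parent module
    setattr(parent, parts[-1], new_module)
```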
- neural_compressor.torch.algorithms.weight_only.utility.fetch_module(model, op_name)[source]
Get module with a given op name.
- Parameters:
model (object) – the input model.
op_name (str) – name of op.
- Returns:
module (object).
- neural_compressor.torch.algorithms.weight_only.utility.get_absorb_layers(model, example_inputs, supported_layers=['Linear'], folding=False)[source]
Get absorb_to_layer and no_absorb_layer.
- Parameters:
model (torch.nn.Module) – input model
example_inputs – example_inputs
supported_layers (list, optional) – supported_layers. Defaults to [‘Linear’].
folding (bool, optional) – whether allow self-absorption. Defaults to False.
- Returns:
absorb_to_layer (dict): mapping from each absorb layer to the layers it absorbs, e.g. {absorb_layer: [absorbed_1, xx]}
no_absorb_layers (list): layers that cannot be absorbed
- neural_compressor.torch.algorithms.weight_only.utility.get_module(model, key)[source]
Get module from model by key name.
- Parameters:
model (torch.nn.Module) – original model
key (str) – name of the module to fetch
- neural_compressor.torch.algorithms.weight_only.utility.get_block_prefix(model)[source]
Get prefix and number of blocks.
- Parameters:
model (torch.nn.Module) – input model
- Returns:
block_prefix (str): name of the block list in the model
block_num (int): number of blocks in the block list
- neural_compressor.torch.algorithms.weight_only.utility.get_example_input(dataloader, i=1)[source]
Get the example input.
- Parameters:
dataloader (object) – calibration dataset.
i (int, optional) – index of the batch to fetch. Defaults to 1.
- Returns:
example_inp (object).
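The likely behavior can be sketched as iterating the calibration dataloader until the `i`-th batch is reached; this is a minimal illustration of the idea, not the library implementation (which works with torch dataloaders and tensor batches):

```python
def get_example_input(dataloader, i=1):
    """Return the i-th batch from the dataloader as the example input."""
    for idx, batch in enumerate(dataloader):
        if idx == i:
            return batch
```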
- neural_compressor.torch.algorithms.weight_only.utility.replace_forward(model)[source]
Replace forward to get the input args and kwargs of first block for AWQ algorithm.
- Parameters:
model (torch.nn.Module) – input model.
- Raises:
ValueError – to avoid inference of rest parts in model.
- Returns:
model with replaced forward.
- Return type:
torch.nn.Module
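The capture-and-abort pattern behind this can be sketched in plain Python: the first block's forward is wrapped so that it records its inputs and then raises, stopping inference before the rest of the model runs (the Raises entry above is exactly this mechanism; per the docs the real implementation raises ValueError). The names below are illustrative.

```python
captured = {}

class FirstBlockCaught(Exception):
    """Raised to abort inference once the first block's inputs are recorded."""

def make_capturing_forward(original_forward):
    """Wrap a block's forward so it records args/kwargs and aborts."""
    def forward(*args, **kwargs):
        captured["args"] = args
        captured["kwargs"] = kwargs
        raise FirstBlockCaught()
    return forward
```

During calibration the caller invokes the model inside a try/except, swallows the exception, and reads the captured inputs; recover_forward then restores the original forward.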
- neural_compressor.torch.algorithms.weight_only.utility.recover_forward(model)[source]
Recover model and block forward for AWQ algorithm.
- Parameters:
model (torch.nn.Module) – input model.
- Returns:
model with recovered forward.
- Return type:
torch.nn.Module
- neural_compressor.torch.algorithms.weight_only.utility.get_module_input_output(model, module_hook_config={}, dataloader=None, iters=-1, calib_func=None, input_func=None, output_func=None)[source]
A helper function to get the input and output tensors of the modules listed in module_hook_config.
- Parameters:
model – torch model.
module_hook_config (dict, optional) – module names and which of their input/output tensors to record. Defaults to {}. For example: module_hook_config = {'fc1': ['output'], 'fc2': ['input', 'output']}.
dataloader – dataloader for model input.
iters – iterations for inference.
calib_func – a custom inference function to replace dataloader and iters.
input_func – function to preprocess recorded inputs, to reduce memory usage.
output_func – function to preprocess recorded outputs, to reduce memory usage.
- Returns:
recorded input_values and output_values, for example: {'fc1': {'input': [], 'output': []}}
- Return type:
total_values
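What gets recorded can be sketched without torch: for each module named in module_hook_config, capture its input and/or output as data flows through. The real function registers torch forward hooks; here modules are plain callables chained in sequence, and all names are illustrative.

```python
def record_io(modules, module_hook_config, batches):
    """Run `batches` through a chain of (name, fn) modules, recording the
    input/output of each module named in module_hook_config."""
    total_values = {name: {kind: [] for kind in kinds}
                    for name, kinds in module_hook_config.items()}
    for batch in batches:
        x = batch
        for name, fn in modules:
            y = fn(x)
            if name in module_hook_config:
                if "input" in module_hook_config[name]:
                    total_values[name]["input"].append(x)
                if "output" in module_hook_config[name]:
                    total_values[name]["output"].append(y)
            x = y  # feed this module's output to the next module
    return total_values
```

The optional input_func/output_func parameters correspond to transforming `x`/`y` before appending, e.g. moving tensors to CPU to save device memory.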