neural_compressor.adaptor.torch_utils.weight_only
Module Contents
Functions
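| quantize_4bit | Quantize tensor to NF4/FP4 data type. |
| qdq_weight_asym | Quantize and dequantize a tensor with the asymmetric scheme. |
| qdq_weight_sym | Quantize and dequantize a tensor with the symmetric scheme. |
| qdq_weight_actor | Quantize and dequantize a tensor per channel. |
| quant_weight | Quantize and dequantize a tensor with group size. It is an in-place op. |
| search_clip | Search for the best clip range of each linear layer in the current block. |
| rtn_quantize | Quantize the model with the round-to-nearest (RTN) method. |
| gptq_quantize | Run weight-only quantization with GPTQ. |
| awq_quantize | Quantize the model with the Activation-aware Weight Quantization (AWQ) method. |
| teq_quantize | Run weight-only quantization with TEQ. |
| quant_weight_w_scale | Quantize and dequantize a tensor with group size. |
| autoround_quantize | Run autoround weight-only quantization. |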
- neural_compressor.adaptor.torch_utils.weight_only.quantize_4bit(tensor, quantile=1.0, data_type='nf4', return_int=False)
Quantize tensor to NF4/FP4 data type.
- Parameters:
tensor – input tensor
quantile (float, optional) – percentile of clip. Defaults to 1.0.
data_type (str, optional) – data type. Defaults to ‘nf4’.
return_int (bool, optional) – whether to return int data. Defaults to False.
- Returns:
fake quantized tensor
- Return type:
q_tensor
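The idea behind 4-bit codebook quantization can be sketched in pure Python. This is a minimal illustration, not the library's implementation: the function name `quantize_nf4_sketch`, the index ordering, and the codebook (NF4 levels as published with QLoRA, rounded to four decimals) are assumptions made for the example.

```python
# Assumed NF4 codebook: 16 levels in [-1, 1], rounded to 4 decimals.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_nf4_sketch(values, quantile=1.0):
    """Map each value to its nearest codebook level.

    Returns (indices, dequantized): the 4-bit codes and the fake-quantized
    floats obtained by scaling the chosen levels back.
    """
    absmax = max(abs(v) for v in values) * quantile
    scale = absmax if absmax != 0 else 1.0  # avoid division by zero
    indices = []
    for v in values:
        # Normalize into [-1, 1]; values beyond the clipped range saturate.
        normalized = max(-1.0, min(1.0, v / scale))
        nearest = min(
            range(len(NF4_LEVELS)),
            key=lambda i: abs(NF4_LEVELS[i] - normalized),
        )
        indices.append(nearest)
    dequantized = [NF4_LEVELS[i] * scale for i in indices]
    return indices, dequantized


indices, fake_quant = quantize_nf4_sketch([0.5, -1.0, 0.0, 2.0])
```

Here `indices` plays the role of the data returned when `return_int=True`, and `fake_quant` the role of the default fake-quantized tensor.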
- neural_compressor.adaptor.torch_utils.weight_only.qdq_weight_asym(weight, num_bits=4, quantile=1.0, return_int=False)
Quantize and dequantize a tensor with the asymmetric scheme.
- Parameters:
weight – input weight
num_bits (int, optional) – num_bits. Defaults to 4.
quantile (float, optional) – percentile of clip. Defaults to 1.0.
return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.
- Returns:
qdq weight
- Return type:
output
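A minimal sketch of asymmetric quantize/dequantize arithmetic on a flat list, assuming the usual min/max formulation (scale from the value range, zero point from the minimum). The name `qdq_asym_sketch` and the exact rounding and clamping order are assumptions, not the library's code.

```python
def qdq_asym_sketch(values, num_bits=4, quantile=1.0):
    """Asymmetric fake quantization of a flat list of floats.

    Returns (ints, dequantized, scale, zero_point).
    """
    maxq = 2 ** num_bits - 1  # e.g. 15 for 4 bits
    # Include 0.0 in the range so that zero is exactly representable.
    wmax = max(max(values), 0.0) * quantile
    wmin = min(min(values), 0.0) * quantile
    if wmax == wmin:  # all-zero input: fall back to a dummy range
        wmin, wmax = -1.0, 1.0
    scale = (wmax - wmin) / maxq
    zero_point = round(-wmin / scale)
    ints = [max(0, min(maxq, round(v / scale) + zero_point)) for v in values]
    dequantized = [scale * (q - zero_point) for q in ints]
    return ints, dequantized, scale, zero_point


ints, dequantized, scale, zero_point = qdq_asym_sketch([-1.0, 0.0, 0.52, 2.0])
```

With `quantile` below 1.0 the range shrinks, so outliers saturate at `0` or `maxq` while the remaining values get a finer step.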
- neural_compressor.adaptor.torch_utils.weight_only.qdq_weight_sym(weight, num_bits=4, quantile=1.0, return_int=False, full_range=False)
Quantize and dequantize a tensor with the symmetric scheme.
- Parameters:
weight – input weight
num_bits (int, optional) – num_bits. Defaults to 4.
quantile (float, optional) – percentile of clip. Defaults to 1.0.
return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.
full_range (bool, optional) –
whether the symmetric range includes -2**(bits-1). For example, with 4 bits,
scale = amax / 8 if full_range else amax / 7. If True, scale = -scale if abs(min) > abs(max) else scale. Defaults to False.
- Returns:
qdq weight
- Return type:
output
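The effect of `full_range` on the scale can be sketched as follows. This is an illustrative pure-Python version with a hypothetical name, `qdq_sym_sketch`; it covers only the `amax / 8` versus `amax / 7` choice for 4 bits and deliberately omits the sign flip of the scale described above.

```python
def qdq_sym_sketch(values, num_bits=4, quantile=1.0, full_range=False):
    """Symmetric fake quantization of a flat list of floats.

    Returns (ints, dequantized). The scale sign flip is not modeled.
    """
    maxq = 2 ** (num_bits - 1) - 1   # 7 for 4 bits
    minq = -(2 ** (num_bits - 1))    # -8 for 4 bits
    amax = max(abs(v) for v in values) * quantile
    if amax == 0:
        amax = 1.0  # avoid division by zero
    # full_range divides by 8 (for 4 bits) so that -8 becomes reachable.
    scale = amax / (-minq) if full_range else amax / maxq
    ints = [max(minq, min(maxq, round(v / scale))) for v in values]
    dequantized = [q * scale for q in ints]
    return ints, dequantized


weights = [-2.0, 0.5, 1.2]
ints_default, _ = qdq_sym_sketch(weights)
ints_full, dq_full = qdq_sym_sketch(weights, full_range=True)
```

In this example the default range maps the largest-magnitude weight to -7, while `full_range=True` maps it to -8 and uses a slightly finer step.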
- neural_compressor.adaptor.torch_utils.weight_only.qdq_weight_actor(weight, num_bits, scheme, quantile=1.0, data_type='int', return_int=False, full_range=False)
Quantize and dequantize a tensor per channel.
- Parameters:
weight – input weight
num_bits (int) – number of bits.
scheme (str) – sym or asym.
quantile (float, optional) – percentile of clip. Defaults to 1.0.
data_type (str, optional) – select from int, nf4, fp4. Defaults to int.
return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.
full_range (bool, optional) – whether the symmetric range includes -2**(bits-1). Defaults to False.
- Returns:
qdq weight
- Return type:
output
- neural_compressor.adaptor.torch_utils.weight_only.quant_weight(weight, num_bits=4, group_size=-1, scheme='asym', quantile=1.0, data_type='int', return_int=False, full_range=False)
Quantize and dequantize a tensor with group size. It is an in-place op.
- Parameters:
weight – input weight
num_bits (int, optional) – num_bits. Defaults to 4.
group_size (int, optional) – how many elements share one scale/zp. Defaults to -1.
scheme (str, optional) – sym or asym. Defaults to “asym”.
quantile (float, optional) – percentile of clip. Defaults to 1.0.
data_type (str, optional) – select from int, nf4, fp4. Defaults to int.
return_int (bool, optional) – whether to return int8/uint8 data instead of fp32. Defaults to False.
full_range (bool, optional) – whether the symmetric range includes -2**(bits-1). Defaults to False.
- Returns:
qdq weight.
- Return type:
output
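To show why `group_size` matters, here is a small sketch that quantizes one weight row either with a single scale or with one scale per group. The helper names are hypothetical, the symmetric 4-bit scheme is chosen only to keep the example short, and unlike the in-place op described above the sketch returns a new list.

```python
def qdq_sym(values, num_bits=4):
    """Symmetric fake quantization with one shared scale."""
    maxq = 2 ** (num_bits - 1) - 1
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / maxq
    return [max(-maxq - 1, min(maxq, round(v / scale))) * scale for v in values]


def quant_row_by_group(row, group_size=-1, num_bits=4):
    """Quantize a row so that every `group_size` elements share one scale.

    group_size=-1 means the whole row shares a single scale. A trailing
    group shorter than group_size is simply quantized on its own here.
    """
    if group_size == -1 or group_size >= len(row):
        return qdq_sym(row, num_bits)
    out = []
    for start in range(0, len(row), group_size):
        out.extend(qdq_sym(row[start:start + group_size], num_bits))
    return out


row = [0.1, 0.2, 0.3, 0.4, 10.0, 20.0, 30.0, 40.0]
per_channel = quant_row_by_group(row, group_size=-1)
per_group = quant_row_by_group(row, group_size=4)
```

With one scale for the whole row, the four small weights all collapse to zero because the large values dominate the scale; with `group_size=4` they keep their own scale and survive quantization.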
- neural_compressor.adaptor.torch_utils.weight_only.search_clip(m, num_bits=4, group_size=32, scheme='asym', data_type='int', enable_full_range=False)
Search for the best clip range of each linear layer in the current block.
- Parameters:
m (torch.nn.Module) – torch module.
num_bits (int, optional) – num bits.
group_size (int, optional) – how many elements share one scale/zp.
scheme (str, optional) – sym or asym.
data_type (str, optional) – select from int, nf4, fp4. Defaults to int.
enable_full_range (bool, optional) – whether the symmetric range includes -2**(bits-1). Defaults to False.
- Returns:
best percentile of clip
- Return type:
best_clip_ratio (float)
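A clip-range search of this kind can be sketched as a grid search over candidate clip ratios that keeps the ratio with the lowest quantization error. Everything below is an assumption made for illustration: the helper names, the grid (`n_grid=200`, `max_shrink=0.2`), the symmetric 4-bit quantizer, and the use of weight MSE as the error.

```python
def qdq_sym(values, quantile=1.0, num_bits=4):
    """Symmetric fake quantization with the range clipped by `quantile`."""
    maxq = 2 ** (num_bits - 1) - 1
    amax = (max(abs(v) for v in values) * quantile) or 1.0
    scale = amax / maxq
    return [max(-maxq - 1, min(maxq, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_clip_sketch(values, n_grid=200, max_shrink=0.2):
    """Try clip ratios from 1.0 downwards and keep the lowest-error one."""
    best_ratio, best_error = 1.0, float("inf")
    for step in range(int(max_shrink * n_grid)):
        ratio = 1.0 - step / n_grid
        error = mse(values, qdq_sym(values, quantile=ratio))
        if error < best_error:
            best_ratio, best_error = ratio, error
    return best_ratio, best_error


# Mostly small weights plus one larger value at the end.
weights = [0.03 * i - 0.5 for i in range(32)] + [1.0]
best_ratio, best_error = search_clip_sketch(weights)
baseline = mse(weights, qdq_sym(weights, quantile=1.0))
```

The returned ratio corresponds to the `quantile` argument of the quantization helpers: a value below 1.0 means that clipping the largest weights reduced the overall error.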
- neural_compressor.adaptor.torch_utils.weight_only.rtn_quantize(model, num_bits=4, group_size=32, scheme='asym', quantile=1.0, weight_config={}, return_int=False, data_type='int', enable_full_range=False, enable_mse_search=False, group_dim=1, **kwargs)
Quantize the model with the round-to-nearest (RTN) method.
- Parameters:
model – torch module
num_bits – num bits. Defaults to 4.
group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.
scheme (str, optional) – sym or asym. Defaults to “asym”.
quantile (float, optional) – percentile of clip. Defaults to 1.0.
data_type (str, optional) – select from int, nf4, fp4. Defaults to int.
weight_config (dict, optional) –
layer-wise configurations that override the global settings. Defaults to {}. For example,
    weight_config = {
        'fc2': {
            'bits': 4,
            'group_size': 32,
            'scheme': 'sym',
            'gptq_perm': [1, 1, …],  # for gptq perm
        }
    }
return_int (bool, optional) – whether to return an int32 model instead of an fp32 model. Defaults to False.
enable_full_range (bool, optional) – whether the symmetric range includes -2**(bits-1). Defaults to False.
enable_mse_search (bool, optional) – whether to search for the clip range. Defaults to False.
group_dim (int, optional) – 0 means splitting output channel, 1 means splitting input channel. Defaults to 1.
- Returns:
fake quantized torch module
- Return type:
model
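How a layer-wise `weight_config` might combine with the global arguments can be sketched as a dictionary merge. The helper `resolve_layer_config` and the fallback behaviour are hypothetical; the sketch only illustrates the shape of the configuration, not how the library resolves it internally.

```python
# Global settings, mirroring the defaults in the signature above.
DEFAULTS = {"bits": 4, "group_size": 32, "scheme": "asym"}


def resolve_layer_config(layer_name, weight_config, defaults=DEFAULTS):
    """Return the settings for one layer.

    Layers listed in weight_config override the global defaults;
    all other layers use the defaults unchanged.
    """
    resolved = dict(defaults)
    resolved.update(weight_config.get(layer_name, {}))
    return resolved


weight_config = {
    "fc2": {"bits": 4, "group_size": 32, "scheme": "sym"},
}

fc1_cfg = resolve_layer_config("fc1", weight_config)  # not listed: defaults
fc2_cfg = resolve_layer_config("fc2", weight_config)  # listed: overridden
```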
- neural_compressor.adaptor.torch_utils.weight_only.gptq_quantize(model, weight_config={}, dataloader=None, nsamples=128, use_max_length=True, pad_max_length=2048, device=None, layer_wise=False, model_path=None)
Run weight-only quantization with GPTQ.
- neural_compressor.adaptor.torch_utils.weight_only.awq_quantize(model, bits=4, group_size=32, scheme='asym', weight_config={}, example_inputs=None, dataloader=None, n_samples=128, calib_func=None, enable_auto_scale=True, enable_mse_search=True, folding=False, return_int=False, enable_full_range=False, data_type='int')
Quantize the model with the Activation-aware Weight Quantization (AWQ) method.
- Parameters:
model (torch.nn.Module) – torch model.
example_inputs – example_inputs.
weight_config (dict, optional) –
contains all info required by AWQ. Defaults to {}. For example,
    weight_config = {
        'fc2': {
            # 'absorb_layer': 'fc1',
            'bits': 4,
            'group_size': 32,
            'scheme': 'sym',
        }
    }
absorb_dict (dict, optional) –
contains all absorb info required by AWQ. Defaults to {}. For example,
    absorb_dict = {
        # 'absorb_layer': absorbed_layer
        'fc1': ['fc1', 'fc2', 'fc3'],
    }
    # In this case, fc2 and fc3 need to share the same scale. fc1 is self absorbed.
    # A self-absorbed module is replaced with MulLinear, which contains torch.mul and the module.
n_samples – calibration sample number.
enable_auto_scale (bool, optional) – whether to enable scaling for salient weights. Defaults to True.
enable_mse_search (bool, optional) – whether to clip weights by checking the MSE. Defaults to True.
calib_func – a custom inference function to replace dataloader and iters.
n_blocks – number of blocks to split the model into to avoid OOM.
return_int (bool, optional) – whether to return an int32 model instead of an fp32 model. Defaults to False.
enable_full_range (bool, optional) – whether the symmetric range includes -2**(bits-1). Defaults to False.
- Returns:
fake quantized model
- Return type:
model
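The reason a per-channel scale can be absorbed without changing the layer output can be sketched numerically: multiplying each input channel of the weight by a scale and dividing the matching activation by the same scale leaves the product unchanged before quantization. The helper names and the scale values below are made up for illustration; how the library actually chooses the scales is not shown.

```python
def linear(x, weight_rows):
    """Plain matrix-vector product: one output per weight row."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight_rows]


def apply_channel_scale(x, weight_rows, scales):
    """Scale weight input channels up and activations down by `scales`."""
    scaled_w = [[w * s for w, s in zip(row, scales)] for row in weight_rows]
    scaled_x = [xi / s for xi, s in zip(x, scales)]
    return scaled_x, scaled_w


x = [4.0, 0.5, 0.25]                 # channel 0 has the largest activation
weight = [[0.1, -0.2, 0.3], [0.05, 0.4, -0.6]]
scales = [2.0, 1.0, 0.5]             # hypothetical per-channel scales

scaled_x, scaled_w = apply_channel_scale(x, weight, scales)
reference = linear(x, weight)
rescaled = linear(scaled_x, scaled_w)
```

The division of the activations is what an absorbing layer (or a module built around `torch.mul`, as in the note on MulLinear above) has to carry, while the scaled weight is the tensor that gets quantized.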
- neural_compressor.adaptor.torch_utils.weight_only.teq_quantize(model, weight_config={}, absorb_to_layer={}, extra_config={}, dataloader=None, calib_func=None, example_inputs=None)
Run weight-only quantization with TEQ.
- neural_compressor.adaptor.torch_utils.weight_only.quant_weight_w_scale(weight, scale, zp, group_size=-1)
Quantize and dequantize a tensor with group size.
- Parameters:
weight – input weight
scale – scale
zp – zero point
group_size (int, optional) – how many elements share one scale/zp. Defaults to -1.
- Returns:
int weight.
- Return type:
output
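Since the Returns field above names an int weight, the following sketch shows only the quantize step with a precomputed scale and zero point per group. The function name, the handling of a missing zero point, and the absence of clamping are assumptions made for the example.

```python
def quant_with_scale_sketch(row, scales, zero_points=None, group_size=-1):
    """Quantize a row with given per-group scales and zero points.

    scales[g] and zero_points[g] belong to the g-th group of
    `group_size` consecutive elements; group_size=-1 means one group.
    """
    if group_size == -1:
        group_size = len(row)
    ints = []
    for g, start in enumerate(range(0, len(row), group_size)):
        zp = 0 if zero_points is None else zero_points[g]
        for v in row[start:start + group_size]:
            ints.append(round(v / scales[g] + zp))
    return ints


row = [0.0, 0.5, 2.0, 4.0]
whole_row = quant_with_scale_sketch(row, scales=[0.5], zero_points=[1])
grouped = quant_with_scale_sketch(
    row, scales=[0.5, 2.0], zero_points=[0, 1], group_size=2
)
```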
- neural_compressor.adaptor.torch_utils.weight_only.autoround_quantize(model, tokenizer, bits: int = 4, group_size: int = 128, sym: bool = False, weight_config: dict = {}, enable_full_range: bool = False, batch_size: int = 8, amp: bool = True, device=None, lr_scheduler=None, dataloader=None, dataset_name: str = 'NeelNanda/pile-10k', dataset_split: str = 'train', use_quant_input: bool = True, enable_minmax_tuning: bool = True, lr: float = None, minmax_lr: float = None, low_gpu_mem_usage: bool = True, iters: int = 200, seqlen: int = 2048, n_samples: int = 512, sampler: str = 'rand', seed: int = 42, n_blocks: int = 1, gradient_accumulate_steps: int = 1, not_use_best_mse: bool = False, dynamic_max_gap: int = -1, data_type: str = 'int', scale_dtype='fp16', **kwargs)
Run autoround weight-only quantization.
- Parameters:
model – The PyTorch model to be quantized.
tokenizer – Tokenizer for processing input data. Temporarily set as a mandatory parameter.
bits (int) – Number of bits for quantization. Defaults to 4.
group_size (int) – Size of the quantization group. Defaults to 128.
sym (bool) – Whether symmetric quantization is used. Defaults to False.
weight_config (dict) –
Configuration for weight quantization. Defaults to {}. For example,
    weight_config = {
        'layer1': {  # layer name
            'data_type': 'int',
            'bits': 4,
            'group_size': 32,
            'scheme': 'asym',  # or 'sym'
        }
    }
enable_full_range (bool) – Whether to enable full range quantization. Defaults to False.
batch_size (int) – Batch size for training (listed as bs in the docstring). Defaults to 8.
amp (bool) – Whether to use automatic mixed precision. Defaults to True. Automatically detected and set.
device – The device to be used for tuning. Defaults to None. Automatically detected and set.
lr_scheduler – The learning rate scheduler to be used.
dataloader – The dataloader for input data (to be supported in future).
dataset_name (str) – The default dataset name. Defaults to “NeelNanda/pile-10k”.
dataset_split (str) – The split of the dataset to be used. Defaults to “train”.
use_quant_input (bool) – Whether to use quantized input data. Defaults to True.
enable_minmax_tuning (bool) – Whether to enable min-max tuning. Defaults to True.
lr (float) – The learning rate. The signature default is None; the docstring lists 0.005.
minmax_lr (float) – The learning rate for min-max tuning. Defaults to None.
low_gpu_mem_usage (bool) – Whether to use low GPU memory. Defaults to True.
iters (int) – Number of iterations. Defaults to 200.
seqlen (int) – Length of the sequence. Defaults to 2048.
n_samples (int) – Number of samples. Defaults to 512.
sampler (str) – The sampling method. Defaults to “rand”.
seed (int) – The random seed. Defaults to 42.
n_blocks (int) – Number of blocks. Defaults to 1.
gradient_accumulate_steps (int) – Number of gradient accumulation steps. Defaults to 1.
not_use_best_mse (bool) – Whether to use mean squared error. Defaults to False.
dynamic_max_gap (int) – The dynamic maximum gap. Defaults to -1.
data_type (str) – The data type to be used. Defaults to “int”.
**kwargs – Additional keyword arguments.
- Returns:
The quantized model.