neural_compressor.adaptor.torch_utils.weight_only

Module Contents

Functions

quantize_4bit(tensor[, quantile, data_type, return_int])

Quantize tensor to NF4/FP4 data type.

qdq_weight_asym(weight[, num_bits, quantile, return_int])

Quant and dequant tensor with asym scheme.

qdq_weight_sym(weight[, num_bits, quantile, ...])

Quant and dequant tensor with sym scheme.

qdq_weight_actor(weight, num_bits, scheme[, quantile, ...])

Quant and dequant tensor per channel.

quant_weight(weight[, num_bits, group_size, scheme, ...])

Quant and dequant tensor with group size. It is an in-place op.

search_clip(m[, num_bits, group_size, scheme, ...])

Search the best clip range for each Linear layer in the current block.

rtn_quantize(model[, num_bits, group_size, scheme, ...])

Quant the model with the round-to-nearest (RTN) method.

gptq_quantize(model[, weight_config, dataloader, ...])

Run weight-only quantization with GPTQ.

awq_quantize(model[, bits, group_size, scheme, ...])

Quant the model with the Activation-aware Weight Quantization (AWQ) method.

teq_quantize(model[, weight_config, absorb_to_layer, ...])

Run weight-only quantization with TEQ (trainable equivalent transformation).

quant_weight_w_scale(weight, scale, zp[, group_size])

Quant tensor with a pre-computed scale and zero point, with group size.

autoround_quantize(model, tokenizer[, bits, ...])

Run autoround weight-only quantization.

neural_compressor.adaptor.torch_utils.weight_only.quantize_4bit(tensor, quantile=1.0, data_type='nf4', return_int=False)[source]

Quantize tensor to NF4/FP4 data type.

Parameters:
  • tensor – input tensor

  • quantile (float, optional) – percentile of clip. Defaults to 1.0.

  • data_type (str, optional) – data type, 'nf4' or 'fp4'. Defaults to 'nf4'.

  • return_int (bool, optional) – whether to return int data. Defaults to False.

Returns:

fake quantized tensor

Return type:

q_tensor
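
A hedged usage sketch (the tensor shape and values are illustrative; data_type selects 'nf4' or 'fp4' per the summary above):

    import torch
    from neural_compressor.adaptor.torch_utils.weight_only import quantize_4bit

    weight = torch.randn(32, 32)

    # Fake quantization: values are snapped onto the NF4 grid and kept as fp32.
    q_weight = quantize_4bit(weight, quantile=1.0, data_type='nf4', return_int=False)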

neural_compressor.adaptor.torch_utils.weight_only.qdq_weight_asym(weight, num_bits=4, quantile=1.0, return_int=False)[source]

Quant and dequant tensor with asym scheme.

Parameters:
  • weight – input weight

  • num_bits (int, optional) – num_bits. Defaults to 4.

  • quantile (float, optional) – percentile of clip. Defaults to 1.0.

  • return_int (bool, optional) – Whether to return fp32 or int8/uint8 data. Defaults to False.

Returns:

qdq weight

Return type:

output
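
For intuition, a minimal sketch of asymmetric quant-dequant at a given bit width; this is illustrative only and omits the quantile clipping the library applies:

    import torch

    def asym_qdq_sketch(weight: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
        # Map [min, max] onto the unsigned grid [0, 2**num_bits - 1].
        # Assumes weight.max() > weight.min() so the scale is nonzero.
        maxq = 2**num_bits - 1
        wmin, wmax = weight.min(), weight.max()
        scale = (wmax - wmin) / maxq
        zp = torch.round(-wmin / scale)
        q = torch.clamp(torch.round(weight / scale) + zp, 0, maxq)
        return scale * (q - zp)  # dequantized ("fake quantized") weight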

neural_compressor.adaptor.torch_utils.weight_only.qdq_weight_sym(weight, num_bits=4, quantile=1.0, return_int=False, full_range=False)[source]

Quant and dequant tensor with sym scheme.

Parameters:
  • weight – input weight

  • num_bits (int, optional) – num_bits. Defaults to 4.

  • quantile (float, optional) – percentile of clip. Defaults to 1.0.

  • return_int (bool, optional) – Whether to return fp32 or int8/uint8 data. Defaults to False.

  • full_range (bool, optional) – Choose whether the symmetric range uses -2**(bits-1). For example, with 4 bits: scale = amax / 8 if full_range else amax / 7; if True, scale = -scale if abs(min) > abs(max) else scale. Defaults to False.

Returns:

qdq weight

Return type:

output
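
The full_range scale selection described above, written out as a sketch (illustrative, not the library's exact code):

    import torch

    def sym_scale_sketch(weight: torch.Tensor, num_bits: int = 4, full_range: bool = False):
        # Symmetric integer range is [-2**(bits-1), 2**(bits-1) - 1],
        # e.g. [-8, 7] for 4 bits.
        amax = weight.abs().max()
        if full_range:
            # Use the full negative extreme: scale = amax / 8 for 4 bits.
            scale = amax / 2 ** (num_bits - 1)
            # Flip the sign so the larger-magnitude side maps to -2**(bits-1).
            if weight.min().abs() > weight.max().abs():
                scale = -scale
        else:
            # Restrict to the symmetric part: scale = amax / 7 for 4 bits.
            scale = amax / (2 ** (num_bits - 1) - 1)
        return scale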

neural_compressor.adaptor.torch_utils.weight_only.qdq_weight_actor(weight, num_bits, scheme, quantile=1.0, data_type='int', return_int=False, full_range=False)[source]

Quant and dequant tensor per channel.

Parameters:
  • weight – input weight

  • num_bits (int) – number of bits.

  • scheme (str) – sym or asym.

  • quantile (float, optional) – percentile of clip. Defaults to 1.0.

  • data_type (str, optional) – select from int, nf4, fp4. Defaults to int.

  • return_int (bool, optional) – Whether to return fp32 or int8/uint8 data. Defaults to False.

  • full_range (bool, optional) – Choose whether the symmetric range uses -2**(bits-1).

Returns:

qdq weight

Return type:

output
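
Per channel here means one scale per output channel (row); a sketch of the symmetric, non-full-range case for intuition:

    import torch

    def per_channel_scale_sketch(weight: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
        # One scale per output row, reduced over the input dimension (dim=1).
        amax = weight.abs().amax(dim=1, keepdim=True)
        return amax / (2 ** (num_bits - 1) - 1)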

neural_compressor.adaptor.torch_utils.weight_only.quant_weight(weight, num_bits=4, group_size=-1, scheme='asym', quantile=1.0, data_type='int', return_int=False, full_range=False)[source]

Quant and dequant tensor with group size. It is an in-place op.

Parameters:
  • weight – input weight

  • num_bits (int, optional) – num_bits. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zp. Defaults to -1.

  • scheme (str, optional) – sym or asym. Defaults to 'asym'.

  • quantile (float, optional) – percentile of clip. Defaults to 1.0.

  • data_type (str, optional) – select from int, nf4, fp4. Defaults to int.

  • return_int (bool, optional) – Whether to return fp32 or int8/uint8 data. Defaults to False.

  • full_range (bool, optional) – Choose whether the symmetric range uses -2**(bits-1).

Returns:

qdq weight.

Return type:

output
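
A hedged usage sketch; note the in-place behavior called out above:

    import torch
    from neural_compressor.adaptor.torch_utils.weight_only import quant_weight

    weight = torch.randn(64, 128)

    # Every 32 consecutive elements share one scale/zp;
    # group_size=-1 would instead use one scale/zp per channel.
    qdq_weight = quant_weight(weight, num_bits=4, group_size=32, scheme='asym')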

neural_compressor.adaptor.torch_utils.weight_only.search_clip(m, num_bits=4, group_size=32, scheme='asym', data_type='int', enable_full_range=False)[source]

Search best clip range of each linears in current block.

Parameters:
  • m (torch.nn.Module) – torch module.

  • num_bits (int, optional) – num bits.

  • group_size (int, optional) – how many elements share one scale/zp.

  • scheme (str, optional) – sym or asym.

  • data_type (str, optional) – select from int, nf4, fp4. Defaults to int.

  • enable_full_range (bool, optional) – Choose whether the symmetric range uses -2**(bits-1).

Returns:

best percentile of clip

Return type:

best_clip_ratio (float)
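
A usage sketch; the single-layer block is illustrative:

    import torch
    from neural_compressor.adaptor.torch_utils.weight_only import search_clip

    block = torch.nn.Sequential(torch.nn.Linear(128, 128))

    # Returns the clip quantile that minimizes quantization error
    # for the Linear layers in the block.
    best_clip_ratio = search_clip(block, num_bits=4, group_size=32, scheme='asym')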

neural_compressor.adaptor.torch_utils.weight_only.rtn_quantize(model, num_bits=4, group_size=32, scheme='asym', quantile=1.0, weight_config={}, return_int=False, data_type='int', enable_full_range=False, enable_mse_search=False, group_dim=1, **kwargs)[source]

Quant the model with the round-to-nearest (RTN) method.

Parameters:
  • model – torch module

  • num_bits – num bits. Defaults to 4.

  • group_size (int, optional) – how many elements share one scale/zp. Defaults to 32.

  • scheme (str, optional) – sym or asym. Defaults to 'asym'.

  • quantile (float, optional) – percentile of clip. Defaults to 1.0.

  • data_type (str, optional) – select from int, nf4, fp4. Defaults to int.

  • weight_config (dict, optional) – specific layer-wise configurations. Defaults to {}. For example,

        weight_config = {
            'fc2': {
                'bits': 4,
                'group_size': 32,
                'scheme': 'sym',
                'gptq_perm': [1, 1, ...]  # for gptq perm
            }
        }

  • return_int (bool, optional) – Whether to return an fp32 or int32 model. Defaults to False.

  • enable_full_range (bool, optional) – Choose whether the symmetric range uses -2**(bits-1). Defaults to False.

  • enable_mse_search (bool, optional) – Whether to search the clip range. Defaults to False.

  • group_dim (int, optional) – 0 means splitting output channel, 1 means splitting input channel. Defaults to 1.

Returns:

fake quantized torch module

Return type:

model
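
A minimal RTN usage sketch (the toy model is illustrative):

    import torch
    from neural_compressor.adaptor.torch_utils.weight_only import rtn_quantize

    model = torch.nn.Sequential(
        torch.nn.Linear(128, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 128),
    )

    # Fake-quantize every Linear weight to 4 bits with 32-element groups.
    q_model = rtn_quantize(model, num_bits=4, group_size=32, scheme='asym')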

neural_compressor.adaptor.torch_utils.weight_only.gptq_quantize(model, weight_config={}, dataloader=None, nsamples=128, use_max_length=True, pad_max_length=2048, device=None, layer_wise=False, model_path=None)[source]

Run weight-only quantization with GPTQ.
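
Since the docstring is terse, here is a hedged usage sketch based only on the signature; model and calib_dataloader are assumed to exist, and the return value layout is not documented in this section:

    from neural_compressor.adaptor.torch_utils.weight_only import gptq_quantize

    # calib_dataloader is assumed to yield tokenized calibration samples.
    quantized = gptq_quantize(
        model,
        dataloader=calib_dataloader,
        nsamples=128,
        use_max_length=True,
        pad_max_length=2048,
    )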

neural_compressor.adaptor.torch_utils.weight_only.awq_quantize(model, bits=4, group_size=32, scheme='asym', weight_config={}, example_inputs=None, dataloader=None, n_samples=128, calib_func=None, enable_auto_scale=True, enable_mse_search=True, folding=False, return_int=False, enable_full_range=False, data_type='int')[source]

Quant the model with the Activation-aware Weight Quantization (AWQ) method.

Parameters:
  • model (torch.nn.Module) – torch model.

  • example_inputs – example inputs for calibration.

  • weight_config (dict, optional) – contains all info required by AWQ. Defaults to {}. For example,

        weight_config = {
            'fc2': {
                # 'absorb_layer': 'fc1',
                'bits': 4,
                'group_size': 32,
                'scheme': 'sym'
            }
        }

  • absorb_dict (dict, optional) – contains all absorb info required by AWQ. Defaults to {}. For example,

        absorb_dict = {
            # absorb_layer: absorbed_layers
            'fc1': ['fc1', 'fc2', 'fc3']
        }

    In this case, fc2 and fc3 need to share the same scale; fc1 is self-absorbed. A self-absorbed module is replaced with MulLinear, which contains torch.mul and the module.

  • n_samples – calibration sample number.

  • enable_auto_scale (bool, optional) – whether to enable scaling for salient weights. Defaults to True.

  • enable_mse_search (bool, optional) – whether to enable clipping for weights by checking MSE. Defaults to True.

  • calib_func – a custom inference function to replace dataloader and iters.

  • n_blocks – number of blocks to split the model into, to avoid OOM.

  • return_int (bool, optional) – Whether to return an fp32 or int32 model. Defaults to False.

  • enable_full_range (bool, optional) – Choose whether the symmetric range uses -2**(bits-1).

Returns:

fake quantized model

Return type:

model
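
A hedged usage sketch; model and calib_dataloader are assumptions for illustration:

    from neural_compressor.adaptor.torch_utils.weight_only import awq_quantize

    # calib_dataloader feeds the activation statistics used by the
    # activation-aware scale search.
    q_model = awq_quantize(
        model,
        bits=4,
        group_size=32,
        scheme='asym',
        dataloader=calib_dataloader,
        n_samples=128,
        enable_auto_scale=True,
        enable_mse_search=True,
    )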

neural_compressor.adaptor.torch_utils.weight_only.teq_quantize(model, weight_config={}, absorb_to_layer={}, extra_config={}, dataloader=None, calib_func=None, example_inputs=None)[source]

Run weight-only quantization with TEQ (trainable equivalent transformation).
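
A hedged usage sketch based only on the signature; the weight_config layout and the absorb_to_layer mapping are assumptions by analogy with awq_quantize's absorb_dict above:

    from neural_compressor.adaptor.torch_utils.weight_only import teq_quantize

    q_model = teq_quantize(
        model,
        weight_config={'fc2': {'bits': 4, 'group_size': 32, 'scheme': 'sym'}},
        absorb_to_layer={'fc1': ['fc2']},
        dataloader=calib_dataloader,
    )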

neural_compressor.adaptor.torch_utils.weight_only.quant_weight_w_scale(weight, scale, zp, group_size=-1)[source]

Quant tensor with a pre-computed scale and zero point, with group size.

Parameters:
  • weight – input weight

  • scale – scale

  • zp – zero point

  • group_size (int, optional) – how many elements share one scale/zp. Defaults to -1.

Returns:

int weight.

Return type:

output
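
A usage sketch; the shapes of scale and zp (one entry per 32-element group along the input dimension) are an assumption based on the group-size semantics above:

    import torch
    from neural_compressor.adaptor.torch_utils.weight_only import quant_weight_w_scale

    weight = torch.randn(64, 128)
    scale = torch.rand(64, 128 // 32) + 0.1  # avoid zero scales
    zp = torch.zeros(64, 128 // 32, dtype=torch.int32)

    int_weight = quant_weight_w_scale(weight, scale, zp, group_size=32)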

neural_compressor.adaptor.torch_utils.weight_only.autoround_quantize(model, tokenizer, bits: int = 4, group_size: int = 128, sym: bool = False, weight_config: dict = {}, enable_full_range: bool = False, batch_size: int = 8, amp: bool = True, device=None, lr_scheduler=None, dataloader=None, dataset_name: str = 'NeelNanda/pile-10k', dataset_split: str = 'train', use_quant_input: bool = True, enable_minmax_tuning: bool = True, lr: float = None, minmax_lr: float = None, low_gpu_mem_usage: bool = True, iters: int = 200, seqlen: int = 2048, n_samples: int = 512, sampler: str = 'rand', seed: int = 42, n_blocks: int = 1, gradient_accumulate_steps: int = 1, not_use_best_mse: bool = False, dynamic_max_gap: int = -1, data_type: str = 'int', scale_dtype='fp16', **kwargs)[source]

Run autoround weight-only quantization.

Parameters:
  • model – The PyTorch model to be quantized.

  • tokenizer – Tokenizer for processing input data. Temporarily set as a mandatory parameter.

  • bits (int) – Number of bits for quantization (default is 4).

  • group_size (int) – Size of the quantization group (default is 128).

  • sym (bool) – Whether symmetric quantization is to be used (default is False).

  • weight_config (dict) – Configuration for weight quantization (default is an empty dictionary). For example,

        weight_config = {
            'layer1': {  # layer name
                'data_type': 'int',
                'bits': 4,
                'group_size': 32,
                'scheme': 'asym'  # or 'sym'
            }
        }

  • enable_full_range (bool) – Whether to enable full-range quantization (default is False).

  • batch_size (int) – Batch size for training (default is 8).

  • amp (bool) – Whether to use automatic mixed precision (default is True). Automatically detected and set.

  • device – The device to be used for tuning (default is None). Automatically detected and set.

  • lr_scheduler – The learning rate scheduler to be used.

  • dataloader – The dataloader for input data (to be supported in the future).

  • dataset_name (str) – The dataset name (default is 'NeelNanda/pile-10k').

  • dataset_split (str) – The split of the dataset to be used (default is 'train').

  • use_quant_input (bool) – Whether to use quantized input data (default is True).

  • enable_minmax_tuning (bool) – Whether to enable min-max tuning (default is True).

  • lr (float) – The learning rate (default is None; 0.005 is used with the default configuration).

  • minmax_lr (float) – The learning rate for min-max tuning (default is None).

  • low_gpu_mem_usage (bool) – Whether to use low GPU memory (default is True).

  • iters (int) – Number of iterations (default is 200).

  • seqlen (int) – Length of the calibration sequence (default is 2048).

  • n_samples (int) – Number of samples (default is 512).

  • sampler (str) – The sampling method (default is 'rand').

  • seed (int) – The random seed (default is 42).

  • n_blocks (int) – Number of blocks (default is 1).

  • gradient_accumulate_steps (int) – Number of gradient accumulation steps (default is 1).

  • not_use_best_mse (bool) – Whether not to use the best MSE result (default is False).

  • dynamic_max_gap (int) – The dynamic maximum gap (default is -1).

  • data_type (str) – The data type to be used (default is 'int').

  • scale_dtype (str) – The data type of the quantization scale (default is 'fp16').

  • **kwargs – Additional keyword arguments.

Returns:

The quantized model.
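
A minimal usage sketch; the model and tokenizer names are illustrative:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from neural_compressor.adaptor.torch_utils.weight_only import autoround_quantize

    model = AutoModelForCausalLM.from_pretrained('facebook/opt-125m')
    tokenizer = AutoTokenizer.from_pretrained('facebook/opt-125m')

    q_model = autoround_quantize(
        model,
        tokenizer,
        bits=4,
        group_size=128,
        iters=200,
        seqlen=2048,
        n_samples=512,
    )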