neural_compressor.torch.quantization.config
Intel Neural Compressor PyTorch quantization config API.
Classes
- OperatorConfig: OperatorConfig.
- TorchBaseConfig: Base config class for torch backend.
- RTNConfig: Config class for round-to-nearest weight-only quantization.
- GPTQConfig: Config class for GPTQ.
- AWQConfig: Config class for AWQ.
- TEQConfig: Config class for TEQ.
- AutoRoundConfig: Config class for AUTOROUND.
- MXQuantConfig: Config class for MX quantization.
- DynamicQuantConfig: Config class for dynamic quantization.
- StaticQuantConfig: Config class for static quantization.
- SmoothQuantConfig: Config class for smooth quantization.
- HQQConfig: Configuration class for Half-Quadratic Quantization (HQQ).
- FP8Config: Config class for FP8 quantization.
- MixedPrecisionConfig: Config class for mixed-precision.
Functions
- get_default_rtn_config: Get the default configuration of RTN.
- get_default_double_quant_config: Get the default configuration of double quant.
- get_default_gptq_config: Get the default configuration of GPTQ.
- get_default_awq_config: Generate the default AWQ config.
- get_default_teq_config: Generate the default TEQ config.
- get_default_AutoRound_config: Get the default configuration of AutoRound.
- get_default_mx_config: Generate the default MX config.
- get_default_dynamic_config: Generate the default dynamic quant config.
- get_default_static_config: Generate the default static quant config.
- get_default_sq_config: Generate the default smoothquant config.
- get_default_hqq_config: Generate the default HQQ config.
- get_default_fp8_config: Generate the default FP8 config.
- get_default_fp8_config_set: Generate the default FP8 config set.
- get_default_mixed_precision_config: Generate the default mixed-precision config.
- get_default_mixed_precision_config_set: Generate the default mixed-precision config set.
- get_all_registered_configs: Get all registered configs.
- get_woq_tuning_config: Generate the config set for WOQ tuning.
Module Contents
- class neural_compressor.torch.quantization.config.TorchBaseConfig(white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST)[source]
Base config class for torch backend.
- class neural_compressor.torch.quantization.config.RTNConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, group_dim: int = 1, use_full_range: bool = False, use_mse_search: bool = False, use_layer_wise: bool = False, model_path: str = '', use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = False, double_quant_group_size: int = 256, quant_lm_head: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for round-to-nearest weight-only quantization.
- neural_compressor.torch.quantization.config.get_default_rtn_config(processor_type: str | neural_compressor.torch.utils.ProcessorType | None = None) RTNConfig [source]
Get the default configuration of RTN.
- Parameters:
processor_type (Optional[Union[str, torch_utils.ProcessorType]], optional) – The user-specified processor type. Defaults to None.
- Returns:
the default RTN config.
- Return type:
RTNConfig
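For orientation, a minimal sketch of applying RTN to a toy model follows. The RTNConfig fields come from the signature above; the prepare/convert entry points (imported from neural_compressor.torch.quantization) and the toy model are assumptions for illustration, not a restatement of this module.

import torch

# prepare/convert are assumed to be the standard entry points exposed by
# neural_compressor.torch.quantization; the config class is documented above.
from neural_compressor.torch.quantization import prepare, convert
from neural_compressor.torch.quantization.config import RTNConfig, get_default_rtn_config

# Toy model used purely for illustration.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))

config = RTNConfig(dtype="int", bits=4, use_sym=True, group_size=32)
# config = get_default_rtn_config()  # or start from the defaults

model = prepare(model, config)  # mark eligible modules for weight-only quantization
model = convert(model)          # replace their weights with 4-bit RTN-quantized versions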
- neural_compressor.torch.quantization.config.get_default_double_quant_config(type='BNB_NF4')[source]
Get the default configuration of double quant.
- Parameters:
type (str, optional) – double quant type. Defaults to “BNB_NF4”.
- Returns:
double quant config.
- Return type:
dict
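A short usage sketch: fetch the preset parameters, or express a double-quant setup directly through the documented RTNConfig fields. The "BNB_NF4" preset name comes from the signature above; the explicit field values below are illustrative and are not a restatement of that preset.

from neural_compressor.torch.quantization.config import (
    RTNConfig,
    get_default_double_quant_config,
)

# Preset second-level ("double") quantization parameters for the BNB_NF4 scheme.
double_quant = get_default_double_quant_config(type="BNB_NF4")
print(double_quant)

# A double-quant setup can also be spelled out via the documented RTNConfig
# fields: quantize the first-level scales with 8-bit integers in groups of 256.
config = RTNConfig(
    use_double_quant=True,
    double_quant_dtype="int",
    double_quant_bits=8,
    double_quant_group_size=256,
)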
- class neural_compressor.torch.quantization.config.GPTQConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, use_mse_search: bool = False, use_layer_wise: bool = False, model_path: str = '', use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = False, double_quant_group_size: int = 256, quant_lm_head: bool = False, act_order: bool = False, percdamp: float = 0.01, block_size: int = 2048, static_groups: bool = False, true_sequential: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for GPTQ.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. https://arxiv.org/abs/2210.17323
- neural_compressor.torch.quantization.config.get_default_gptq_config(processor_type: str | neural_compressor.torch.utils.ProcessorType | None = None) GPTQConfig [source]
Get the default configuration of GPTQ.
- Parameters:
processor_type (Optional[Union[str, torch_utils.ProcessorType]], optional) – The user-specified processor type. Defaults to None.
- Returns:
the default GPTQ config.
- Return type:
GPTQConfig
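A hedged sketch of a GPTQ run. The GPTQConfig fields come from the signature above; the prepare/convert entry points, the calibration loop, and the toy model and data are assumptions for illustration.

import torch

from neural_compressor.torch.quantization import prepare, convert  # assumed entry points
from neural_compressor.torch.quantization.config import GPTQConfig

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
calib_batches = [torch.randn(4, 64) for _ in range(8)]  # stand-in calibration data

config = GPTQConfig(bits=4, group_size=32, act_order=True, percdamp=0.01)

model = prepare(model, config)   # wrap the float model so GPTQ can observe activations
for batch in calib_batches:      # GPTQ needs representative inputs
    model(batch)
model = convert(model)           # solve for the quantized weights layer by layer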
- class neural_compressor.torch.quantization.config.AWQConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, group_dim: int = 1, use_full_range: bool = False, use_mse_search: bool = False, use_layer_wise: bool = False, model_path: str = '', use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = True, double_quant_group_size: int = 256, quant_lm_head: bool = False, use_auto_scale: bool = True, use_auto_clip: bool = True, folding: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, absorb_layer_dict: dict = {}, **kwargs)[source]
Config class for AWQ.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. https://arxiv.org/abs/2306.00978
- neural_compressor.torch.quantization.config.get_default_awq_config() AWQConfig [source]
Generate the default awq config.
- Returns:
the default awq config.
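A brief construction sketch of the AWQ-specific switches; the field names come from the AWQConfig signature above, and a full run would follow the same prepare/calibrate/convert pattern as the GPTQ sketch.

from neural_compressor.torch.quantization.config import AWQConfig

config = AWQConfig(
    bits=4,
    group_size=32,
    use_auto_scale=True,  # search per-channel scales from activation statistics
    use_auto_clip=True,   # search clipping ranges for the scaled weights
    folding=False,        # whether equivalent scales are folded into adjacent layers
)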
- class neural_compressor.torch.quantization.config.TEQConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, group_dim: int = 1, use_full_range: bool = False, use_mse_search: bool = False, use_layer_wise: bool = False, use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = True, double_quant_group_size: int = 256, quant_lm_head: bool = False, absorb_to_layer: dict = {}, folding: bool = True, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for TEQ.
TEQ: Trainable Equivalent Transformation for weight-only quantization of LLMs.
- neural_compressor.torch.quantization.config.get_default_teq_config() TEQConfig [source]
Generate the default teq config.
- Returns:
the default teq config.
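A minimal construction sketch under the same caveats; TEQ learns equivalent per-channel scales before weight-only quantization, and absorb_to_layer (default {}) can pin where those scales are absorbed.

from neural_compressor.torch.quantization.config import TEQConfig

# Learn trainable equivalent scales, then quantize weights to 4-bit groups of 32;
# folding=True folds the learned transformation into neighboring layers.
config = TEQConfig(bits=4, group_size=32, folding=True)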
- class neural_compressor.torch.quantization.config.AutoRoundConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = False, group_size: int = 128, act_bits: int = 32, act_group_size: int = None, act_sym: bool = None, act_dynamic: bool = True, enable_full_range: bool = False, batch_size: int = 8, lr_scheduler=None, enable_quanted_input: bool = True, enable_minmax_tuning: bool = True, lr: float = None, minmax_lr: float = None, low_gpu_mem_usage: bool = False, iters: int = 200, seqlen: int = 2048, nsamples: int = 128, sampler: str = 'rand', seed: int = 42, nblocks: int = 1, gradient_accumulate_steps: int = 1, not_use_best_mse: bool = False, dynamic_max_gap: int = -1, scale_dtype: str = 'fp16', use_layer_wise: bool = False, quant_block_list: list = None, export_format: str = 'itrex', white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for AUTOROUND.
AUTOROUND: Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs. https://arxiv.org/abs/2309.05516 code: https://github.com/intel/auto-round
- neural_compressor.torch.quantization.config.get_default_AutoRound_config(processor_type: str | neural_compressor.torch.utils.ProcessorType | None = None) AutoRoundConfig [source]
Get the default configuration of AutoRound.
- Parameters:
processor_type (Optional[Union[str, torch_utils.ProcessorType]], optional) – The user-specified processor type. Defaults to None.
- Returns:
the default AutoRound config.
- Return type:
AutoRoundConfig
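A construction sketch of the main AutoRound knobs, taken from the signature above; driving the optimization still requires calibration data through the surrounding prepare/convert flow, which is not shown here.

from neural_compressor.torch.quantization.config import AutoRoundConfig

config = AutoRoundConfig(
    bits=4,
    group_size=128,
    iters=200,      # signed-gradient-descent steps used to tune the rounding
    seqlen=2048,    # calibration sequence length
    nsamples=128,   # number of calibration samples
    scale_dtype="fp16",
)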
- class neural_compressor.torch.quantization.config.MXQuantConfig(w_dtype: str = 'int8', act_dtype: str = 'int8', out_dtype: str = 'bfloat16', blocksize: int = 32, round_method: str = 'nearest', weight_only: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for MX quantization.
- neural_compressor.torch.quantization.config.get_default_mx_config() MXQuantConfig [source]
Generate the default mx config.
- Returns:
the default MX config.
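A construction sketch using the documented defaults; MX (microscaling) formats share one scale across each block of elements, so blocksize is the key knob.

from neural_compressor.torch.quantization.config import MXQuantConfig

config = MXQuantConfig(
    w_dtype="int8",      # element dtype for weights
    act_dtype="int8",    # element dtype for activations
    blocksize=32,        # number of elements sharing one scale
    weight_only=False,   # set True to leave activations in high precision
)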
- class neural_compressor.torch.quantization.config.DynamicQuantConfig(w_dtype: str = 'int8', w_sym: bool = True, w_granularity: str = 'per_tensor', w_algo: str = 'minmax', act_dtype: str = 'uint8', act_sym: bool = False, act_granularity: str = 'per_tensor', act_algo: str = 'kl', white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for dynamic quantization.
- neural_compressor.torch.quantization.config.get_default_dynamic_config() DynamicQuantConfig [source]
Generate the default dynamic quant config.
- Returns:
the default dynamic quant config.
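A minimal sketch, again assuming the prepare/convert entry points and a toy model; dynamic quantization derives activation scales at run time, so no calibration pass is needed in between.

import torch

from neural_compressor.torch.quantization import prepare, convert  # assumed entry points
from neural_compressor.torch.quantization.config import DynamicQuantConfig

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))

config = DynamicQuantConfig(w_dtype="int8", act_dtype="uint8")
model = prepare(model, config)
model = convert(model)  # activation scales are computed on the fly at inference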
- class neural_compressor.torch.quantization.config.StaticQuantConfig(w_dtype: str = 'int8', w_sym: bool = True, w_granularity: str = 'per_channel', w_algo: str = 'minmax', act_dtype: str = 'uint8', act_sym: bool = False, act_granularity: str = 'per_tensor', act_algo: str = 'minmax', excluded_precisions: list = [], white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, model_info: List[Tuple[str, Callable]] | None = None, **kwargs)[source]
Config class for static quantization.
- neural_compressor.torch.quantization.config.get_default_static_config() StaticQuantConfig [source]
Generate the default static quant config.
- Returns:
the default static quant config.
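A hedged sketch of static quantization, which does need an offline calibration pass; the prepare/convert entry points, the example_inputs keyword, and the toy model and data are assumptions for illustration.

import torch

from neural_compressor.torch.quantization import prepare, convert  # assumed entry points
from neural_compressor.torch.quantization.config import StaticQuantConfig

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))
calib_batches = [torch.randn(4, 64) for _ in range(8)]  # stand-in calibration data

config = StaticQuantConfig(w_granularity="per_channel", act_algo="minmax")
model = prepare(model, config, example_inputs=calib_batches[0])  # insert observers
for batch in calib_batches:   # collect activation statistics offline
    model(batch)
model = convert(model)        # freeze scales/zero-points into the quantized model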
- class neural_compressor.torch.quantization.config.SmoothQuantConfig(w_dtype: str = 'int8', w_sym: bool = True, w_granularity: str = 'per_channel', w_algo: str = 'minmax', act_dtype: str = 'uint8', act_sym: bool = False, act_granularity: str = 'per_tensor', act_algo: str = 'minmax', excluded_precisions: list = [], alpha: float = 0.5, folding: bool = False, scale_sharing: bool = False, init_alpha: float = 0.5, alpha_min: float = 0.0, alpha_max: float = 1.0, alpha_step: float = 0.1, shared_criterion: str = 'max', do_blockwise: bool = False, auto_alpha_args: dict = None, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for smooth quantization.
- neural_compressor.torch.quantization.config.get_default_sq_config() SmoothQuantConfig [source]
Generate the default smoothquant config.
- Returns:
the default smoothquant config.
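A construction sketch; alpha balances how much quantization difficulty is migrated between activations and weights (0.5 is the documented default), and the alpha_min/alpha_max/alpha_step fields describe the search range when alpha is tuned automatically.

from neural_compressor.torch.quantization.config import SmoothQuantConfig

config = SmoothQuantConfig(
    alpha=0.5,        # migration strength between activation and weight difficulty
    folding=False,    # whether smoothing scales are folded into preceding layers
    act_algo="minmax",
)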
- class neural_compressor.torch.quantization.config.HQQConfig(dtype: str = 'int', bits: int = 4, group_size: int = 64, quant_zero: bool = True, quant_scale: bool = False, scale_quant_group_size: int = 128, quant_lm_head: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Configuration class for Half-Quadratic Quantization (HQQ).
HQQ is a quantization algorithm that reduces the precision of weights and activations in neural networks. For more details, refer to the blog: https://mobiusml.github.io/hqq_blog/ and the code: https://github.com/mobiusml/hqq
- neural_compressor.torch.quantization.config.get_default_hqq_config() HQQConfig [source]
Generate the default HQQ config.
- Returns:
the default HQQ config.
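A construction sketch using the documented fields; HQQ is calibration-free, and quant_zero/quant_scale optionally quantize the zero-points and scales themselves for extra memory savings.

from neural_compressor.torch.quantization.config import HQQConfig

config = HQQConfig(
    bits=4,
    group_size=64,
    quant_zero=True,    # also quantize the per-group zero-points
    quant_scale=False,  # keep the per-group scales in higher precision
)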
- class neural_compressor.torch.quantization.config.FP8Config(dump_stats_path: str = './hqt_output/measure', fp8_config: str = 'E4M3', hp_dtype: str = 'bf16', blocklist: dict = {'names': [], 'types': ()}, allowlist: dict = {'names': [], 'types': FP8_WHITE_LIST}, mode: str = 'AUTO', scale_method: str = 'maxabs_hw', scale_params: dict = {}, observer: str = 'maxabs', mod_dict: dict = {}, measure_exclude: str = 'OUTPUT', fake_quant: bool = False, scale_format: str = 'const', **kwargs)[source]
Config class for FP8 quantization.
- neural_compressor.torch.quantization.config.get_default_fp8_config() FP8Config [source]
Generate the default fp8 config.
- Returns:
the default fp8 config.
- neural_compressor.torch.quantization.config.get_default_fp8_config_set() FP8Config [source]
Generate the default fp8 config set.
- Returns:
the default fp8 config.
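A construction sketch that restates the documented defaults; the association with Intel Gaudi (HPU) measurement output is an inference from the hqt_output path and should be treated as an assumption.

from neural_compressor.torch.quantization.config import FP8Config

config = FP8Config(
    fp8_config="E4M3",                        # FP8 format used for quantized ops
    hp_dtype="bf16",                          # high-precision dtype for excluded ops
    scale_method="maxabs_hw",                 # how scales are derived from measurements
    dump_stats_path="./hqt_output/measure",   # where calibration statistics are written
)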
- class neural_compressor.torch.quantization.config.MixedPrecisionConfig(dtype: str | List[str] = 'fp16', white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for mixed-precision.
- neural_compressor.torch.quantization.config.get_default_mixed_precision_config() MixedPrecisionConfig [source]
Generate the default mixed-precision config.
- Returns:
the default mixed-precision config.
- neural_compressor.torch.quantization.config.get_default_mixed_precision_config_set() MixedPrecisionConfig [source]
Generate the default mixed-precision config set.
- Returns:
the default mixed-precision config.
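A construction sketch; dtype accepts a single dtype string or a list of them, per the signature above (the bf16 entry below is illustrative).

from neural_compressor.torch.quantization.config import MixedPrecisionConfig

config = MixedPrecisionConfig(dtype="fp16")            # cast eligible ops to fp16
multi = MixedPrecisionConfig(dtype=["fp16", "bf16"])   # or offer several target dtypes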
- neural_compressor.torch.quantization.config.get_all_registered_configs() Dict[str, neural_compressor.common.base_config.BaseConfig] [source]
Get all registered configs.
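A small inspection sketch; the exact key names in the returned mapping (e.g. algorithm names such as "rtn") are an assumption, but iterating the dict itself follows directly from the documented return type.

from neural_compressor.torch.quantization.config import get_all_registered_configs

registered = get_all_registered_configs()
for name, cfg in registered.items():   # e.g. "rtn" -> its registered config (assumed key naming)
    print(name, cfg)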