neural_compressor.torch.quantization.config
Intel Neural Compressor PyTorch quantization config API.
Classes
- OperatorConfig: OperatorConfig.
- TorchBaseConfig: Base config class for torch backend.
- RTNConfig: Config class for round-to-nearest weight-only quantization.
- GPTQConfig: Config class for GPTQ.
- AWQConfig: Config class for AWQ.
- TEQConfig: Config class for TEQ.
- AutoRoundConfig: Config class for AUTOROUND.
- MXQuantConfig: Config class for MX quantization.
- DynamicQuantConfig: Config class for dynamic quantization.
- INT8StaticQuantConfig: Config class for static quantization.
- SmoothQuantConfig: Config class for smooth quantization.
- HQQConfig: Configuration class for Half-Quadratic Quantization (HQQ).
- FP8Config: Config class for FP8 quantization.
- HybridGPTQConfig: Config class for Hybrid Precision GPTQ quantization.
- MixedPrecisionConfig: Config class for mixed-precision.
Functions
- get_default_rtn_config: Get the default configuration of RTN.
- get_default_double_quant_config: Get the default configuration of double quant.
- get_default_gptq_config: Get the default configuration of GPTQ.
- get_default_awq_config: Generate the default awq config.
- get_default_teq_config: Generate the default teq config.
- get_default_AutoRound_config: Get the default configuration of AutoRound.
- get_default_mx_config: Generate the default mx config.
- get_default_dynamic_config: Generate the default dynamic quant config.
- get_default_static_config: Generate the default static quant config.
- get_default_sq_config: Generate the default smoothquant config.
- get_default_hqq_config: Generate the default HQQ config.
- get_default_fp8_config: Generate the default fp8 config.
- get_default_fp8_config_set: Generate the default fp8 config set.
- get_default_mixed_precision_config: Generate the default mixed-precision config.
- get_default_mixed_precision_config_set: Generate the default mixed-precision config set.
- get_all_registered_configs: Get all registered configs.
- get_woq_tuning_config: Generate the config set for WOQ tuning.
- Get default module mapping for quantization aware training.
Module Contents
- class neural_compressor.torch.quantization.config.TorchBaseConfig(white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST)[source]
Base config class for torch backend.
- class neural_compressor.torch.quantization.config.RTNConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, group_dim: int = 1, use_full_range: bool = False, use_mse_search: bool = False, use_layer_wise: bool = False, model_path: str = '', use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = False, double_quant_group_size: int = 256, quant_lm_head: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for round-to-nearest weight-only quantization.
- neural_compressor.torch.quantization.config.get_default_rtn_config(processor_type: str | neural_compressor.torch.utils.ProcessorType | None = None) RTNConfig[source]
Get the default configuration of RTN.
- Parameters:
processor_type (Optional[Union[str, torch_utils.ProcessorType]], optional) – The user-specified processor type. Defaults to None.
- Returns:
the default RTN config.
- Return type:
RTNConfig
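As an illustration, here is a minimal sketch that builds an RTN config from the constructor arguments documented above; the prepare/convert entry points and the toy model are assumptions from the wider neural_compressor.torch.quantization API, not part of this module.

```python
import torch
from neural_compressor.torch.quantization.config import RTNConfig, get_default_rtn_config
# prepare/convert are assumed to come from the top-level quantization API.
from neural_compressor.torch.quantization import prepare, convert

# Default RTN config, optionally specialized per processor type.
rtn_cfg = get_default_rtn_config()

# Custom 4-bit symmetric RTN config with per-group (size 32) scales.
rtn_cfg = RTNConfig(dtype="int", bits=4, use_sym=True, group_size=32)

model = torch.nn.Sequential(torch.nn.Linear(64, 64))  # placeholder model
model = prepare(model, rtn_cfg)  # attach the quantization config
model = convert(model)           # apply round-to-nearest weight quantization
```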
- neural_compressor.torch.quantization.config.get_default_double_quant_config(type='BNB_NF4')[source]
Get the default configuration of double quant.
- Parameters:
type (str, optional) – double quant type. Defaults to “BNB_NF4”.
- Returns:
double quant config.
- Return type:
dict
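A short sketch of retrieving the double-quant defaults with the documented "BNB_NF4" type; printing the returned dict is only for inspection.

```python
from neural_compressor.torch.quantization.config import get_default_double_quant_config

# The defaults are returned as a plain dict of double-quant settings
# (dtype, bits, symmetry, group size, ...).
dq_settings = get_default_double_quant_config(type="BNB_NF4")
print(dq_settings)
```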
- class neural_compressor.torch.quantization.config.GPTQConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, use_mse_search: bool = False, use_layer_wise: bool = False, use_block_wise: bool = False, model_path: str = '', use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = False, double_quant_group_size: int = 256, quant_lm_head: bool = False, act_order: bool = False, hybrid_order: bool = False, fp8_aware: bool = False, percdamp: float = 0.01, block_size: int = 2048, static_groups: bool = False, true_sequential: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for GPTQ.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. https://arxiv.org/abs/2210.17323
- neural_compressor.torch.quantization.config.get_default_gptq_config(processor_type: str | neural_compressor.torch.utils.ProcessorType | None = None) GPTQConfig[source]
Get the default configuration of GPTQ.
- Parameters:
processor_type (Optional[Union[str, torch_utils.ProcessorType]], optional) – The user-specified processor type. Defaults to None.
- Returns:
the default GPTQ config.
- Return type:
GPTQConfig
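A hedged sketch of a custom GPTQ config using only parameters from the signature above; applying GPTQ also requires calibration data, which is handled by the quantization entry points outside this config module.

```python
from neural_compressor.torch.quantization.config import GPTQConfig, get_default_gptq_config

# Defaults, optionally specialized per processor type.
gptq_cfg = get_default_gptq_config()

# 4-bit GPTQ with activation ordering and a larger damping factor.
gptq_cfg = GPTQConfig(bits=4, group_size=128, act_order=True,
                      percdamp=0.05, block_size=2048)
```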
- class neural_compressor.torch.quantization.config.AWQConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, group_dim: int = 1, use_full_range: bool = False, use_mse_search: bool = False, use_layer_wise: bool = False, model_path: str = '', use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = True, double_quant_group_size: int = 256, quant_lm_head: bool = False, use_auto_scale: bool = True, use_auto_clip: bool = True, folding: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, absorb_layer_dict: dict = {}, **kwargs)[source]
Config class for AWQ.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. https://arxiv.org/abs/2306.00978
- neural_compressor.torch.quantization.config.get_default_awq_config() AWQConfig[source]
Generate the default awq config.
- Returns:
the default awq config.
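A minimal sketch of an AWQ config built from the documented parameters.

```python
from neural_compressor.torch.quantization.config import AWQConfig

# 4-bit AWQ with automatic scale search and weight clipping (both on by default).
awq_cfg = AWQConfig(bits=4, group_size=64, use_auto_scale=True, use_auto_clip=True)
```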
- class neural_compressor.torch.quantization.config.TEQConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, group_dim: int = 1, use_full_range: bool = False, use_mse_search: bool = False, use_layer_wise: bool = False, use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = True, double_quant_group_size: int = 256, quant_lm_head: bool = False, absorb_to_layer: dict = {}, folding: bool = True, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for TEQ.
TEQ: Trainable Equivalent Transformation for Quantization of LLMs. https://arxiv.org/abs/2310.10944
- neural_compressor.torch.quantization.config.get_default_teq_config() TEQConfig[source]
Generate the default teq config.
- Returns:
the default teq config.
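A minimal sketch of a TEQ config; the parameter values are illustrative, not recommendations.

```python
from neural_compressor.torch.quantization.config import TEQConfig, get_default_teq_config

teq_cfg = get_default_teq_config()
# Keep the trainable equivalent scales folded into the preceding layers.
teq_cfg = TEQConfig(bits=4, group_size=32, folding=True)
```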
- class neural_compressor.torch.quantization.config.AutoRoundConfig(bits: int = None, group_size: int = None, use_sym: bool = None, dtype: str = None, act_bits: int = None, act_group_size: int = None, act_sym: bool = None, act_dtype: str = None, act_dynamic: bool = None, super_bits: int = None, super_group_size: int = None, enable_full_range: bool = False, batch_size: int = 8, amp: bool = True, lr_scheduler=None, enable_quanted_input: bool = True, enable_minmax_tuning: bool = True, lr: float = None, minmax_lr: float = None, low_gpu_mem_usage: bool = False, iters: int = 200, seqlen: int = 2048, nsamples: int = 128, sampler: str = 'rand', seed: int = 42, nblocks: int = 1, gradient_accumulate_steps: int = 1, not_use_best_mse: bool = False, dynamic_max_gap: int = -1, scale_dtype: str = 'fp16', use_layer_wise: bool = False, to_quant_block_names: list = None, export_format: str = 'itrex', enable_norm_bias_tuning: bool = False, enable_torch_compile: bool = False, scheme: str | dict = 'W4A16', device_map: str | int | torch.device | dict = 0, quant_nontext_module: bool = False, extra_data_dir: str = None, processor=None, image_processor=None, template=None, truncation: bool = False, quant_lm_head: bool = False, enable_adam: bool = False, target_bits: int = None, options: str | list[str] | tuple[str, Ellipsis] = ('MXFP4', 'MXFP8'), shared_layers: Iterable[Iterable[str]] | None = None, ignore_scale_zp_bits: bool = False, auto_scheme_method: str = 'default', auto_scheme_device_map: str = None, auto_scheme_batch_size: int = None, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for AUTOROUND.
AUTOROUND: Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs. https://arxiv.org/abs/2309.05516 code: https://github.com/intel/auto-round
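A minimal sketch of an AutoRound config using the documented defaults for the tuning-related fields.

```python
from neural_compressor.torch.quantization.config import AutoRoundConfig

# W4A16 scheme; iters/seqlen/nsamples control the signed-gradient-descent
# rounding search over calibration samples.
ar_cfg = AutoRoundConfig(scheme="W4A16", iters=200, seqlen=2048,
                         nsamples=128, seed=42)
```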
- neural_compressor.torch.quantization.config.get_default_AutoRound_config(processor_type: str | neural_compressor.torch.utils.ProcessorType | None = None) AutoRoundConfig[source]
Get the default configuration of AutoRound.
- Parameters:
processor_type (Optional[Union[str, torch_utils.ProcessorType]], optional) – The user-specified processor type. Defaults to None.
- Returns:
the default AutoRound config.
- Return type:
AutoRoundConfig
- class neural_compressor.torch.quantization.config.MXQuantConfig(w_dtype: str = 'int8', act_dtype: str = 'int8', out_dtype: str = 'bfloat16', blocksize: int = 32, round_method: str = 'nearest', weight_only: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for MX quantization.
- neural_compressor.torch.quantization.config.get_default_mx_config() MXQuantConfig[source]
Generate the default mx config.
- Returns:
the default MX quant config.
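A minimal sketch using the documented MXQuantConfig defaults, switched to weight-only mode.

```python
from neural_compressor.torch.quantization.config import MXQuantConfig, get_default_mx_config

mx_cfg = get_default_mx_config()
# Weight-only MX quantization with the default int8 element dtype and block size 32.
mx_cfg = MXQuantConfig(w_dtype="int8", act_dtype="int8", blocksize=32, weight_only=True)
```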
- class neural_compressor.torch.quantization.config.DynamicQuantConfig(w_dtype: str = 'int8', w_sym: bool = True, w_granularity: str = 'per_tensor', w_algo: str = 'minmax', act_dtype: str = 'uint8', act_sym: bool = False, act_granularity: str = 'per_tensor', act_algo: str = 'kl', white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for dynamic quantization.
- neural_compressor.torch.quantization.config.get_default_dynamic_config() DynamicQuantConfig[source]
Generate the default dynamic quant config.
- Returns:
the default dynamic quant config.
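A minimal sketch of a dynamic quantization config built from the documented parameters.

```python
from neural_compressor.torch.quantization.config import DynamicQuantConfig, get_default_dynamic_config

dyn_cfg = get_default_dynamic_config()
# int8 weights, uint8 activations whose ranges are computed at runtime.
dyn_cfg = DynamicQuantConfig(w_dtype="int8", act_dtype="uint8", act_algo="kl")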
- class neural_compressor.torch.quantization.config.INT8StaticQuantConfig(w_dtype: str = 'int8', w_sym: bool = True, w_granularity: str = 'per_channel', w_algo: str = 'minmax', act_dtype: str = 'uint8', act_sym: bool = False, act_granularity: str = 'per_tensor', act_algo: str = 'minmax', excluded_precisions: list = [], white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, model_info: List[Tuple[str, Callable]] | None = None, **kwargs)[source]
Config class for static quantization.
- neural_compressor.torch.quantization.config.get_default_static_config() INT8StaticQuantConfig[source]
Generate the default static quant config.
- Returns:
the default static quant config.
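A minimal sketch of a static quantization config; the calibration pass itself is driven by the quantization entry points outside this module.

```python
from neural_compressor.torch.quantization.config import INT8StaticQuantConfig, get_default_static_config

static_cfg = get_default_static_config()
# Per-channel int8 weights, per-tensor uint8 activations with minmax observers.
static_cfg = INT8StaticQuantConfig(w_granularity="per_channel",
                                   act_granularity="per_tensor",
                                   act_algo="minmax")
```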
- class neural_compressor.torch.quantization.config.SmoothQuantConfig(w_dtype: str = 'int8', w_sym: bool = True, w_granularity: str = 'per_channel', w_algo: str = 'minmax', act_dtype: str = 'uint8', act_sym: bool = False, act_granularity: str = 'per_tensor', act_algo: str = 'minmax', excluded_precisions: list = [], alpha: float = 0.5, folding: bool = False, scale_sharing: bool = False, init_alpha: float = 0.5, alpha_min: float = 0.0, alpha_max: float = 1.0, alpha_step: float = 0.1, shared_criterion: str = 'max', do_blockwise: bool = False, auto_alpha_args: dict = None, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for smooth quantization.
- neural_compressor.torch.quantization.config.get_default_sq_config() SmoothQuantConfig[source]
Generate the default smoothquant config.
- Returns:
the default smoothquant config.
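A minimal sketch of a SmoothQuant config using the documented alpha and alpha-search fields; the values shown are the documented defaults.

```python
from neural_compressor.torch.quantization.config import SmoothQuantConfig, get_default_sq_config

sq_cfg = get_default_sq_config()
# Fixed smoothing factor alpha=0.5; alpha_min/alpha_max/alpha_step bound the
# per-layer alpha search when it is enabled.
sq_cfg = SmoothQuantConfig(alpha=0.5, folding=False,
                           alpha_min=0.0, alpha_max=1.0, alpha_step=0.1)
```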
- class neural_compressor.torch.quantization.config.HQQConfig(dtype: str = 'int', bits: int = 4, group_size: int = 64, quant_zero: bool = True, quant_scale: bool = False, scale_quant_group_size: int = 128, quant_lm_head: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Configuration class for Half-Quadratic Quantization (HQQ).
HQQ is a quantization algorithm that reduces the precision of weights and activations in neural networks. For more details, refer to the blog: https://mobiusml.github.io/hqq_blog/ and the code: https://github.com/mobiusml/hqq
- neural_compressor.torch.quantization.config.get_default_hqq_config() HQQConfig[source]
Generate the default HQQ config.
- Returns:
the default HQQ config.
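A minimal sketch of an HQQ config built from the documented parameters.

```python
from neural_compressor.torch.quantization.config import HQQConfig, get_default_hqq_config

hqq_cfg = get_default_hqq_config()
# 4-bit weights with quantized zero points; scales are kept unquantized.
hqq_cfg = HQQConfig(bits=4, group_size=64, quant_zero=True, quant_scale=False)
```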
- class neural_compressor.torch.quantization.config.FP8Config(dump_stats_path: str = './hqt_output/measure', fp8_config: str = 'E4M3', hp_dtype: str = 'bf16', blocklist: dict = {'names': [], 'types': ()}, allowlist: dict = {'names': [], 'types': get_white_list()}, mode: str = 'AUTO', scale_method: str | dict = 'maxabs_hw', scale_params: dict = {}, observer: str = 'maxabs', mod_dict: dict = {}, measure_exclude: str = 'OUTPUT', fake_quant: bool = False, use_qdq: bool = False, scale_format: str = 'scalar', measure_on_hpu: bool = True, **kwargs)[source]
Config class for FP8 quantization.
- neural_compressor.torch.quantization.config.get_default_fp8_config() FP8Config[source]
Generate the default fp8 config.
- Returns:
the default fp8 config.
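A minimal sketch of an FP8 config using the documented defaults; the measurement/quantization flow that consumes dump_stats_path runs outside this module.

```python
from neural_compressor.torch.quantization.config import FP8Config, get_default_fp8_config

fp8_cfg = get_default_fp8_config()
# E4M3 FP8 with bf16 as the high-precision dtype; measurement statistics are
# written under dump_stats_path.
fp8_cfg = FP8Config(fp8_config="E4M3", hp_dtype="bf16",
                    scale_method="maxabs_hw",
                    dump_stats_path="./hqt_output/measure")
```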
- neural_compressor.torch.quantization.config.get_default_fp8_config_set() FP8Config[source]
Generate the default fp8 config set.
- Returns:
the default fp8 config set.
- class neural_compressor.torch.quantization.config.HybridGPTQConfig(**kwargs)[source]
Config class for Hybrid Precision GPTQ quantization.
Currently supports running 4-bit weights produced by GPTQ, where the weights were double quantized from high precision to fp8 and then to int4. Activations are quantized to fp8.
- class neural_compressor.torch.quantization.config.MixedPrecisionConfig(dtype: str | List[str] = 'fp16', white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]
Config class for mixed-precision.
- neural_compressor.torch.quantization.config.get_default_mixed_precision_config() MixedPrecisionConfig[source]
Generate the default mixed-precision config.
- Returns:
the default mixed-precision config.
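A minimal sketch of a mixed-precision config; per the annotation, dtype also accepts a list of precisions.

```python
from neural_compressor.torch.quantization.config import MixedPrecisionConfig, get_default_mixed_precision_config

mp_cfg = get_default_mixed_precision_config()
# Cast eligible ops to fp16.
mp_cfg = MixedPrecisionConfig(dtype="fp16")
```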
- neural_compressor.torch.quantization.config.get_default_mixed_precision_config_set() MixedPrecisionConfig[source]
Generate the default mixed-precision config set.
- Returns:
the default mixed-precision config set.
- neural_compressor.torch.quantization.config.get_all_registered_configs() Dict[str, neural_compressor.common.base_config.BaseConfig][source]
Get all registered configs.
- neural_compressor.torch.quantization.config.get_woq_tuning_config() list[source]
Generate the config set for WOQ tuning.
- Returns:
the list of WOQ quantization configs.
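A minimal sketch of iterating over the WOQ tuning config set; the exact members (e.g. RTN/GPTQ/AWQ variants) depend on the installed release.

```python
from neural_compressor.torch.quantization.config import get_woq_tuning_config

# A list of candidate weight-only configs that a tuner can iterate over.
config_set = get_woq_tuning_config()
for candidate in config_set:
    print(type(candidate).__name__)
```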