neural_compressor.torch.quantization.config

Intel Neural Compressor PyTorch quantization config API.

Classes

OperatorConfig

Operator-level configuration.

TorchBaseConfig

Base config class for torch backend.

RTNConfig

Config class for round-to-nearest weight-only quantization.

GPTQConfig

Config class for GPTQ.

AWQConfig

Config class for AWQ.

TEQConfig

Config class for TEQ.

AutoRoundConfig

Config class for AutoRound.

MXQuantConfig

Config class for MX quantization.

DynamicQuantConfig

Config class for dynamic quantization.

StaticQuantConfig

Config class for static quantization.

SmoothQuantConfig

Config class for smooth quantization.

HQQConfig

Configuration class for Half-Quadratic Quantization (HQQ).

FP8Config

Config class for FP8 quantization.

MixedPrecisionConfig

Config class for mixed-precision.

Functions

get_default_rtn_config(→ RTNConfig)

Get the default configuration of RTN.

get_default_double_quant_config([type])

Get the default configuration of double quant.

get_default_gptq_config(→ GPTQConfig)

Get the default configuration of GPTQ.

get_default_awq_config(→ AWQConfig)

Generate the default AWQ config.

get_default_teq_config(→ TEQConfig)

Generate the default TEQ config.

get_default_AutoRound_config(→ AutoRoundConfig)

Get the default configuration of AutoRound.

get_default_mx_config(→ MXQuantConfig)

Generate the default MX config.

get_default_dynamic_config(→ DynamicQuantConfig)

Generate the default dynamic quant config.

get_default_static_config(→ StaticQuantConfig)

Generate the default static quant config.

get_default_sq_config(→ SmoothQuantConfig)

Generate the default SmoothQuant config.

get_default_hqq_config(→ HQQConfig)

Generate the default HQQ config.

get_default_fp8_config(→ FP8Config)

Generate the default FP8 config.

get_default_fp8_config_set(→ FP8Config)

Generate the default FP8 config set.

get_default_mixed_precision_config(→ MixedPrecisionConfig)

Generate the default mixed-precision config.

get_default_mixed_precision_config_set(...)

Generate the default mixed-precision config set.

get_all_registered_configs(→ Dict[str, ...)

Get all registered configs.

get_woq_tuning_config(→ list)

Generate the config set for WOQ tuning.

Module Contents

class neural_compressor.torch.quantization.config.OperatorConfig[source]

Operator-level configuration.

class neural_compressor.torch.quantization.config.TorchBaseConfig(white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST)[source]

Base config class for torch backend.

class neural_compressor.torch.quantization.config.RTNConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, group_dim: int = 1, use_full_range: bool = False, use_mse_search: bool = False, use_layer_wise: bool = False, model_path: str = '', use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = False, double_quant_group_size: int = 256, quant_lm_head: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]

Config class for round-to-nearest weight-only quantization.

neural_compressor.torch.quantization.config.get_default_rtn_config(processor_type: str | neural_compressor.torch.utils.ProcessorType | None = None) RTNConfig[source]

Get the default configuration of RTN.

Parameters:

processor_type (Optional[Union[str, torch_utils.ProcessorType]], optional) – The user-specified processor type. Defaults to None.

Returns:

The default RTN config.

Return type:

RTNConfig
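
The defaults above can be taken as-is or overridden with the keyword arguments listed in the RTNConfig signature. A minimal sketch follows; the toy model and the prepare/convert entry points imported from neural_compressor.torch.quantization are assumptions for illustration, not something defined in this module.

    import torch
    from neural_compressor.torch.quantization.config import RTNConfig, get_default_rtn_config

    # Hypothetical toy model, used only for illustration.
    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))

    # Take the defaults, or override knobs from the RTNConfig signature.
    quant_config = get_default_rtn_config()
    quant_config = RTNConfig(bits=8, group_size=128, use_sym=False)

    # Assumed prepare/convert workflow from neural_compressor.torch.quantization.
    from neural_compressor.torch.quantization import prepare, convert
    model = prepare(model, quant_config)
    model = convert(model)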

neural_compressor.torch.quantization.config.get_default_double_quant_config(type='BNB_NF4')[source]

Get the default configuration of double quant.

Parameters:

type (str, optional) – double quant type. Defaults to “BNB_NF4”.

Returns:

double quant config.

Return type:

dict
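
A short sketch of retrieving a double-quant preset; the "BNB_NF4" string is the default noted above, and treating the result as a dict follows the documented return type.

    from neural_compressor.torch.quantization.config import get_default_double_quant_config

    # Returns a dict describing the double-quant settings for the chosen preset.
    double_quant_config = get_default_double_quant_config(type="BNB_NF4")
    print(double_quant_config)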

class neural_compressor.torch.quantization.config.GPTQConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, use_mse_search: bool = False, use_layer_wise: bool = False, model_path: str = '', use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = False, double_quant_group_size: int = 256, quant_lm_head: bool = False, act_order: bool = False, percdamp: float = 0.01, block_size: int = 2048, static_groups: bool = False, true_sequential: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]

Config class for GPTQ.

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. https://arxiv.org/abs/2210.17323

neural_compressor.torch.quantization.config.get_default_gptq_config(processor_type: str | neural_compressor.torch.utils.ProcessorType | None = None) GPTQConfig[source]

Get the default configuration of GPTQ.

Parameters:

processor_type (Optional[Union[str, torch_utils.ProcessorType]], optional) – The user-specified processor type. Defaults to None.

Returns:

The default GPTQ config.

Return type:

GPTQConfig
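
A sketch of a customized GPTQConfig built from the keyword arguments in the signature above; the calibration remark reflects the usual prepare/calibrate/convert flow and is an assumption, not something defined in this module.

    from neural_compressor.torch.quantization.config import GPTQConfig

    quant_config = GPTQConfig(
        bits=4,
        group_size=128,
        act_order=True,    # reorder columns by activation importance
        block_size=1024,   # size of the weight blocks updated per step
        percdamp=0.01,     # Hessian dampening percentage
    )
    # GPTQ is calibration-based: run a few representative batches through the
    # prepared model before converting it.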

class neural_compressor.torch.quantization.config.AWQConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, group_dim: int = 1, use_full_range: bool = False, use_mse_search: bool = False, use_layer_wise: bool = False, model_path: str = '', use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = True, double_quant_group_size: int = 256, quant_lm_head: bool = False, use_auto_scale: bool = True, use_auto_clip: bool = True, folding: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, absorb_layer_dict: dict = {}, **kwargs)[source]

Config class for AWQ.

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. https://arxiv.org/abs/2306.00978

neural_compressor.torch.quantization.config.get_default_awq_config() AWQConfig[source]

Generate the default AWQ config.

Returns:

the default AWQ config.
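
A sketch of the two ways to obtain an AWQ config, using only the parameters shown in the AWQConfig signature:

    from neural_compressor.torch.quantization.config import AWQConfig, get_default_awq_config

    quant_config = get_default_awq_config()
    # Or toggle the AWQ-specific switches explicitly:
    quant_config = AWQConfig(bits=4, group_size=128, use_auto_scale=True, use_auto_clip=True, folding=False)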

class neural_compressor.torch.quantization.config.TEQConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = True, group_size: int = 32, group_dim: int = 1, use_full_range: bool = False, use_mse_search: bool = False, use_layer_wise: bool = False, use_double_quant: bool = False, double_quant_dtype: str = 'int', double_quant_bits: int = 8, double_quant_use_sym: bool = True, double_quant_group_size: int = 256, quant_lm_head: bool = False, absorb_to_layer: dict = {}, folding: bool = True, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]

Config class for TEQ.

TEQ: Trainable Equivalent Transformation for Quantization of LLMs. https://arxiv.org/abs/2310.10944

neural_compressor.torch.quantization.config.get_default_teq_config() TEQConfig[source]

Generate the default TEQ config.

Returns:

the default TEQ config.

class neural_compressor.torch.quantization.config.AutoRoundConfig(dtype: str = 'int', bits: int = 4, use_sym: bool = False, group_size: int = 128, act_bits: int = 32, act_group_size: int = None, act_sym: bool = None, act_dynamic: bool = True, enable_full_range: bool = False, batch_size: int = 8, lr_scheduler=None, enable_quanted_input: bool = True, enable_minmax_tuning: bool = True, lr: float = None, minmax_lr: float = None, low_gpu_mem_usage: bool = False, iters: int = 200, seqlen: int = 2048, nsamples: int = 128, sampler: str = 'rand', seed: int = 42, nblocks: int = 1, gradient_accumulate_steps: int = 1, not_use_best_mse: bool = False, dynamic_max_gap: int = -1, scale_dtype: str = 'fp16', use_layer_wise: bool = False, quant_block_list: list = None, export_format: str = 'itrex', white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]

Config class for AutoRound.

AutoRound: Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs. https://arxiv.org/abs/2309.05516 code: https://github.com/intel/auto-round

neural_compressor.torch.quantization.config.get_default_AutoRound_config(processor_type: str | neural_compressor.torch.utils.ProcessorType | None = None) AutoRoundConfig[source]

Get the default configuration of AutoRound.

Parameters:

processor_type (Optional[Union[str, torch_utils.ProcessorType]], optional) – The user-specified processor type. Defaults to None.

Returns:

The default AutoRound config.

Return type:

AutoRoundConfig
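
A sketch of an AutoRoundConfig whose tuning budget is set through the iters/nsamples/seqlen/batch_size arguments from the signature above; the particular values are illustrative only.

    from neural_compressor.torch.quantization.config import AutoRoundConfig

    quant_config = AutoRoundConfig(bits=4, group_size=128, iters=200, nsamples=128, seqlen=2048, batch_size=8)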

class neural_compressor.torch.quantization.config.MXQuantConfig(w_dtype: str = 'int8', act_dtype: str = 'int8', out_dtype: str = 'bfloat16', blocksize: int = 32, round_method: str = 'nearest', weight_only: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]

Config class for MX quantization.

neural_compressor.torch.quantization.config.get_default_mx_config() MXQuantConfig[source]

Generate the default MX config.

Returns:

the default MX config.
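
A sketch of a weight-only MX config using only parameters from the MXQuantConfig signature:

    from neural_compressor.torch.quantization.config import MXQuantConfig

    # Weight-only MX quantization with the default 32-element block size.
    quant_config = MXQuantConfig(weight_only=True, blocksize=32, round_method="nearest")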

class neural_compressor.torch.quantization.config.DynamicQuantConfig(w_dtype: str = 'int8', w_sym: bool = True, w_granularity: str = 'per_tensor', w_algo: str = 'minmax', act_dtype: str = 'uint8', act_sym: bool = False, act_granularity: str = 'per_tensor', act_algo: str = 'kl', white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]

Config class for dynamic quantization.

neural_compressor.torch.quantization.config.get_default_dynamic_config() DynamicQuantConfig[source]

Generate the default dynamic quant config.

Returns:

the default dynamic quant config.
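
A sketch showing the default dynamic-quant config next to its explicit equivalent built from the signature defaults:

    from neural_compressor.torch.quantization.config import DynamicQuantConfig, get_default_dynamic_config

    quant_config = get_default_dynamic_config()
    # Explicit form of the same defaults:
    quant_config = DynamicQuantConfig(w_dtype="int8", act_dtype="uint8", act_granularity="per_tensor", act_algo="kl")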

class neural_compressor.torch.quantization.config.StaticQuantConfig(w_dtype: str = 'int8', w_sym: bool = True, w_granularity: str = 'per_channel', w_algo: str = 'minmax', act_dtype: str = 'uint8', act_sym: bool = False, act_granularity: str = 'per_tensor', act_algo: str = 'minmax', excluded_precisions: list = [], white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, model_info: List[Tuple[str, Callable]] | None = None, **kwargs)[source]

Config class for static quantization.

neural_compressor.torch.quantization.config.get_default_static_config() StaticQuantConfig[source]

Generate the default static quant config.

Returns:

the default static quant config.
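
A sketch of a static-quant config with the per-channel weight / per-tensor activation layout from the signature defaults; the calibration remark describes the usual flow and is an assumption here.

    from neural_compressor.torch.quantization.config import StaticQuantConfig, get_default_static_config

    quant_config = get_default_static_config()
    quant_config = StaticQuantConfig(w_granularity="per_channel", act_granularity="per_tensor", act_algo="minmax")
    # Static quantization needs calibration: feed representative data through the
    # prepared model before conversion.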

class neural_compressor.torch.quantization.config.SmoothQuantConfig(w_dtype: str = 'int8', w_sym: bool = True, w_granularity: str = 'per_channel', w_algo: str = 'minmax', act_dtype: str = 'uint8', act_sym: bool = False, act_granularity: str = 'per_tensor', act_algo: str = 'minmax', excluded_precisions: list = [], alpha: float = 0.5, folding: bool = False, scale_sharing: bool = False, init_alpha: float = 0.5, alpha_min: float = 0.0, alpha_max: float = 1.0, alpha_step: float = 0.1, shared_criterion: str = 'max', do_blockwise: bool = False, auto_alpha_args: dict = None, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]

Config class for smooth quantization.

neural_compressor.torch.quantization.config.get_default_sq_config() SmoothQuantConfig[source]

Generate the default SmoothQuant config.

Returns:

the default SmoothQuant config.
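
A sketch of a SmoothQuantConfig using parameters from the signature above; alpha is the SmoothQuant migration strength, and the remaining values are illustrative.

    from neural_compressor.torch.quantization.config import SmoothQuantConfig

    quant_config = SmoothQuantConfig(alpha=0.7, folding=False, alpha_min=0.3, alpha_max=0.9, alpha_step=0.1)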

class neural_compressor.torch.quantization.config.HQQConfig(dtype: str = 'int', bits: int = 4, group_size: int = 64, quant_zero: bool = True, quant_scale: bool = False, scale_quant_group_size: int = 128, quant_lm_head: bool = False, white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]

Configuration class for Half-Quadratic Quantization (HQQ).

HQQ is a quantization algorithm that reduces the precision of weights and activations in neural networks. For more details, refer to the blog: https://mobiusml.github.io/hqq_blog/ and the code: https://github.com/mobiusml/hqq

neural_compressor.torch.quantization.config.get_default_hqq_config() HQQConfig[source]

Generate the default HQQ config.

Returns:

the default HQQ config.
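
A sketch of an HQQConfig; per the parameter names above, quant_zero and quant_scale control whether the zero-points and scales are themselves quantized.

    from neural_compressor.torch.quantization.config import HQQConfig

    quant_config = HQQConfig(bits=4, group_size=64, quant_zero=True, quant_scale=False)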

class neural_compressor.torch.quantization.config.FP8Config(dump_stats_path: str = './hqt_output/measure', fp8_config: str = 'E4M3', hp_dtype: str = 'bf16', blocklist: dict = {'names': [], 'types': ()}, allowlist: dict = {'names': [], 'types': FP8_WHITE_LIST}, mode: str = 'AUTO', scale_method: str = 'maxabs_hw', scale_params: dict = {}, observer: str = 'maxabs', mod_dict: dict = {}, measure_exclude: str = 'OUTPUT', fake_quant: bool = False, scale_format: str = 'const', **kwargs)[source]

Config class for FP8 quantization.

neural_compressor.torch.quantization.config.get_default_fp8_config() FP8Config[source]

Generate the default FP8 config.

Returns:

the default FP8 config.
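
A sketch of an FP8Config built from the signature defaults; the dump path below is the default location shown above.

    from neural_compressor.torch.quantization.config import FP8Config, get_default_fp8_config

    quant_config = get_default_fp8_config()
    # Or pick the FP8 format and where measurement statistics are dumped:
    quant_config = FP8Config(fp8_config="E4M3", scale_method="maxabs_hw", dump_stats_path="./hqt_output/measure")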

neural_compressor.torch.quantization.config.get_default_fp8_config_set() FP8Config[source]

Generate the default FP8 config set.

Returns:

the default FP8 config.

class neural_compressor.torch.quantization.config.MixedPrecisionConfig(dtype: str | List[str] = 'fp16', white_list: List[neural_compressor.common.utils.OP_NAME_OR_MODULE_TYPE] | None = DEFAULT_WHITE_LIST, **kwargs)[source]

Config class for mixed-precision.

neural_compressor.torch.quantization.config.get_default_mixed_precision_config() MixedPrecisionConfig[source]

Generate the default mixed-precision config.

Returns:

the default mixed-precision config.

neural_compressor.torch.quantization.config.get_default_mixed_precision_config_set() MixedPrecisionConfig[source]

Generate the default mixed-precision config set.

Returns:

the default mixed-precision config.
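
A sketch of a MixedPrecisionConfig; per the signature above, dtype accepts a single precision string or a list of them.

    from neural_compressor.torch.quantization.config import MixedPrecisionConfig

    quant_config = MixedPrecisionConfig(dtype="fp16")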

neural_compressor.torch.quantization.config.get_all_registered_configs() Dict[str, neural_compressor.common.base_config.BaseConfig][source]

Get all registered configs.

neural_compressor.torch.quantization.config.get_woq_tuning_config() list[source]

Generate the config set for WOQ tuning.

Returns:

the list of WOQ quant configs.
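
A sketch of retrieving the WOQ tuning config set; feeding it to an accuracy-aware tuning entry point (such as an autotune API) is an assumption and not defined in this module.

    from neural_compressor.torch.quantization.config import get_woq_tuning_config

    config_set = get_woq_tuning_config()
    for cfg in config_set:
        print(type(cfg).__name__)  # candidate weight-only quant configs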