neural_compressor.torch.quantization.algorithm_entry

Intel Neural Compressor PyTorch supported algorithm entries.

Functions

rtn_entry(→ torch.nn.Module)

The main entry to apply RTN quantization.

gptq_entry(→ torch.nn.Module)

The main entry to apply GPTQ quantization.

static_quant_entry(→ torch.nn.Module)

The main entry to apply static quantization, which includes pt2e quantization and ipex quantization.

pt2e_dynamic_quant_entry(→ torch.nn.Module)

The main entry to apply pt2e dynamic quantization.

pt2e_static_quant_entry(→ torch.nn.Module)

The main entry to apply pt2e static quantization.

smooth_quant_entry(→ torch.nn.Module)

The main entry to apply smooth quantization.

awq_quantize_entry(→ torch.nn.Module)

The main entry to apply AWQ quantization.

teq_quantize_entry(→ torch.nn.Module)

The main entry to apply TEQ quantization.

autoround_quantize_entry(→ torch.nn.Module)

The main entry to apply AutoRound quantization.

hqq_entry(→ torch.nn.Module)

The main entry to apply HQQ quantization.

fp8_entry(→ torch.nn.Module)

The main entry to apply FP8 quantization.

mx_quant_entry(→ torch.nn.Module)

The main entry to apply MX quantization.

mixed_precision_entry(→ torch.nn.Module)

The main entry to apply Mixed Precision.
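
These entries are not usually called directly: the high-level 3.x workflow builds the per-op configs_mapping from a user config and dispatches to the matching entry with the appropriate Mode. A minimal sketch of that flow, assuming the prepare/convert helpers exported by neural_compressor.torch.quantization and RTN defaults:

    import torch
    from neural_compressor.torch.quantization import RTNConfig, prepare, convert

    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
    # prepare/convert are assumed to route to rtn_entry with Mode.PREPARE / Mode.CONVERT.
    model = prepare(model, RTNConfig())  # RTN is data-free, so no calibration pass is needed
    model = convert(model)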

Module Contents

neural_compressor.torch.quantization.algorithm_entry.rtn_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.RTNConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply RTN quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], RTNConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
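
A direct-call sketch, assuming the (op_name, op_type) key layout implied by the Dict[Tuple[str, callable], RTNConfig] hint and the usual RTNConfig bits/group_size knobs; the high-level APIs normally build this mapping for you.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import RTNConfig
    from neural_compressor.torch.quantization.algorithm_entry import rtn_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    # Assumed key layout: (op_name, op_type) per the type hint above.
    configs_mapping = {("0", torch.nn.Linear): RTNConfig(bits=4, group_size=32)}
    q_model = rtn_entry(fp32_model, configs_mapping, mode=Mode.QUANTIZE)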

neural_compressor.torch.quantization.algorithm_entry.gptq_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.GPTQConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply GPTQ quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], GPTQConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
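
GPTQ is calibration-based, so the PREPARE/CONVERT split is what matters; how calibration data is fed (a plain forward loop below, typically a run_fn in the high-level flow) and the key layout are assumptions of this sketch.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import GPTQConfig
    from neural_compressor.torch.quantization.algorithm_entry import gptq_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): GPTQConfig()}
    prepared = gptq_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = gptq_entry(prepared, configs_mapping, mode=Mode.CONVERT)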

neural_compressor.torch.quantization.algorithm_entry.static_quant_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.StaticQuantConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply static quantization, which includes pt2e quantization and ipex quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], StaticQuantConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
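
Static quantization needs an activation-calibration pass between PREPARE and CONVERT. The sketch below assumes the same key layout as above; the ipex path may additionally expect example inputs through kwargs.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import StaticQuantConfig
    from neural_compressor.torch.quantization.algorithm_entry import static_quant_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): StaticQuantConfig()}
    prepared = static_quant_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = static_quant_entry(prepared, configs_mapping, mode=Mode.CONVERT)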

neural_compressor.torch.quantization.algorithm_entry.pt2e_dynamic_quant_entry(model: torch.nn.Module, configs_mapping, mode: neural_compressor.common.utils.Mode, *args, **kwargs) torch.nn.Module[source]

The main entry to apply pt2e dynamic quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping – per-op configuration.

  • mode (Mode) – select from [PREPARE, CONVERT and QUANTIZE].

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
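
configs_mapping is left untyped in this signature, so a placeholder is used below; capturing the model with torch.export is one plausible way to obtain the graph that pt2e quantization operates on (the library may provide its own export helper), and dynamic quantization needs no calibration pass.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization.algorithm_entry import pt2e_dynamic_quant_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    exported = torch.export.export(fp32_model, (torch.randn(1, 64),)).module()
    configs_mapping = {}  # placeholder: the real per-op mapping is built by the caller
    prepared = pt2e_dynamic_quant_entry(exported, configs_mapping, mode=Mode.PREPARE)
    quantized = pt2e_dynamic_quant_entry(prepared, configs_mapping, mode=Mode.CONVERT)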

neural_compressor.torch.quantization.algorithm_entry.pt2e_static_quant_entry(model: torch.nn.Module, configs_mapping, mode: neural_compressor.common.utils.Mode, *args, **kwargs) torch.nn.Module[source]

The main entry to apply pt2e static quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping – per-op configuration.

  • mode (Mode) – select from [PREPARE, CONVERT and QUANTIZE].

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
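
Same caveats as the dynamic variant above; the only difference sketched here is the calibration pass that static quantization requires before CONVERT.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization.algorithm_entry import pt2e_static_quant_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    exported = torch.export.export(fp32_model, (torch.randn(1, 64),)).module()
    configs_mapping = {}  # placeholder, as in the dynamic sketch above
    prepared = pt2e_static_quant_entry(exported, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = pt2e_static_quant_entry(prepared, configs_mapping, mode=Mode.CONVERT)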

neural_compressor.torch.quantization.algorithm_entry.smooth_quant_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.SmoothQuantConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply smooth quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], SmoothQuantConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
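
SmoothQuant derives per-channel smoothing factors from activation statistics, so it also needs a calibration pass; SmoothQuantConfig defaults are assumed here (its main knob is the migration strength alpha).

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import SmoothQuantConfig
    from neural_compressor.torch.quantization.algorithm_entry import smooth_quant_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): SmoothQuantConfig()}
    prepared = smooth_quant_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = smooth_quant_entry(prepared, configs_mapping, mode=Mode.CONVERT)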

neural_compressor.torch.quantization.algorithm_entry.awq_quantize_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.AWQConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply AWQ quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], AWQConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
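
AWQ scales weights using activation statistics, so calibration data is needed as well; the key layout and the calibration loop below are assumptions of this sketch.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import AWQConfig
    from neural_compressor.torch.quantization.algorithm_entry import awq_quantize_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): AWQConfig()}
    prepared = awq_quantize_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = awq_quantize_entry(prepared, configs_mapping, mode=Mode.CONVERT)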

neural_compressor.torch.quantization.algorithm_entry.teq_quantize_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.TEQConfig], mode: neural_compressor.common.utils.Mode, *args, **kwargs) torch.nn.Module[source]

The main entry to apply TEQ quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], TEQConfig]) – per-op configuration.

  • mode (Mode) – select from [PREPARE, CONVERT and QUANTIZE].

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
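
Note that mode has no default in this signature, so it must be passed explicitly; TEQ trains its equivalent transformation on calibration data, which the forward loop below only stands in for.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import TEQConfig
    from neural_compressor.torch.quantization.algorithm_entry import teq_quantize_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): TEQConfig()}
    prepared = teq_quantize_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = teq_quantize_entry(prepared, configs_mapping, mode=Mode.CONVERT)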

neural_compressor.torch.quantization.algorithm_entry.autoround_quantize_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.AutoRoundConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply AutoRound quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], AutoRoundConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
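
AutoRound tunes weight rounding on calibration data; in the high-level flow the tuning loop is normally driven by a run_fn, so the forward loop below is only a stand-in.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import AutoRoundConfig
    from neural_compressor.torch.quantization.algorithm_entry import autoround_quantize_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): AutoRoundConfig()}
    prepared = autoround_quantize_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = autoround_quantize_entry(prepared, configs_mapping, mode=Mode.CONVERT)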

neural_compressor.torch.quantization.algorithm_entry.hqq_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, Callable], neural_compressor.torch.quantization.HQQConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply HQQ quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, Callable], HQQConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
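
HQQ is data-free (it solves for quantization parameters directly from the weights), so a one-shot call is enough; the key layout follows the Dict[Tuple[str, Callable], HQQConfig] hint.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import HQQConfig
    from neural_compressor.torch.quantization.algorithm_entry import hqq_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): HQQConfig()}
    q_model = hqq_entry(fp32_model, configs_mapping, mode=Mode.QUANTIZE)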

neural_compressor.torch.quantization.algorithm_entry.fp8_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str], neural_compressor.torch.quantization.FP8Config], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply FP8 quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str], FP8Config]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
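
A sketch only: the 1-tuple key layout follows the Dict[Tuple[str], FP8Config] hint, and FP8 execution is assumed to require an Intel Gaudi (HPU) software stack.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import FP8Config
    from neural_compressor.torch.quantization.algorithm_entry import fp8_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0",): FP8Config()}  # 1-tuple key per the type hint
    q_model = fp8_entry(fp32_model, configs_mapping, mode=Mode.QUANTIZE)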

neural_compressor.torch.quantization.algorithm_entry.mx_quant_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.MXQuantConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply MX quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], MXQuantConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
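
MX quantization maps tensors to microscaling (MX) formats and, like RTN, is sketched here as a one-shot call with MXQuantConfig defaults and the assumed key layout.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import MXQuantConfig
    from neural_compressor.torch.quantization.algorithm_entry import mx_quant_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): MXQuantConfig()}
    q_model = mx_quant_entry(fp32_model, configs_mapping, mode=Mode.QUANTIZE)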

neural_compressor.torch.quantization.algorithm_entry.mixed_precision_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str], neural_compressor.torch.quantization.MixedPrecisionConfig], *args, **kwargs) torch.nn.Module[source]

The main entry to apply Mixed Precision.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str], MixedPrecisionConfig]) – per-op configuration.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
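
mixed_precision_entry takes no mode argument, so it is a single call; the dtype="bf16" field and the 1-tuple key layout below are assumptions of this sketch.

    import torch
    from neural_compressor.torch.quantization import MixedPrecisionConfig
    from neural_compressor.torch.quantization.algorithm_entry import mixed_precision_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0",): MixedPrecisionConfig(dtype="bf16")}  # assumed dtype field
    bf16_model = mixed_precision_entry(fp32_model, configs_mapping)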