neural_compressor.torch.quantization.algorithm_entry

Intel Neural Compressor PyTorch supported algorithm entries.

Functions

rtn_entry(→ torch.nn.Module)

The main entry to apply RTN quantization.

gptq_entry(→ torch.nn.Module)

The main entry to apply GPTQ quantization.

static_quant_entry(→ torch.nn.Module)

The main entry to apply static quantization, which includes pt2e quantization and ipex quantization.

pt2e_dynamic_quant_entry(→ torch.nn.Module)

The main entry to apply pt2e dynamic quantization.

pt2e_static_quant_entry(→ torch.nn.Module)

The main entry to apply pt2e static quantization.

smooth_quant_entry(→ torch.nn.Module)

The main entry to apply smooth quantization.

awq_quantize_entry(→ torch.nn.Module)

The main entry to apply AWQ quantization.

teq_quantize_entry(→ torch.nn.Module)

The main entry to apply TEQ quantization.

autoround_quantize_entry(→ torch.nn.Module)

The main entry to apply AutoRound quantization.

hqq_entry(→ torch.nn.Module)

The main entry to apply HQQ quantization.

fp8_entry(→ torch.nn.Module)

The main entry to apply FP8 quantization.

mx_quant_entry(→ torch.nn.Module)

The main entry to apply MX quantization.

mixed_precision_entry(→ torch.nn.Module)

The main entry to apply Mixed Precision.
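
These entries are not usually called directly: the high-level 3.x workflow builds the per-op configs_mapping from a user config and dispatches to the matching entry with the appropriate Mode. A minimal sketch of that flow, assuming the prepare/convert helpers exported by neural_compressor.torch.quantization and RTN defaults:

    import torch
    from neural_compressor.torch.quantization import RTNConfig, prepare, convert

    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
    # prepare/convert are assumed to route to rtn_entry with Mode.PREPARE / Mode.CONVERT.
    model = prepare(model, RTNConfig())  # RTN is data-free, so no calibration pass is needed
    model = convert(model)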

Module Contents

neural_compressor.torch.quantization.algorithm_entry.rtn_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.RTNConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply RTN quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], RTNConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
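
A direct-call sketch, assuming the (op_name, op_type) key layout implied by the Dict[Tuple[str, callable], RTNConfig] hint and the usual RTNConfig bits/group_size knobs; the high-level APIs normally build this mapping for you.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import RTNConfig
    from neural_compressor.torch.quantization.algorithm_entry import rtn_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    # Assumed key layout: (op_name, op_type) per the type hint above.
    configs_mapping = {("0", torch.nn.Linear): RTNConfig(bits=4, group_size=32)}
    q_model = rtn_entry(fp32_model, configs_mapping, mode=Mode.QUANTIZE)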

neural_compressor.torch.quantization.algorithm_entry.gptq_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.GPTQConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply GPTQ quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], GPTQConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
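
GPTQ is calibration-based, so the PREPARE/CONVERT split is what matters; how calibration data is fed (a plain forward loop below, typically a run_fn in the high-level flow) and the key layout are assumptions of this sketch.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import GPTQConfig
    from neural_compressor.torch.quantization.algorithm_entry import gptq_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): GPTQConfig()}
    prepared = gptq_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = gptq_entry(prepared, configs_mapping, mode=Mode.CONVERT)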

neural_compressor.torch.quantization.algorithm_entry.static_quant_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.StaticQuantConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply static quantization, which includes pt2e quantization and ipex quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], StaticQuantConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
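
Static quantization needs an activation-calibration pass between PREPARE and CONVERT. The sketch below assumes the same key layout as above; the ipex path may additionally expect example inputs through kwargs.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import StaticQuantConfig
    from neural_compressor.torch.quantization.algorithm_entry import static_quant_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): StaticQuantConfig()}
    prepared = static_quant_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = static_quant_entry(prepared, configs_mapping, mode=Mode.CONVERT)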

neural_compressor.torch.quantization.algorithm_entry.pt2e_dynamic_quant_entry(model: torch.nn.Module, configs_mapping, mode: neural_compressor.common.utils.Mode, *args, **kwargs) torch.nn.Module[source]

The main entry to apply pt2e dynamic quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping – per-op configuration.

  • mode (Mode) – select from [PREPARE, CONVERT and QUANTIZE].

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
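
configs_mapping is left untyped in this signature, so a placeholder is used below; capturing the model with torch.export is one plausible way to obtain the graph that pt2e quantization operates on (the library may provide its own export helper), and dynamic quantization needs no calibration pass.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization.algorithm_entry import pt2e_dynamic_quant_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    exported = torch.export.export(fp32_model, (torch.randn(1, 64),)).module()
    configs_mapping = {}  # placeholder: the real per-op mapping is built by the caller
    prepared = pt2e_dynamic_quant_entry(exported, configs_mapping, mode=Mode.PREPARE)
    quantized = pt2e_dynamic_quant_entry(prepared, configs_mapping, mode=Mode.CONVERT)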

neural_compressor.torch.quantization.algorithm_entry.pt2e_static_quant_entry(model: torch.nn.Module, configs_mapping, mode: neural_compressor.common.utils.Mode, *args, **kwargs) torch.nn.Module[source]

The main entry to apply pt2e static quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping – per-op configuration.

  • mode (Mode) – select from [PREPARE, CONVERT and QUANTIZE].

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
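
Same caveats as the dynamic variant above; the only difference sketched here is the calibration pass that static quantization requires before CONVERT.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization.algorithm_entry import pt2e_static_quant_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    exported = torch.export.export(fp32_model, (torch.randn(1, 64),)).module()
    configs_mapping = {}  # placeholder, as in the dynamic sketch above
    prepared = pt2e_static_quant_entry(exported, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = pt2e_static_quant_entry(prepared, configs_mapping, mode=Mode.CONVERT)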

neural_compressor.torch.quantization.algorithm_entry.smooth_quant_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.SmoothQuantConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply smooth quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], SmoothQuantConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
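
SmoothQuant derives per-channel smoothing factors from activation statistics, so it also needs a calibration pass; SmoothQuantConfig defaults are assumed here (its main knob is the migration strength alpha).

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import SmoothQuantConfig
    from neural_compressor.torch.quantization.algorithm_entry import smooth_quant_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): SmoothQuantConfig()}
    prepared = smooth_quant_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = smooth_quant_entry(prepared, configs_mapping, mode=Mode.CONVERT)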

neural_compressor.torch.quantization.algorithm_entry.awq_quantize_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.AWQConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply AWQ quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], AWQConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
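
AWQ scales weights using activation statistics, so calibration data is needed as well; the key layout and the calibration loop below are assumptions of this sketch.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import AWQConfig
    from neural_compressor.torch.quantization.algorithm_entry import awq_quantize_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): AWQConfig()}
    prepared = awq_quantize_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = awq_quantize_entry(prepared, configs_mapping, mode=Mode.CONVERT)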

neural_compressor.torch.quantization.algorithm_entry.teq_quantize_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.TEQConfig], mode: neural_compressor.common.utils.Mode, *args, **kwargs) torch.nn.Module[source]

The main entry to apply TEQ quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], TEQConfig]) – per-op configuration.

  • mode (Mode) – select from [PREPARE, CONVERT and QUANTIZE].

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
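
Note that mode has no default in this signature, so it must be passed explicitly; TEQ trains its equivalent transformation on calibration data, which the forward loop below only stands in for.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import TEQConfig
    from neural_compressor.torch.quantization.algorithm_entry import teq_quantize_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): TEQConfig()}
    prepared = teq_quantize_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = teq_quantize_entry(prepared, configs_mapping, mode=Mode.CONVERT)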

neural_compressor.torch.quantization.algorithm_entry.autoround_quantize_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.AutoRoundConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply AutoRound quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], AutoRoundConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
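
AutoRound tunes weight rounding on calibration data; in the high-level flow the tuning loop is normally driven by a run_fn, so the forward loop below is only a stand-in.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import AutoRoundConfig
    from neural_compressor.torch.quantization.algorithm_entry import autoround_quantize_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): AutoRoundConfig()}
    prepared = autoround_quantize_entry(fp32_model, configs_mapping, mode=Mode.PREPARE)
    for batch in [torch.randn(8, 64) for _ in range(4)]:  # toy calibration data
        prepared(batch)
    quantized = autoround_quantize_entry(prepared, configs_mapping, mode=Mode.CONVERT)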

neural_compressor.torch.quantization.algorithm_entry.hqq_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, Callable], neural_compressor.torch.quantization.HQQConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply HQQ quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, Callable], HQQConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
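
HQQ is data-free (it solves for quantization parameters directly from the weights), so a one-shot call is enough; the key layout follows the Dict[Tuple[str, Callable], HQQConfig] hint.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import HQQConfig
    from neural_compressor.torch.quantization.algorithm_entry import hqq_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): HQQConfig()}
    q_model = hqq_entry(fp32_model, configs_mapping, mode=Mode.QUANTIZE)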

neural_compressor.torch.quantization.algorithm_entry.fp8_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str], neural_compressor.torch.quantization.FP8Config], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply FP8 quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str], FP8Config]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
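
A sketch only: the 1-tuple key layout follows the Dict[Tuple[str], FP8Config] hint, and FP8 execution is assumed to require an Intel Gaudi (HPU) software stack.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import FP8Config
    from neural_compressor.torch.quantization.algorithm_entry import fp8_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0",): FP8Config()}  # 1-tuple key per the type hint
    q_model = fp8_entry(fp32_model, configs_mapping, mode=Mode.QUANTIZE)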

neural_compressor.torch.quantization.algorithm_entry.mx_quant_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str, callable], neural_compressor.torch.quantization.MXQuantConfig], mode: neural_compressor.common.utils.Mode = Mode.QUANTIZE, *args, **kwargs) torch.nn.Module[source]

The main entry to apply MX quantization.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str, callable], MXQuantConfig]) – per-op configuration.

  • mode (Mode, optional) – select from [PREPARE, CONVERT and QUANTIZE]. Defaults to Mode.QUANTIZE.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
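
MX quantization maps tensors to microscaling (MX) formats and, like RTN, is sketched here as a one-shot call with MXQuantConfig defaults and the assumed key layout.

    import torch
    from neural_compressor.common.utils import Mode
    from neural_compressor.torch.quantization import MXQuantConfig
    from neural_compressor.torch.quantization.algorithm_entry import mx_quant_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0", torch.nn.Linear): MXQuantConfig()}
    q_model = mx_quant_entry(fp32_model, configs_mapping, mode=Mode.QUANTIZE)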

neural_compressor.torch.quantization.algorithm_entry.mixed_precision_entry(model: torch.nn.Module, configs_mapping: Dict[Tuple[str], neural_compressor.torch.quantization.MixedPrecisionConfig], *args, **kwargs) torch.nn.Module[source]

The main entry to apply Mixed Precision.

Parameters:
  • model (torch.nn.Module) – raw fp32 model or prepared model.

  • configs_mapping (Dict[Tuple[str], MixedPrecisionConfig]) – per-op configuration.

Returns:

prepared model or quantized model.

Return type:

torch.nn.Module
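
mixed_precision_entry takes no mode argument, so it is a single call; the dtype="bf16" field and the 1-tuple key layout below are assumptions of this sketch.

    import torch
    from neural_compressor.torch.quantization import MixedPrecisionConfig
    from neural_compressor.torch.quantization.algorithm_entry import mixed_precision_entry

    fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))
    configs_mapping = {("0",): MixedPrecisionConfig(dtype="bf16")}  # assumed dtype field
    bf16_model = mixed_precision_entry(fp32_model, configs_mapping)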