# Torch

## Introduction

`neural_compressor.torch` provides a Torch-like API and integrates various model compression methods at the fine-grained `torch.nn.Module` level. It supports a comprehensive range of models, including but not limited to CV models, NLP models, and large language models. A variety of quantization methods are available, including classic INT8 quantization, SmoothQuant, and the popular weight-only quantization. Neural Compressor also provides the latest research in simulation work, such as FP8 emulation quantization and MX data type emulation quantization.

In terms of ease of use, Neural Compressor is committed to providing an easy-to-use interface and an easily extensible design: on the one hand, it reuses the PyTorch `prepare` and `convert` APIs; on the other hand, it provides a `Quantizer` base class through which `prepare` and `convert` can be conveniently customized. For more details, please refer to the link in the Neural Compressor discussion space.

So far, `neural_compressor.torch` still relies on a backend to generate the quantized model and run it on that backend. In the future, Neural Compressor plans to provide a generalized, device-agnostic Q-DQ model, so as to achieve one-time quantization and arbitrary deployment.
## Torch-like APIs

Currently, we provide the following three user scenarios through the `prepare` & `convert`, `autotune`, and `load` APIs:

1. One-time quantization of the model
2. Getting the best quantized model by setting the search scope and target
3. Direct deployment of the quantized model
### Quantization APIs

```python
def prepare(
    model: torch.nn.Module,
    quant_config: BaseConfig,
    inplace: bool = True,
    example_inputs: Any = None,
):
    """Prepare the model for calibration.

    Insert observers into the model so that it can monitor the input and output tensors during calibration.

    Args:
        model (torch.nn.Module): origin model
        quant_config (BaseConfig): quantization config
        inplace (bool, optional): It will change the given model in-place if True.
        example_inputs (tensor/tuple/dict, optional): used to trace torch model.

    Returns:
        The prepared model.
    """
```
```python
def convert(
    model: torch.nn.Module,
    quant_config: BaseConfig = None,
    inplace: bool = True,
):
    """Convert the prepared model to a quantized model.

    Args:
        model (torch.nn.Module): the prepared model
        quant_config (BaseConfig, optional): quantization config, for special usage. Defaults to None.
        inplace (bool, optional): It will change the given model in-place if True.

    Returns:
        The quantized model.
    """
```
### Autotune API

```python
def autotune(
    model: torch.nn.Module,
    tune_config: TuningConfig,
    eval_fn: Callable,
    eval_args=None,
    run_fn=None,
    run_args=None,
    example_inputs=None,
):
    """The main entry of auto-tune.

    Args:
        model (torch.nn.Module): the model to be tuned.
        tune_config (TuningConfig): tuning config that defines the search scope and target.
        eval_fn (Callable): for evaluation of quantized models.
        eval_args (tuple, optional): arguments used by eval_fn. Defaults to None.
        run_fn (Callable, optional): for calibration to quantize model. Defaults to None.
        run_args (tuple, optional): arguments used by run_fn. Defaults to None.
        example_inputs (tensor/tuple/dict, optional): used to trace torch model. Defaults to None.

    Returns:
        The quantized model.
    """
```
### Load API

`neural_compressor.torch` links the `save` function to the quantized model. If `model.save` already exists, Neural Compressor renames the previous function to `model.orig_save`.
```python
def save(self, output_dir="./saved_results"):
    """Save the quantized model.

    Args:
        self (torch.nn.Module): the quantized model.
        output_dir (str, optional): path to save the quantized model. Defaults to "./saved_results".
    """
```

```python
def load(output_dir="./saved_results", model=None):
    """The main entry of load for all algorithms.

    Args:
        output_dir (str, optional): path to the quantized model folder. Defaults to "./saved_results".
        model (torch.nn.Module, optional): original model; a model with empty weights is suggested.

    Returns:
        The quantized model.
    """
```
## Supported Matrix

| Method | Algorithm | Backend | Support Status | Usage Link |
|---|---|---|---|---|
| Weight Only Quantization | Round to Nearest (RTN) | PyTorch eager mode | ✔ | link |
| | GPTQ | PyTorch eager mode | ✔ | link |
| | AWQ | PyTorch eager mode | ✔ | link |
| | AutoRound | PyTorch eager mode | ✔ | link |
| | TEQ | PyTorch eager mode | ✔ | link |
| | HQQ | PyTorch eager mode | ✔ | link |
| Smooth Quantization | SmoothQuant | intel-extension-for-pytorch | ✔ | link |
| Static Quantization | Post-training Static Quantization | intel-extension-for-pytorch (INT8) | ✔ | link |
| | | TorchDynamo (INT8) | ✔ | link |
| | | Intel Gaudi AI accelerator (FP8) | ✔ | link |
| Dynamic Quantization | Post-training Dynamic Quantization | TorchDynamo | ✔ | link |
| MX Quantization | Microscaling Data Formats for Deep Learning | PyTorch eager mode | ✔ | link |
| Mixed Precision | Mixed Precision | PyTorch eager mode | ✔ | link |
| Quantization Aware Training | Quantization Aware Training | TorchDynamo | stay tuned | stay tuned |
## Common Problems

**1. How to choose the backend between `intel-extension-for-pytorch` and `TorchDynamo`?**

Neural Compressor provides automatic logic to detect which backend should be used, based on the imported packages:

| Environment | Automatic Backend |
|---|---|
| `import torch` | torch.dynamo |
| `import torch`<br>`import intel_extension_for_pytorch` | intel-extension-for-pytorch |

**2. How to set a different configuration for a specific op_name or op_type?**
Neural Compressor extends a `set_local` method on the global configuration object to set custom configurations.

```python
def set_local(self, operator_name_or_list: Union[List, str, Callable], config: BaseConfig) -> BaseConfig:
    """Set custom configuration based on the global configuration object.

    Args:
        operator_name_or_list (Union[List, str, Callable]): specific operator
        config (BaseConfig): specific configuration
    """
```
Demo:

```python
quant_config = RTNConfig()  # Initialize global configuration with default bits=4
quant_config.set_local(".*mlp.*", RTNConfig(bits=8))  # For layers with "mlp" in their names, set bits=8
quant_config.set_local("Conv1d", RTNConfig(dtype="fp32"))  # For Conv1d layers, do not quantize them
```
**3. How to specify an accelerator?**

Neural Compressor provides automatic accelerator detection, covering HPU, XPU, CUDA, and CPU. The automatically detected accelerator may not be suitable in some special cases, such as poor performance or memory limitations. In such situations, users can override the detected accelerator by setting the environment variable `INC_TARGET_DEVICE`.

Usage:

```bash
export INC_TARGET_DEVICE=cpu
```