Torch
=================================================

1. [Introduction](#introduction)
2. [Torch-like APIs](#torch-like-apis)
3. [Supported Matrix](#supported-matrix)
4. [Common Problems](#common-problems)

## Introduction

`neural_compressor.torch` provides a Torch-like API and integrates various model compression methods at the granularity of `torch.nn.Module`. It supports a comprehensive range of models, including but not limited to CV models, NLP models, and large language models.

A variety of quantization methods are available, including classic INT8 quantization, SmoothQuant, and the popular weight-only quantization. Neural Compressor also provides emulation of the latest research data types, such as FP8 emulation quantization and MX data type emulation quantization.

In terms of ease of use, Neural Compressor is committed to an intuitive user interface and an easily extensible design: on the one hand, it reuses the PyTorch `prepare` and `convert` APIs; on the other hand, the `Quantizer` base class makes it convenient to customize `prepare` and `convert` for new algorithms. For more details, please refer to [this discussion](https://github.com/intel/neural-compressor/discussions/1527) in the Neural Compressor discussion space.

So far, `neural_compressor.torch` still relies on a backend to generate the quantized model and run it on that backend. In the future, `neural_compressor` plans to provide a generalized, device-agnostic Q-DQ model, so as to achieve one-time quantization and arbitrary deployment.

## Torch-like APIs

Currently, we provide three user scenarios through the `prepare` & `convert`, `autotune`, and `load` APIs:

- One-time quantization of the model (a minimal sketch follows the API definitions below)
- Getting the best quantized model by setting a search scope and target
- Direct deployment of the quantized model

### Quantization APIs

```python
def prepare(
    model: torch.nn.Module,
    quant_config: BaseConfig,
    inplace: bool = True,
    example_inputs: Any = None,
):
    """Prepare the model for calibration.

    Insert observers into the model so that it can monitor the input and output tensors during calibration.

    Args:
        model (torch.nn.Module): origin model.
        quant_config (BaseConfig): quantization config.
        inplace (bool, optional): whether to change the given model in place. Defaults to True.
        example_inputs (tensor/tuple/dict, optional): used to trace the torch model. Defaults to None.

    Returns:
        The prepared model.
    """
```

```python
def convert(
    model: torch.nn.Module,
    quant_config: BaseConfig = None,
    inplace: bool = True,
):
    """Convert the prepared model to a quantized model.

    Args:
        model (torch.nn.Module): the prepared model.
        quant_config (BaseConfig, optional): quantization config for special usage. Defaults to None.
        inplace (bool, optional): whether to change the given model in place. Defaults to True.

    Returns:
        The quantized model.
    """
```
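For instance, one-time quantization follows a `prepare` → calibrate → `convert` flow. Below is a minimal sketch, assuming an INT8 static quantization config class named `StaticQuantConfig` is importable from `neural_compressor.torch.quantization` and that a suitable backend for static quantization (see the support matrix below) is installed; the toy model and random calibration data are placeholders.

```python
import torch
from neural_compressor.torch.quantization import StaticQuantConfig, convert, prepare

# Toy FP32 model and example input; replace with a real model and dataset.
fp32_model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.ReLU())
example_inputs = (torch.randn(1, 10),)

# 1. Insert observers into the model.
prepared_model = prepare(fp32_model, quant_config=StaticQuantConfig(), example_inputs=example_inputs)

# 2. Calibrate by running representative data through the prepared model.
for _ in range(10):
    prepared_model(torch.randn(1, 10))

# 3. Convert the calibrated model into the quantized model.
q_model = convert(prepared_model)
```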
""" ``` ### Load API `neural_compressor.torch` links the save function to the quantized model. If `model.save` already exists, Neural Compressor renames the previous function to `model.orig_save`. ```python def save(self, output_dir="./saved_results"): """ Args: self (torch.nn.Module): the quantized model. output_dir (str, optional): path to save the quantized model """ ``` ```python def load(output_dir="./saved_results", model=None): """The main entry of load for all algorithms. Args: output_dir (str, optional): path to quantized model folder. Defaults to "./saved_results". model (torch.nn.Module, optional): original model, suggest to use empty tensor. Returns: The quantized model """ ``` ## Supported Matrix
## Supported Matrix

| Method | Algorithm | Backend | Support Status | Usage Link |
|---|---|---|---|---|
| Weight Only Quantization | Round to Nearest (RTN) | PyTorch eager mode | ✔ | link |
| | GPTQ | PyTorch eager mode | ✔ | link |
| | AWQ | PyTorch eager mode | ✔ | link |
| | AutoRound | PyTorch eager mode | ✔ | link |
| | TEQ | PyTorch eager mode | ✔ | link |
| | HQQ | PyTorch eager mode | ✔ | link |
| Smooth Quantization | SmoothQuant | intel-extension-for-pytorch | ✔ | link |
| Static Quantization | Post-training Static Quantization | intel-extension-for-pytorch (INT8) | ✔ | link |
| | | TorchDynamo (INT8) | ✔ | link |
| | | Intel Gaudi AI accelerator (FP8) | ✔ | link |
| Dynamic Quantization | Post-training Dynamic Quantization | TorchDynamo | ✔ | link |
| MX Quantization | Microscaling Data Formats for Deep Learning | PyTorch eager mode | ✔ | link |
| Mixed Precision | Mixed Precision | PyTorch eager mode | ✔ | link |
| Quantization Aware Training | Quantization Aware Training | TorchDynamo | stay tuned | stay tuned |
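As an illustration of the eager-mode weight-only rows above, RTN needs no calibration data, so `prepare` and `convert` alone are enough. A minimal sketch, assuming `RTNConfig` is importable from `neural_compressor.torch.quantization`:

```python
import torch
from neural_compressor.torch.quantization import RTNConfig, convert, prepare

fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64))

# Weight-only RTN: no observers or calibration loop are needed.
model = prepare(fp32_model, quant_config=RTNConfig())
q_model = convert(model)
```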
## Common Problems

For quantization methods that support multiple backends, the backend is selected automatically according to the packages imported in the environment:

| Environment | Automatic Backend |
|---|---|
| `import torch` | torch.dynamo |
| `import torch`<br>`import intel_extension_for_pytorch` | intel-extension-for-pytorch |
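A quick way to check which backend will be picked in the current environment is to probe the import (a sketch; intel-extension-for-pytorch may or may not be installed):

```python
import torch  # with only torch imported, the TorchDynamo backend is used

try:
    # If this import succeeds, the intel-extension-for-pytorch backend is selected.
    import intel_extension_for_pytorch  # noqa: F401
    print("intel-extension-for-pytorch backend will be used")
except ImportError:
    print("TorchDynamo backend will be used")
```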