# Torch

## Introduction

`neural_compressor.torch` provides a Torch-like API and integrates various model compression methods at the fine-grained `torch.nn.Module` level. It supports a comprehensive range of models, including but not limited to CV models, NLP models, and large language models. A variety of quantization methods are available, including classic INT8 quantization, SmoothQuant, and the popular weight-only quantization. Neural Compressor also provides the latest research in simulation work, such as FP8 emulation quantization and MX data type emulation quantization.

In terms of ease of use, Neural Compressor is committed to providing an easy-to-use interface and an easily extensible design: on the one hand, it reuses the PyTorch `prepare` and `convert` APIs; on the other hand, it provides a `Quantizer` base class through which `prepare` and `convert` can be conveniently customized. For more details, please refer to the link in the Neural Compressor discussion space.

So far, `neural_compressor.torch` still relies on a backend to generate the quantized model and run it on that backend. In the future, Neural Compressor plans to provide a generalized, device-agnostic Q-DQ model, so as to achieve one-time quantization and arbitrary deployment.
## Torch-like APIs

Currently, we provide the following three user scenarios through the `prepare` & `convert`, `autotune`, and `load` APIs:

1. One-time quantization of the model
2. Getting the best quantized model by setting the search scope and target
3. Direct deployment of the quantized model
### Quantization APIs

```python
def prepare(
    model: torch.nn.Module,
    quant_config: BaseConfig,
    inplace: bool = True,
    example_inputs: Any = None,
):
    """Prepare the model for calibration.

    Insert observers into the model so that it can monitor the input and output tensors during calibration.

    Args:
        model (torch.nn.Module): origin model
        quant_config (BaseConfig): quantization config
        inplace (bool, optional): It will change the given model in-place if True.
        example_inputs (tensor/tuple/dict, optional): used to trace torch model.

    Returns:
        The prepared model.
    """
```
```python
def convert(
    model: torch.nn.Module,
    quant_config: BaseConfig = None,
    inplace: bool = True,
):
    """Convert the prepared model to a quantized model.

    Args:
        model (torch.nn.Module): the prepared model
        quant_config (BaseConfig, optional): quantization config, for special usage. Defaults to None.
        inplace (bool, optional): It will change the given model in-place if True.

    Returns:
        The quantized model.
    """
```
### Autotune API

```python
def autotune(
    model: torch.nn.Module,
    tune_config: TuningConfig,
    eval_fn: Callable,
    eval_args=None,
    run_fn=None,
    run_args=None,
    example_inputs=None,
):
    """The main entry of auto-tune.

    Args:
        model (torch.nn.Module): the model to be tuned.
        tune_config (TuningConfig): tuning config that defines the search scope and target.
        eval_fn (Callable): for evaluation of quantized models.
        eval_args (tuple, optional): arguments used by eval_fn. Defaults to None.
        run_fn (Callable, optional): for calibration to quantize model. Defaults to None.
        run_args (tuple, optional): arguments used by run_fn. Defaults to None.
        example_inputs (tensor/tuple/dict, optional): used to trace torch model. Defaults to None.

    Returns:
        The quantized model.
    """
```
### Load API

`neural_compressor.torch` links the `save` function to the quantized model. If `model.save` already exists, Neural Compressor renames the previous function to `model.orig_save`.
```python
def save(self, output_dir="./saved_results"):
    """Save the quantized model.

    Args:
        self (torch.nn.Module): the quantized model.
        output_dir (str, optional): path to save the quantized model. Defaults to "./saved_results".
    """
```

```python
def load(output_dir="./saved_results", model=None):
    """The main entry of load for all algorithms.

    Args:
        output_dir (str, optional): path to the quantized model folder. Defaults to "./saved_results".
        model (torch.nn.Module, optional): original model; a model with empty weights is suggested.

    Returns:
        The quantized model.
    """
```
## Supported Matrix

| Method | Algorithm | Backend | Support Status | Usage Link |
|---|---|---|---|---|
| Weight Only Quantization | Round to Nearest (RTN) | PyTorch eager mode | ✔ | link |
| | GPTQ | PyTorch eager mode | ✔ | link |
| | AWQ | PyTorch eager mode | ✔ | link |
| | AutoRound | PyTorch eager mode | ✔ | link |
| | TEQ | PyTorch eager mode | ✔ | link |
| | HQQ | PyTorch eager mode | ✔ | link |
| Smooth Quantization | SmoothQuant | intel-extension-for-pytorch | ✔ | link |
| Static Quantization | Post-training Static Quantization | intel-extension-for-pytorch (INT8) | ✔ | link |
| | | TorchDynamo (INT8) | ✔ | link |
| | | Intel Gaudi AI accelerator (FP8) | ✔ | link |
| Dynamic Quantization | Post-training Dynamic Quantization | TorchDynamo | ✔ | link |
| MX Quantization | Microscaling Data Formats for Deep Learning | PyTorch eager mode | ✔ | link |
| Mixed Precision | Mixed Precision | PyTorch eager mode | ✔ | link |
| Quantization Aware Training | Quantization Aware Training | TorchDynamo | stay tuned | stay tuned |
## Common Problems

**1. How to choose the backend between `intel-extension-for-pytorch` and `TorchDynamo`?**

Neural Compressor provides automatic logic to detect which backend should be used, based on the imported packages:

| Environment | Automatic Backend |
|---|---|
| `import torch` | torch.dynamo |
| `import torch`<br>`import intel_extension_for_pytorch` | intel-extension-for-pytorch |

**2. How to set a different configuration for a specific op_name or op_type?**
Neural Compressor extends a `set_local` method on the global configuration object to set custom configurations.

```python
def set_local(self, operator_name_or_list: Union[List, str, Callable], config: BaseConfig) -> BaseConfig:
    """Set custom configuration based on the global configuration object.

    Args:
        operator_name_or_list (Union[List, str, Callable]): specific operator
        config (BaseConfig): specific configuration
    """
```
Demo:

```python
quant_config = RTNConfig()  # Initialize global configuration with default bits=4
quant_config.set_local(".*mlp.*", RTNConfig(bits=8))  # For layers with "mlp" in their names, set bits=8
quant_config.set_local("Conv1d", RTNConfig(dtype="fp32"))  # For Conv1d layers, do not quantize them
```
**3. How to specify an accelerator?**

Neural Compressor provides automatic accelerator detection, covering HPU, XPU, CUDA, and CPU. The automatically detected accelerator may not be suitable in some special cases, such as poor performance or memory limitations. In such situations, users can override the detected accelerator by setting the environment variable `INC_TARGET_DEVICE`.

Usage:

```bash
export INC_TARGET_DEVICE=cpu
```