API Documentation
=================

## Introduction

Intel® Neural Compressor is an open-source Python library designed to help users quickly deploy low-precision inference solutions on popular deep learning (DL) frameworks such as TensorFlow*, PyTorch*, MXNet, and ONNX Runtime. It automatically optimizes low-precision recipes for deep learning models to achieve optimal product objectives, such as inference performance and memory usage, while meeting expected accuracy criteria.

## User-facing APIs

These APIs are intended to unify low-precision quantization interfaces across multiple DL frameworks for the best out-of-the-box experience.

> **Note**
>
> Neural Compressor is continuously improving user-facing APIs to create a better user experience.
> Two sets of user-facing APIs exist. One is the default set, supported since Neural Compressor v1.0 and kept for backwards compatibility. The other consists of the new APIs in the `neural_compressor.experimental` package.
> We recommend that you use the APIs located in `neural_compressor.experimental`. All examples have been updated to use the experimental APIs.

The major differences between the default user-facing APIs and the experimental APIs are:

1. The experimental APIs abstract the `neural_compressor.experimental.common.Model` concept to cover those cases whose weight and graph files are stored separately.
2. The experimental APIs unify the calling style of the `Quantization`, `Pruning`, and `Benchmark` classes by setting the model, calibration dataloader, evaluation dataloader, and metric through class attributes rather than passing them as function inputs.
3. The experimental APIs refine Neural Compressor built-in transforms/datasets/metrics by unifying the APIs across different framework backends.

## Experimental user-facing APIs

Experimental user-facing APIs consist of the following components:

### Quantization-related APIs

```python
# neural_compressor.experimental.Quantization
class Quantization(object):
    def __init__(self, conf_fname_or_obj):
        ...

    def __call__(self):
        ...

    @property
    def calib_dataloader(self):
        ...

    @property
    def eval_dataloader(self):
        ...

    @property
    def model(self):
        ...

    @property
    def metric(self):
        ...

    @property
    def postprocess(self):
        ...

    @property
    def q_func(self):
        ...

    @property
    def eval_func(self):
        ...
```

The `conf_fname_or_obj` parameter used in the class initialization is the path to the user yaml configuration file or a `Quantization_Conf` object. This yaml file is used to control the entire tuning behavior on the model.

**Neural Compressor User YAML Syntax**

> Intel® Neural Compressor provides template yaml files for [Post-Training Quantization](../neural_compressor/template/ptq.yaml), [Quantization-Aware Training](../neural_compressor/template/qat.yaml), and [Pruning](../neural_compressor/template/pruning.yaml) scenarios. Refer to these template files to understand the meaning of each field.

> Note that most fields in the yaml templates are optional. View the [HelloWorld Yaml](../examples/helloworld/tf_example2/conf.yaml) example for reference.
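Concretely, a minimal configuration might look like the sketch below, loosely modeled on the Post-Training Quantization template. The model name and threshold values are placeholder assumptions; the templates above remain the authoritative reference for the schema.

```yaml
# minimal illustrative conf.yaml; see the PTQ template for the full schema
model:
  name: my_model                # hypothetical model name
  framework: tensorflow         # the framework backend to use

quantization:
  calibration:
    sampling_size: 100          # number of samples used for calibration

evaluation:
  accuracy:
    metric:
      topk: 1                   # use a built-in top-1 accuracy metric

tuning:
  accuracy_criterion:
    relative: 0.01              # tolerate up to 1% relative accuracy loss
  exit_policy:
    timeout: 0                  # 0: stop as soon as the criterion is met
```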
```python
# Typical launcher code
from neural_compressor.experimental import Quantization, common

# optional if a Neural Compressor built-in dataset can be used as the model input in yaml
class dataset(object):
    def __init__(self, *args):
        ...

    def __getitem__(self, idx):
        # return a single (sample, label) tuple without collate.
        # label should be 0 for the label-free case.
        ...

    def __len__(self):
        ...

# optional if a Neural Compressor built-in metric can be used for accuracy evaluation on the model output in yaml
class custom_metric(object):
    def __init__(self):
        ...

    def update(self, predict, label):
        # metric update per mini-batch
        ...

    def result(self):
        # final metric calculation, invoked only once after all mini-batches are evaluated.
        # return a scalar to neural_compressor for accuracy-driven tuning.
        # by default the scalar is higher-is-better; if not, set
        # tuning.accuracy_criterion.higher_is_better to false in yaml.
        ...

quantizer = Quantization('conf.yaml')
quantizer.model = '/path/to/model'
# the below two lines are optional if a Neural Compressor built-in dataset is used as the model calibration input in yaml
cal_dl = dataset('/path/to/calibration/dataset')
quantizer.calib_dataloader = common.DataLoader(cal_dl, batch_size=32)
# the below two lines are optional if a Neural Compressor built-in dataset is used as the model evaluation input in yaml
dl = dataset('/path/to/evaluation/dataset')
quantizer.eval_dataloader = common.DataLoader(dl, batch_size=32)
# optional if a Neural Compressor built-in metric can be used for accuracy evaluation in yaml
quantizer.metric = common.Metric(custom_metric)
q_model = quantizer.fit()
q_model.save('/path/to/output/dir')
```

The `model` attribute in the `Quantization` class is an abstraction of model formats across different frameworks. Neural Compressor supports passing the path of a `keras model`, `frozen pb`, `checkpoint`, `saved model`, `torch.nn.Module`, `mxnet.symbol.Symbol`, `gluon.HybridBlock`, or `onnx model` to instantiate a `neural_compressor.experimental.common.Model` class and set it to `quantizer.model`.

The `calib_dataloader` and `eval_dataloader` attributes in the `Quantization` class are used to set up the calibration and evaluation dataloaders by code. Setting them is optional if the user sets the corresponding fields in yaml.

The `metric` attribute in the `Quantization` class is used to set up a custom metric by code. Setting it is optional if a Neural Compressor built-in metric can be used with the model and the user sets the corresponding fields in yaml.

The `postprocess` attribute in the `Quantization` class is not necessary in most use cases. It is only needed when the user wants to use a built-in metric but the model output cannot be directly handled by Neural Compressor built-in metrics. In this case, the user can register a transform to convert the model output to the form expected by the built-in metric.

The `q_func` attribute in the `Quantization` class is only for the `Quantization Aware Training` case, in which the user needs to register a function that takes `model` as the input parameter and executes the entire training process with self-contained training hyper-parameters.

The `eval_func` attribute in the `Quantization` class is reserved for special cases. If the user already has an evaluation function from training a model, the user must implement a `calib_dataloader` and leave `eval_dataloader` as None, then modify that evaluation function to take `model` as the input parameter and return a higher-is-better scalar. In some scenarios, this can reduce development effort.

### Pruning-related APIs (POC)

```python
class Pruning(object):
    def __init__(self, conf_fname_or_obj):
        ...

    def on_epoch_begin(self, epoch):
        ...

    def on_step_begin(self, batch_id):
        ...

    def on_step_end(self):
        ...

    def on_epoch_end(self):
        ...

    def __call__(self):
        ...

    @property
    def model(self):
        ...

    @property
    def q_func(self):
        ...
```

This API is used to do sparsity pruning. Currently, it is a Proof of Concept; Neural Compressor only supports `magnitude pruning` on PyTorch.

To learn how to use this API, refer to the [pruning document](../docs/pruning.md). The `on_*` hooks are meant to be called from inside the user's training loop, as sketched below.
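The following is a schematic sketch, not an excerpt from the pruning document, of how those hooks might be wired into a PyTorch training loop. The tiny model, loss, optimizer, synthetic data, and yaml file name are all hypothetical placeholders.

```python
import torch
from neural_compressor.experimental import Pruning, common

# hypothetical placeholders: a tiny model, loss, optimizer, and synthetic data
model = torch.nn.Linear(10, 2)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
train_loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(4)]
num_epochs = 1

prune = Pruning('prune_conf.yaml')  # hypothetical path to a user pruning yaml

def training_func(model):
    # user-supplied training loop; the on_* hooks tell Neural Compressor
    # when to recompute and apply the sparsity masks
    for epoch in range(num_epochs):
        prune.on_epoch_begin(epoch)
        for batch_id, (inputs, labels) in enumerate(train_loader):
            prune.on_step_begin(batch_id)
            loss = criterion(model(inputs), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            prune.on_step_end()
        prune.on_epoch_end()

prune.model = common.Model(model)
prune.q_func = training_func  # per the class listing above, training is registered via q_func
pruned_model = prune()        # __call__ drives the whole pruning flow
```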
### Benchmarking-related APIs

```python
class Benchmark(object):
    def __init__(self, conf_fname_or_obj):
        ...

    def __call__(self):
        ...

    @property
    def model(self):
        ...

    @property
    def metric(self):
        ...

    @property
    def b_dataloader(self):
        ...

    @property
    def postprocess(self):
        ...
```

This API is used to measure model performance and accuracy. To learn how to use this API, refer to the [benchmarking document](../docs/benchmark.md).

## Default user-facing APIs

The default user-facing APIs exist for backwards compatibility from the v1.0 release. Refer to [v1.1 API](https://github.com/intel/neural-compressor/blob/v1.1/docs/introduction.md) to understand how the default user-facing APIs work.

View the [HelloWorld example](/examples/helloworld/tf_example6) that uses default user-facing APIs for user reference.

Full examples using default user-facing APIs can be found [here](https://github.com/intel/neural-compressor/tree/v1.1/examples).
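To illustrate the calling-style difference described earlier, below is a rough sketch of the default style, where the model is passed as a function input rather than set as an attribute. It is based on the v1.1 introduction linked above; treat the exact call signature as illustrative, not authoritative.

```python
# Default (pre-experimental) calling style: the model (and, optionally,
# dataloaders or functions) is passed into a single call rather than set as
# attributes beforehand. Keyword names follow the v1.1 docs but are illustrative.
from neural_compressor import Quantization

quantizer = Quantization('conf.yaml')  # same yaml-driven configuration
# dataset/metric fields are assumed to be fully specified in conf.yaml here;
# otherwise objects such as q_dataloader or eval_func would be passed as inputs
q_model = quantizer('/path/to/model')
```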