Features

Easy-to-use Python API

Intel® Extension for PyTorch* provides simple frontend Python APIs and utilities that deliver performance optimizations, such as graph optimization and operator optimization, with only two or three lines added to your original code.

Check the API Documentation for details of API functions. Examples are also available.

Note

The package name used when you import Intel® Extension for PyTorch* changed from intel_pytorch_extension (for versions 1.2.0 through 1.9.0) to intel_extension_for_pytorch (for versions 1.10.0 and later). Use the correct package name depending on the version you are using.
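The version rule in the note above can be sketched as a small helper. This is a minimal illustration, not part of the extension: the function names ipex_package_name and import_ipex are hypothetical, and the version string is assumed to look like "1.9.0" or "2.0.0+cpu".

```python
import importlib


def ipex_package_name(version: str) -> str:
    """Return the import name documented for a given release string."""
    # Drop a local suffix such as "+cpu", then compare (major, minor)
    # numerically so that "1.10.0" sorts after "1.9.0".
    major, minor = (int(part) for part in version.split("+")[0].split(".")[:2])
    if (major, minor) >= (1, 10):
        return "intel_extension_for_pytorch"
    return "intel_pytorch_extension"


def import_ipex(version: str):
    """Import the extension under the name that matches `version`."""
    return importlib.import_module(ipex_package_name(version))
```

Comparing integers rather than strings matters here, because a plain string comparison would place "1.10.0" before "1.9.0".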

Here are detailed discussions of specific feature topics, summarized in the rest of this document:

torch.compile (Experimental, NEW feature from 2.0.0)

PyTorch* 2.0 introduces a new feature, torch.compile, to speed up PyTorch* code. It makes PyTorch code run faster by JIT-compiling it into optimized kernels while requiring minimal code changes. Intel® Extension for PyTorch* provides a backend, ipex, for torch.compile that optimizes generation of the graph model.

To use it, import Intel® Extension for PyTorch* and set the backend parameter of torch.compile to ipex. Because optimizations with torch.compile apply to the backend, calling the ipex.optimize function as well is highly recommended so that frontend optimizations are also applied.

import torch
import intel_extension_for_pytorch as ipex
...
model = ipex.optimize(model)
model = torch.compile(model, backend='ipex')

ISA Dynamic Dispatching

Intel® Extension for PyTorch* features dynamic dispatching functionality to automatically adapt execution binaries to the most advanced instruction set available on your machine.

For more detailed information, check ISA Dynamic Dispatching.

Auto Channels Last

Compared with the default NCHW memory format, the channels_last (NHWC) memory format can further accelerate convolutional neural networks. In Intel® Extension for PyTorch*, the NHWC memory format is enabled for most key CPU operators. More detailed information is available at Channels Last.

Intel® Extension for PyTorch* automatically converts a model to the channels last memory format when users optimize it with ipex.optimize(model). With this feature, users no longer need to manually apply model = model.to(memory_format=torch.channels_last). More detailed information is available at Auto Channels Last.
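The difference between the two memory formats comes down to element strides. The sketch below computes them in plain Python for a logical (N, C, H, W) tensor. The helper names are hypothetical and the code does not call PyTorch, but the results correspond to what tensor.stride() reports for a contiguous tensor and for one converted to torch.channels_last.

```python
def contiguous_strides(n, c, h, w):
    """Element strides of an (N, C, H, W) tensor stored in NCHW order."""
    # n is unused: the batch stride depends only on the size of one sample.
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n, c, h, w):
    """Element strides of the same logical tensor stored in NHWC order."""
    # The channel dimension becomes the fastest-moving one (stride 1).
    return (h * w * c, 1, w * c, c)


shape = (2, 3, 4, 5)
nchw = contiguous_strides(*shape)
nhwc = channels_last_strides(*shape)
```

In the NHWC layout all channels of one pixel sit next to each other in memory, which is the property convolution kernels can exploit.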

Auto Mixed Precision (AMP)

The low-precision data type BFloat16 is natively supported on 3rd Generation Intel® Xeon® Scalable Processors (aka Cooper Lake) with the AVX512 instruction set. It will also be supported on the next generation of Intel® Xeon® Scalable Processors with the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set, which boosts performance further. Support for Auto Mixed Precision (AMP) with BFloat16 on CPU, together with BFloat16 optimization of operators, is enabled in Intel® Extension for PyTorch* and partially upstreamed to the PyTorch master branch. These optimizations will land in PyTorch master through PRs that are being submitted and reviewed.

For more detailed information, check Auto Mixed Precision (AMP).

BFloat16 computation can run on platforms with the AVX512 instruction set. Platforms that also provide the AVX512 BFloat16 instruction get an additional performance boost.
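To make the precision trade-off of BFloat16 concrete, the following sketch rounds a float32 value to the nearest BFloat16 value using only the standard library. BFloat16 keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits of float32. The function name is hypothetical, and the code illustrates the number format only; it is not how the extension or AMP performs the conversion.

```python
import struct


def round_to_bfloat16(value: float) -> float:
    """Round a float32 value to the nearest BFloat16 value (ties to even).

    NaN inputs are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Add a rounding bias to the 16 low bits that are about to be dropped,
    # then clear them. The mask also discards any carry past 32 bits.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


rounded = round_to_bfloat16(3.14159)
```

The range of float32 is preserved because the exponent field is untouched; only mantissa precision is lost.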

Graph Optimization

To further optimize TorchScript performance, Intel® Extension for PyTorch* supports transparent fusion of frequently used operator patterns such as Conv2D+ReLU and Linear+ReLU. For more detailed information, check Graph Optimization.

Compared to eager mode, graph mode in PyTorch normally yields better performance from optimization methodologies such as operator fusion. Intel® Extension for PyTorch* provides further optimizations in graph mode. We recommend you take advantage of Intel® Extension for PyTorch* with TorchScript. You may wish to run with the torch.jit.trace() function first, since it generally works better with Intel® Extension for PyTorch* than using the torch.jit.script() function. More detailed information can be found at the pytorch.org website.
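As a toy illustration of why operator fusion helps, the sketch below computes Linear+ReLU twice in plain Python: once as two separate passes with an intermediate result, and once as a single fused pass. All names are hypothetical. The code shows only the idea behind fusing a pattern such as Linear+ReLU and is not the extension's implementation.

```python
def linear(x, weight, bias):
    """y = W x + b for one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def linear_relu_unfused(x, weight, bias):
    hidden = linear(x, weight, bias)  # intermediate result is materialized
    return relu(hidden)               # second pass over the data


def linear_relu_fused(x, weight, bias):
    # ReLU is applied while each output element is produced, so no
    # intermediate vector is written and re-read.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


x = [1.0, -2.0]
weight = [[1.0, 1.0], [2.0, 0.5]]
bias = [0.5, 0.0]
```

Both functions return the same values; the fused version simply avoids one round trip through memory.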

Operator Optimization

Intel® Extension for PyTorch* also optimizes operators and implements several customized operators for performance boosts. A few ATen operators are replaced by their optimized counterparts in Intel® Extension for PyTorch* via the ATen registration mechanism. Some customized operators are implemented for several popular topologies. For instance, ROIAlign and NMS are defined in Mask R-CNN. To improve performance of these topologies, Intel® Extension for PyTorch* also optimized these customized operators.

class ipex.nn.FrozenBatchNorm2d(num_features: int, eps: float = 1e-05)

BatchNorm2d where the batch statistics and the affine parameters are fixed

Parameters: num_features (int) – $C$ from an expected input of size $(N, C, H, W)$

Shape

Input: $(N, C, H, W)$
Output: $(N, C, H, W)$ (same shape as input)
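Because the statistics and affine parameters are fixed, the normalization of each channel reduces to one multiply and one add. The sketch below shows this for a single channel in plain Python. It assumes the standard batch normalization inference formula; the parameter names weight, bias, running_mean, and running_var are borrowed from torch.nn.BatchNorm2d and are not taken from this API's documentation.

```python
import math


def frozen_batch_norm_channel(values, weight, bias, running_mean,
                              running_var, eps=1e-5):
    """Apply frozen batch norm to all values of one channel."""
    # With fixed statistics, scale and shift are constants that can be
    # computed once and reused for every input.
    scale = weight / math.sqrt(running_var + eps)
    shift = bias - running_mean * scale
    return [v * scale + shift for v in values]


out = frozen_batch_norm_channel([5.0, 3.0], weight=2.0, bias=1.0,
                                running_mean=3.0, running_var=4.0, eps=0.0)
```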

ipex.nn.functional.interaction(*args)

Get the interaction feature between different kinds of features (such as gender or hobbies), used in the DLRM model.

For now, only the “dot” interaction from the DLRM GitHub repo is optimized. It uses the dot product to represent the interaction between two features.

For example, suppose feature 1 is “Man”, represented by [0.1, 0.2, 0.3], and feature 2 is “Likes playing football”, represented by [-0.1, 0.3, 0.2].

The dot interaction feature is ([0.1, 0.2, 0.3] * [-0.1, 0.3, 0.2]^T) = -0.01 + 0.06 + 0.06 = 0.11

Parameters: *args – Multiple tensors which represent different features

Shape

Input: $N * (B, D)$, where N is the number of different kinds of features, B is the batch size, and D is the feature size
Output: $(B, D + N * (N - 1) / 2)$
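The output shape can be reproduced with a plain-Python sketch of the dot interaction. It assumes, as in the DLRM reference model, that the first feature is passed through unchanged and followed by the dot product of every distinct pair of features, which yields D + N * (N - 1) / 2 values per sample. The function names are hypothetical, and the ordering of the pairwise terms is an assumption.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dot_interaction(features):
    """features: list of N batches, each a list of B vectors of length D."""
    num_features = len(features)
    batch_size = len(features[0])
    outputs = []
    for b in range(batch_size):
        row = list(features[0][b])        # first feature, copied as-is (D values)
        for i in range(num_features):
            for j in range(i):            # every distinct pair (i, j) with j < i
                row.append(dot(features[i][b], features[j][b]))
        outputs.append(row)
    return outputs


# The two example features from the text, with batch size B = 1 and D = 3.
man = [[0.1, 0.2, 0.3]]
football = [[-0.1, 0.3, 0.2]]
out = dot_interaction([man, football])
```

With N = 2 and D = 3, each output row has 3 + 1 = 4 values, and the last one is the dot product of the two example vectors.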

class ipex.nn.modules.MergedEmbeddingBag(embedding_specs: List[EmbeddingSpec])

Merge multiple PyTorch EmbeddingBag objects into a single torch.nn.Module object.

At the current stage:

MergedEmbeddingBag assumes it is constructed from nn.EmbeddingBag objects with sparse=False, and it returns dense gradients.

MergedEmbeddingBagWithSGD does not return gradients; the backward step and the weight update step are fused.

Native usage of multiple EmbeddingBag objects is:

>>> EmbLists = torch.nn.ModuleList([emb1, emb2, emb3, ..., emb_m])
>>> inputs = [in1, in2, in3, ..., in_m]
>>> outputs = []
>>> for i in range(len(EmbLists)):
...     outputs.append(EmbLists[i](inputs[i]))

The optimized path is:

>>> EmbLists = torch.nn.ModuleList([emb1, emb2, emb3, ..., emb_m])
>>> merged_emb = MergedEmbeddingBagWithSGD.from_embeddingbag_list(EmbLists)
>>> outputs = merged_emb(inputs)

Computation benefits from the optimized path:

1). PyTorch OP dispatching overhead is minimized. If the EmbeddingBag operations are not heavy, this dispatching overhead has a large impact.

2). Parallelization over multiple embedding tables is merged into parallelization over a single merged embedding table. This can help scenarios with low parallelization efficiency, where the amount of data read from each embedding table is not large enough.

A linearize_indices_and_offsets step is introduced to merge indices/offsets together. Since EmbeddingBag objects are usually the first layer of a model, the linearize_indices_and_offsets step can be considered “data preprocessing” and can be done offline. See the usage of linearize_indices_and_offsets in MergedEmbeddingBagWithSGD.
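Conceptually, merging indices and offsets means rebasing each table's row indices into one combined table and rebasing each table's bag offsets into one combined index list. The plain-Python sketch below illustrates that idea only. The function name linearize, its arguments, and the output layout are hypothetical and may differ from what linearize_indices_and_offsets actually produces.

```python
def linearize(indices_per_table, offsets_per_table, rows_per_table):
    """Merge per-table (indices, offsets) pairs into one pair of flat lists."""
    merged_indices, merged_offsets = [], []
    row_base = 0  # first row of the current table inside the merged table
    for indices, offsets, rows in zip(indices_per_table, offsets_per_table,
                                      rows_per_table):
        index_base = len(merged_indices)  # where this table's indices start
        merged_offsets.extend(index_base + o for o in offsets)
        merged_indices.extend(row_base + i for i in indices)
        row_base += rows
    return merged_indices, merged_offsets


# Table 0 has 4 rows and two bags: [0, 3] and [1].
# Table 1 has 5 rows and two bags: [2] and [4].
merged = linearize([[0, 3, 1], [2, 4]], [[0, 2], [0, 1]], [4, 5])
```

Because this transformation depends only on the input indices and table sizes, it can run as part of data loading, before the model is invoked.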

Currently, MergedEmbeddingBagWithSGD is the only option that runs with an optimizer. We plan to add support for more optimizers in the future. Visit MergedEmbeddingBagWithSGD for an introduction to MergedEmbeddingBagWith[Optimizer].

class ipex.nn.modules.MergedEmbeddingBagWithSGD(embedding_specs: List[EmbeddingSpec], lr: float = 0.01, weight_decay: float = 0)

To support training with MergedEmbeddingBag with good performance, the optimizer step is fused with the backward function.

Native usage for multiple EmbeddingBag is:

>>> EmbLists = torch.nn.ModuleList([emb1, emb2, emb3, ..., emb_m])
>>> sgd = torch.optim.SGD(EmbLists.parameters(), lr=lr, weight_decay=weight_decay)
>>> inputs = [in1, in2, in3, ..., in_m]
>>> outputs = []
>>> for i in range(len(EmbLists)):
...     outputs.append(EmbLists[i](inputs[i]))
>>> sgd.zero_grad()
>>> for i in range(len(outputs)):
...     outputs[i].backward(grads[i])
>>> sgd.step()

The optimized path is:

>>> # create MergedEmbeddingBagWithSGD module with optimizer args (lr and weight decay)
>>> EmbLists = torch.nn.ModuleList([emb1, emb2, emb3, ..., emb_m])
>>> merged_emb = MergedEmbeddingBagWithSGD.from_embeddingbag_list(EmbLists, lr=lr, weight_decay=weight_decay)
>>> # if you need to train with BF16 dtype, we provide split sgd on it
>>> # merged_emb.to_bfloat16_train()
>>> merged_input = merged_emb.linearize_indices_and_offsets(inputs)
>>> outputs = merged_emb(merged_input, need_linearize_indices_and_offsets=torch.BoolTensor([False]))
>>> outputs.backward(grads)

Training benefits further from this optimization:

1). PyTorch OP dispatching overhead in the backward and weight update steps is saved.

2). Thread load becomes more balanced during backward/weight update. In real-world scenarios, EmbeddingBag objects are often used to represent categorical features, and categorical features often follow a power-law distribution. For example, if one embedding table represents the age range of a video game website's users, most of them may fall between 10-19 or 20-29, so the rows that represent 10-19 and 20-29 need to be updated frequently. Because updating such a row writes to the same memory address, it must be done by a single thread (otherwise there is a write conflict, or overhead to resolve the conflict). Merging multiple tables together is a simple way to address this potential memory write conflict.

3). The weight update is fused with the backward step. Weights can be updated immediately after the gradients are obtained from the backward step, so the memory access pattern becomes friendlier: more data accesses hit cache rather than main memory.

Auto kernel selection is a feature that enables users to tune for better performance with GEMM operations. It is provided as the boolean parameter auto_kernel_selection of the ipex.optimize() function. By default, the GEMM kernel is computed with oneMKL primitives. However, under certain circumstances oneDNN primitives run faster. Users can set auto_kernel_selection to True to run GEMM kernels with oneDNN primitives.

We aim to provide good default performance by leveraging the best of the math libraries and by enabling weights_prepack, and this has been verified with a broad set of models. If you would like to try other alternatives, you can use the auto_kernel_selection toggle in ipex.optimize to switch, and you can disable weights_prepack in ipex.optimize if memory footprint matters more to you than the performance gain. In the majority of cases, however, we recommend keeping the defaults.
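The two toggles described above can be passed to ipex.optimize as keyword arguments. The sketch below keeps the decision logic in a plain helper so that it runs without the extension installed. The helper names build_optimize_kwargs and optimize_model and their options are hypothetical; only auto_kernel_selection and weights_prepack are parameters of ipex.optimize.

```python
def build_optimize_kwargs(try_onednn_gemm=False, save_memory=False):
    """Build keyword arguments for ipex.optimize; empty means 'keep defaults'."""
    kwargs = {}
    if try_onednn_gemm:
        # Run GEMM kernels with oneDNN primitives instead of the default.
        kwargs["auto_kernel_selection"] = True
    if save_memory:
        # Trade some performance for a smaller memory footprint.
        kwargs["weights_prepack"] = False
    return kwargs


def optimize_model(model, **options):
    """Apply ipex.optimize with the selected toggles."""
    # Imported lazily so the helper above works without the extension.
    import intel_extension_for_pytorch as ipex
    return ipex.optimize(model, **build_optimize_kwargs(**options))
```

Calling optimize_model(model) with no options leaves every toggle at its default, which matches the recommendation for most workloads.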

Optimizer Optimization

Optimizers are a key part of training workloads. Intel® Extension for PyTorch* brings two types of optimizations to optimizers:

Operator fusion for the computation in the optimizers.
SplitSGD for BF16 training, which reduces the memory footprint of the master weights by half.

For more detailed information, check Optimizer Fusion and Split SGD.
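The memory saving of SplitSGD can be pictured at the bit level. A float32 weight is split into its top 16 bits, which can be read directly as a BFloat16 value, and its bottom 16 bits, which are kept only for the weight update. The sketch below is a conceptual illustration of that split with hypothetical function names, based on the general idea of Split SGD rather than on the extension's source code.

```python
import struct


def split_fp32(value):
    """Split a float32 value into (top 16 bits, bottom 16 bits)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    return bits >> 16, bits & 0xFFFF


def merge_halves(top, bottom):
    """Rebuild the float32 value from its two 16-bit halves."""
    return struct.unpack("<f", struct.pack("<I", (top << 16) | bottom))[0]


top, bottom = split_fp32(0.1)
restored = merge_halves(top, bottom)    # exact float32 value of 0.1
bf16_view = merge_halves(top, 0)        # what BF16 compute would see
```

Storing 16 extra bits per weight, instead of a separate 32-bit master copy, is what halves the master weight footprint while still allowing a full-precision update.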

Runtime Extension

Intel® Extension for PyTorch* Runtime Extension gives users finer-grained control of the thread runtime through PyTorch frontend APIs. It provides:

Multi-stream inference via the Python frontend module MultiStreamModule.
Spawning of asynchronous tasks from both the Python and C++ frontends.
Programming of core bindings for OpenMP threads from both the Python and C++ frontends.

Note

Intel® Extension for PyTorch* Runtime Extension is still in the experimental stage. The API is subject to change. More detailed descriptions are available in the API Documentation.

For more detailed information, check Runtime Extension.

INT8 Quantization

Intel® Extension for PyTorch* provides built-in quantization recipes to deliver good statistical accuracy for most popular DL workloads including CNN, NLP and recommendation models.

We recommend that users first try quantization with the built-in quantization recipe through the Intel® Extension for PyTorch* quantization APIs. For even higher accuracy requirements, users can try the separate recipe tuning APIs. These APIs are powered by Intel® Neural Compressor to take advantage of its tuning feature.

For more detailed information, check INT8 Quantization and the INT8 recipe tuning API guide (Experimental, NEW feature in 1.13.0).
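For readers new to INT8, the sketch below shows generic affine quantization in plain Python: values are divided by a scale, rounded, shifted by a zero point, and clamped to the signed 8-bit range. The function names, the scale, and the zero point are hypothetical. This is the textbook scheme only, and it does not represent the built-in recipes or the tuning APIs.

```python
def quantize(values, scale, zero_point=0):
    """Map float values to signed 8-bit integers in [-128, 127]."""
    return [max(-128, min(127, round(v / scale) + zero_point)) for v in values]


def dequantize(quantized, scale, zero_point=0):
    """Map signed 8-bit integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


q = quantize([0.5, -1.0, 100.0], scale=0.5)
restored = dequantize(q[:2], scale=0.5)
```

The value 100.0 saturates at 127, which shows why the choice of scale (the job of a quantization recipe) determines how much accuracy is retained.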

Codeless Optimization (Experimental, NEW feature from 1.13.0)

This feature enables users to get performance benefits from Intel® Extension for PyTorch* without changing Python scripts. It is intended to ease usage and has been verified to work well with a broad scope of models, though in a few cases there could be a small overhead compared with applying optimizations through the Intel® Extension for PyTorch* APIs.

For more detailed information, check Codeless Optimization.

Graph Capture (Experimental, NEW feature from 1.13.0)

Since graph mode is key for deployment performance, this feature automatically captures graphs based on a set of technologies that PyTorch supports, such as TorchScript and TorchDynamo. Users do not need to learn and try different PyTorch APIs to capture graphs. Instead, they can turn on a new boolean flag, graph_mode (default off), in ipex.optimize to get the best of graph optimization.

For more detailed information, check Graph Capture.

HyperTune (Experimental, NEW feature from 1.13.0)

HyperTune is an experimental feature for hyperparameter and execution configuration searching. Such searching is used in various areas, such as optimizing the hyperparameters of deep learning models. It is especially useful when the number of hyperparameters (including the configuration of script execution) and their search spaces are so large that tuning them manually is impractical and time consuming. HyperTune automates this process of execution configuration searching for the launcher and Intel® Extension for PyTorch*.

For more detailed information, check HyperTune.

Fast BERT Optimization (Experimental, NEW feature from 2.0.0)

Intel proposed a technique to speed up BERT workloads, and its implementation is integrated into Intel® Extension for PyTorch*. The ipex.fast_bert API is provided for simple usage.

For more detailed information, check Fast BERT.