Features

Device-Agnostic

Easy-to-use Python API

Intel® Extension for PyTorch* provides simple frontend Python APIs and utilities to get performance optimizations such as operator optimization.

Check the API Documentation for API functions description and Examples for usage guidance.

Channels Last

Compared with the default NCHW memory format, using channels_last (NHWC) memory format can further accelerate convolutional neural networks. In Intel® Extension for PyTorch*, NHWC memory format has been enabled for most key CPU and GPU operators. More detailed information is available at Channels Last.

Intel® Extension for PyTorch* automatically converts a model to channels last memory format when users optimize the model with ipex.optimize(model). With this feature, users do not need to manually apply model=model.to(memory_format=torch.channels_last) anymore. However, models running on Intel® Data Center GPU Flex Series will choose oneDNN layout, so users still need to manually convert the model and data to channels last format. More detailed information is available at Auto Channels Last.
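
A minimal sketch of the automatic conversion, assuming torchvision is available for the example model; ipex.optimize applies the channels last conversion, so the manual .to(memory_format=...) call is no longer required:

    import torch
    import torchvision.models as models
    import intel_extension_for_pytorch as ipex

    model = models.resnet50().eval()
    # ipex.optimize converts the model to channels_last automatically
    model = ipex.optimize(model)

    data = torch.rand(1, 3, 224, 224)
    with torch.no_grad():
        output = model(data)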

Auto Mixed Precision (AMP)

Benefiting from less memory usage and computation, low precision data types typically speed up both training and inference workloads. Furthermore, accelerated by Intel® native hardware instructions, including Intel® Deep Learning Boost (Intel® DL Boost) on 3rd Generation Intel® Xeon® Scalable Processors (aka Cooper Lake), as well as the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set on 4th Generation Intel® Xeon® Scalable Processors (aka Sapphire Rapids), the low precision data types bfloat16 and float16 provide further boosted performance. We recommend using AMP for accelerating convolutional and matmul based neural networks.

Support for Auto Mixed Precision (AMP) with BFloat16 on CPU and BFloat16 optimization of operators has been enabled in Intel® Extension for PyTorch* and partially upstreamed to the PyTorch master branch. The remaining optimizations will land in PyTorch master through PRs that are being submitted and reviewed. On the GPU side, both BFloat16 and Float16 are supported in Intel® Extension for PyTorch*; BFloat16 is the default low precision floating point data type when AMP is enabled.
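
A minimal BFloat16 AMP sketch for CPU inference, using the standard torch.cpu.amp.autocast context together with ipex.optimize and dtype=torch.bfloat16:

    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Linear(64, 64).eval()
    model = ipex.optimize(model, dtype=torch.bfloat16)

    # ops run in bfloat16 where the autocast policy considers it safe
    with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
        output = model(torch.randn(8, 64))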

Detailed information on AMP for GPU and CPU is available at Auto Mixed Precision (AMP) on GPU and Auto Mixed Precision (AMP) on CPU, respectively.

Quantization

Intel® Extension for PyTorch* provides built-in INT8 quantization recipes that deliver good statistical accuracy for most popular DL workloads, including CNN, NLP, and recommendation models, on the CPU side. On top of that, if users would like to tune for higher accuracy than the default recipe provides, a recipe tuning API powered by Intel® Neural Compressor is provided for users to try.
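
A minimal static INT8 sketch, assuming the prepare/convert helpers and default_static_qconfig described in the INT8 Quantization [CPU] documentation:

    import torch
    import intel_extension_for_pytorch as ipex
    from intel_extension_for_pytorch.quantization import prepare, convert

    model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU()).eval()
    example_input = torch.randn(8, 64)

    qconfig = ipex.quantization.default_static_qconfig
    prepared = prepare(model, qconfig, example_inputs=example_input, inplace=False)
    prepared(example_input)                              # calibration pass(es)
    quantized = convert(prepared)
    traced = torch.jit.freeze(torch.jit.trace(quantized, example_input))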

On the CPU side, check more detailed information for INT8 Quantization [CPU] and the INT8 recipe tuning API guide (Experimental, *NEW feature in 1.13.0*).

Check more detailed information for INT8 Quantization [XPU].

On Intel® GPUs, Intel® Extension for PyTorch* also provides INT4 and FP8 quantization. Check more detailed information for FP8 Quantization and INT4 Quantization.

Distributed Training

To meet the demands of large-scale model training over multiple devices, distributed training on Intel® GPUs and CPUs is supported. Two alternative methodologies are available: users can either use the PyTorch native distributed training module, Distributed Data Parallel (DDP), with Intel® oneAPI Collective Communications Library (oneCCL) support via Intel® oneCCL Bindings for PyTorch (formerly known as torch_ccl), or use Horovod with Intel® oneAPI Collective Communications Library (oneCCL) support (Experimental).
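
A minimal DDP sketch using the oneCCL backend, assuming Intel® oneCCL Bindings for PyTorch is installed and that rank, world size, and master address are supplied by the launcher environment:

    import torch
    import torch.distributed as dist
    import intel_extension_for_pytorch as ipex
    import oneccl_bindings_for_pytorch  # registers the "ccl" backend
    from torch.nn.parallel import DistributedDataParallel as DDP

    # MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE come from the launcher
    dist.init_process_group(backend="ccl")

    model = torch.nn.Linear(64, 64)
    model = DDP(model)  # move the model to "xpu" first when training on Intel GPUs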

For more detailed information, check DDP and Horovod (Experimental).

GPU-Specific

DLPack Solution

DLPack defines a stable in-memory data structure for sharing tensors among frameworks. It enables tensor data to be shared without copying when interoperating with other libraries. Intel® Extension for PyTorch* extends DLPack support in PyTorch*, particularly for the XPU device.
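
A small sketch of zero-copy exchange through the standard torch.utils.dlpack helpers, assuming an XPU tensor:

    import torch
    import intel_extension_for_pytorch as ipex
    from torch.utils.dlpack import from_dlpack, to_dlpack

    t = torch.arange(4, device="xpu")
    capsule = to_dlpack(t)      # export as a DLPack capsule, no copy
    t2 = from_dlpack(capsule)   # re-import; shares the same XPU memory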

For more detailed information, check DLPack Solution.

DPC++ Extension

Intel® Extension for PyTorch* provides C++ APIs to get SYCL queue and configure floating-point math mode.

Check the API Documentation for the details of API functions. DPC++ Extension describes how to write customized DPC++ kernels with a practical example and build it with setuptools and CMake.

Advanced Configuration

The default settings for Intel® Extension for PyTorch* are sufficient for most use cases. However, if you need to customize Intel® Extension for PyTorch*, advanced configuration is available at build time and runtime.

For more detailed information, check Advanced Configuration.

A driver environment variable ZE_FLAT_DEVICE_HIERARCHY is currently used to select the device hierarchy model with which the underlying hardware is exposed. By default, each GPU tile is used as a device. Check the Level Zero Specification Documentation for more details.

Fully Sharded Data Parallel (FSDP)

Fully Sharded Data Parallel (FSDP) is a PyTorch* module that provides an industry-grade solution for large model training. FSDP is a type of data parallel training. Unlike DDP, where each process/worker maintains a replica of the model, FSDP shards model parameters, optimizer states, and gradients across DDP ranks to reduce the GPU memory footprint used in training. This makes the training of some large-scale models feasible.
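
A minimal FSDP sketch on XPU devices, assuming the "ccl" backend from Intel® oneCCL Bindings for PyTorch and launcher-provided rank/world size; passing device_id mirrors the CUDA usage and is an assumption here:

    import torch
    import torch.distributed as dist
    import intel_extension_for_pytorch as ipex
    import oneccl_bindings_for_pytorch  # registers the "ccl" backend
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group(backend="ccl")
    rank = dist.get_rank()
    torch.xpu.set_device(rank)

    model = torch.nn.Linear(1024, 1024).to(f"xpu:{rank}")
    # parameters, gradients and optimizer states are sharded across ranks
    model = FSDP(model, device_id=torch.device(f"xpu:{rank}"))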

For more detailed information, check FSDP.

Inductor

Intel® Extension for PyTorch* now empowers users to seamlessly harness graph compilation capabilities for optimal PyTorch model performance on Intel GPUs via the flagship torch.compile API through the default “inductor” backend (TorchInductor).
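
A minimal sketch of compiling a model for the XPU device; since "inductor" is the default backend, passing it explicitly is optional:

    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Linear(64, 64).to("xpu").eval()
    compiled = torch.compile(model, backend="inductor")  # "inductor" is the default
    with torch.no_grad():
        out = compiled(torch.randn(8, 64, device="xpu"))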

For more detailed information, check Inductor.

Legacy Profiler Tool (Experimental)

The legacy profiler tool is an extension of the PyTorch* legacy profiler for profiling operators’ overhead on XPU devices. With this tool, you can get information in many fields about the models or code scripts you run. Profiler support is built into Intel® Extension for PyTorch* by default; enable this tool by wrapping the code segment in a with statement.
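
A minimal sketch of the with statement mentioned above; the use_xpu switch is assumed to be the flag added by the extension’s profiler support:

    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Linear(64, 64).to("xpu")
    x = torch.randn(8, 64, device="xpu")

    # use_xpu is assumed to enable XPU operator profiling
    with torch.autograd.profiler_legacy.profile(use_xpu=True) as prof:
        model(x)
    print(prof.key_averages().table())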

For more detailed information, check Legacy Profiler Tool.

Simple Trace Tool (Experimental)

Simple Trace is a built-in debugging tool that lets you control printing out the call stack for a piece of code. Once enabled, it can automatically print out verbose messages of called operators in a stack format with indenting to distinguish the context.

For more detailed information, check Simple Trace Tool.

Kineto Supported Profiler Tool (Experimental)

The Kineto supported profiler tool is an extension of the PyTorch* profiler for profiling operators’ execution time on GPU devices. With this tool, you can get information in many fields about the models or code scripts you run. Kineto support is built into Intel® Extension for PyTorch* by default; enable this tool by wrapping the code segment in a with statement.
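
A minimal sketch, assuming a Kineto-enabled build that exposes torch.profiler.ProfilerActivity.XPU:

    import torch
    import intel_extension_for_pytorch as ipex
    from torch.profiler import ProfilerActivity, profile

    model = torch.nn.Linear(64, 64).to("xpu")
    x = torch.randn(8, 64, device="xpu")

    # ProfilerActivity.XPU is assumed to be available in this build
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
        model(x)
    print(prof.key_averages().table())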

For more detailed information, check Profiler Kineto.

Compute Engine (Experimental feature for debug)

Compute engine is an experimental feature that provides the ability to choose a specific backend for operators that have multiple implementations.

For more detailed information, check Compute Engine.

CPU-Specific

Operator Optimization

Intel® Extension for PyTorch* also optimizes operators and implements several customized operators for performance boosts. A few ATen operators are replaced by their optimized counterparts in Intel® Extension for PyTorch* via the ATen registration mechanism. Some customized operators are implemented for several popular topologies; for instance, ROIAlign and NMS are defined in Mask R-CNN. To improve the performance of these topologies, Intel® Extension for PyTorch* also optimizes these customized operators.

class intel_extension_for_pytorch.nn.FrozenBatchNorm2d(num_features: int, eps: float = 1e-05)

BatchNorm2d where the batch statistics and the affine parameters are fixed

Parameters:

num_features (int) – \(C\) from an expected input of size \((N, C, H, W)\)

Shape
  • Input: \((N, C, H, W)\)

  • Output: \((N, C, H, W)\) (same shape as input)

intel_extension_for_pytorch.nn.functional.interaction(*args)

Get the interaction feature between different kinds of features (like gender or hobbies), as used in the DLRM model.

For now, we only optimize the “dot” interaction described in the DLRM GitHub repo, where the dot product is used to represent the interaction feature between two features.

For example, if feature 1 is “Man” which is represented by [0.1, 0.2, 0.3], and feature 2 is “Like play football” which is represented by [-0.1, 0.3, 0.2].

The dot interaction feature is ([0.1, 0.2, 0.3] * [-0.1, 0.3, 0.2]^T) = -0.01 + 0.06 + 0.06 = 0.11

Parameters:

*args – Multiple tensors which represent different features

Shape
  • Input: \(N * (B, D)\), where N is the number of different kinds of features,

    B is the batch size, D is feature size

  • Output: \((B, D + N * ( N - 1 ) / 2)\)
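
A small usage sketch of the two APIs documented above:

    import torch
    from intel_extension_for_pytorch.nn import FrozenBatchNorm2d
    from intel_extension_for_pytorch.nn.functional import interaction

    bn = FrozenBatchNorm2d(64)
    y = bn(torch.randn(8, 64, 14, 14))   # output keeps the (N, C, H, W) shape

    # three features of shape (B, D) = (4, 16) -> output shape (4, 16 + 3 * 2 / 2) = (4, 19)
    feats = [torch.randn(4, 16) for _ in range(3)]
    out = interaction(*feats)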

Auto kernel selection is a feature that enables users to tune for better performance with GEMM operations. We aim to provide good default performance by leveraging the best of math libraries and enabling weights_prepack, and the defaults have been verified with a broad set of models. By default, GEMM kernels are computed with oneMKL primitives; under certain circumstances oneDNN primitives run faster, and setting auto_kernel_selection to True in ipex.optimize() runs GEMM kernels with oneDNN primitives instead. You can also disable weights_prepack in ipex.optimize() if you are more concerned about the memory footprint than the performance gain. In the majority of cases, however, keeping the defaults is what we recommend.
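
A sketch of the two toggles, assuming the auto_kernel_selection and weights_prepack keyword arguments of ipex.optimize:

    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Linear(1024, 1024).eval()

    # default: oneMKL-backed GEMM with weight prepacking enabled
    optimized = ipex.optimize(model)

    # alternative: prefer oneDNN GEMM kernels and skip weight prepacking
    optimized = ipex.optimize(model, auto_kernel_selection=True, weights_prepack=False)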

Runtime Extension

Intel® Extension for PyTorch* Runtime Extension provides PyTorch frontend APIs for finer-grained control of the thread runtime. It provides the following capabilities, with a multi-stream inference sketch after the list:

  • Multi-stream inference via the Python frontend module MultiStreamModule.

  • Spawn asynchronous tasks from both Python and C++ frontend.

  • Program core bindings for OpenMP threads from both Python and C++ frontend.
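
A minimal multi-stream inference sketch, assuming the CPUPool and MultiStreamModule names shown in the Runtime Extension documentation:

    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Linear(64, 64).eval()
    traced = torch.jit.freeze(torch.jit.trace(model, torch.randn(32, 64)))

    # bind two streams to the cores of NUMA node 0
    cpu_pool = ipex.cpu.runtime.CPUPool(node_id=0)
    multi_stream_model = ipex.cpu.runtime.MultiStreamModule(traced, num_streams=2, cpu_pool=cpu_pool)

    y = multi_stream_model(torch.randn(32, 64))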

Note

Intel® Extension for PyTorch* Runtime extension is still in the experimental stage. The API is subject to change. More detailed descriptions are available in the API Documentation.

For more detailed information, check Runtime Extension.

Codeless Optimization (Experimental, NEW feature in 1.13.*)

This feature enables users to get performance benefits from Intel® Extension for PyTorch* without changing Python scripts. It aims to ease usage and has been verified to work well with a broad scope of models, though in a few cases there could be a small overhead compared to applying optimizations with Intel® Extension for PyTorch* APIs.

For more detailed information, check Codeless Optimization.

Graph Capture (Experimental, NEW feature in 1.13.0*)

Since graph mode is key for deployment performance, this feature automatically captures graphs based on a set of technologies that PyTorch supports, such as TorchScript and TorchDynamo. Users won’t need to learn and try different PyTorch APIs to capture graphs; instead, they can turn on a new boolean flag graph_mode (default off) in ipex.optimize to get the best of graph optimization.
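
A minimal sketch, assuming graph_mode is exposed as a keyword argument of ipex.optimize:

    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Linear(64, 64).eval()
    # let the extension pick a graph capture technology (TorchScript, TorchDynamo) automatically
    model = ipex.optimize(model, graph_mode=True)
    with torch.no_grad():
        out = model(torch.randn(8, 64))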

For more detailed information, check Graph Capture.

HyperTune (Experimental, NEW feature in 1.13.0*)

HyperTune is an experimental feature to perform hyperparameter/execution configuration searching. The searching is used in various areas such as optimization of hyperparameters of deep learning models. It is extremely useful in real situations when the number of hyperparameters, including the configuration of script execution, and their search spaces are so large that manually tuning them is impractical and time consuming. HyperTune automates this process of execution configuration searching for the launcher and Intel® Extension for PyTorch*.

For more detailed information, check HyperTune.