Technical Details

ISA Dynamic Dispatching [CPU]

Intel® Extension for PyTorch* features dynamic dispatching, which automatically adapts execution binaries to the most advanced instruction set architecture (ISA) available on your machine.

For more detailed information, check ISA Dynamic Dispatching.
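
As a quick sanity check, you can query which ISA level the dispatcher selected. The sketch below relies on _get_current_isa_level() in the extension's _C module; this is a private, internal helper and may change between releases.

    # Query the ISA level chosen by the dynamic dispatcher.
    # Note: _get_current_isa_level() is a private helper and may change
    # between releases of Intel® Extension for PyTorch*.
    import intel_extension_for_pytorch._C as core

    print(core._get_current_isa_level())  # e.g. "AVX512"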

Graph Optimization [CPU]

To further optimize TorchScript performance, Intel® Extension for PyTorch* supports transparent fusion of frequently used operator patterns such as Conv2D+ReLU and Linear+ReLU. For more detailed information, check Graph Optimization.

Compared to eager mode, graph mode in PyTorch normally yields better performance from optimization methodologies such as operator fusion. Intel® Extension for PyTorch* provides further optimizations in graph mode. We recommend using Intel® Extension for PyTorch* with TorchScript. Try the torch.jit.trace() function first, since it generally works better with Intel® Extension for PyTorch* than the torch.jit.script() function. More detailed information can be found at the pytorch.org website.
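
As an illustration of the recommended flow, here is a minimal sketch (the toy model and input shape are placeholders) that applies ipex.optimize() and then captures the model with torch.jit.trace(), letting the fusion passes run on the frozen graph:

    import torch
    import intel_extension_for_pytorch as ipex

    # Toy Conv2D+ReLU model: exactly the kind of pattern that graph
    # optimization can fuse into a single operator.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, kernel_size=3),
        torch.nn.ReLU(),
    ).eval()
    model = ipex.optimize(model)

    with torch.no_grad():
        example_input = torch.rand(1, 3, 224, 224)
        traced = torch.jit.trace(model, example_input)
        traced = torch.jit.freeze(traced)  # freeze so fusion passes can apply
        output = traced(example_input)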

Optimizer Optimization [CPU, GPU]

Optimizers are a key part of training workloads. Intel® Extension for PyTorch* brings two types of optimizations to optimizers, applied as shown in the sketch below:

  1. Operator fusion for the computation in the optimizers. [CPU, GPU]

  2. SplitSGD for BF16 training, which reduces the memory footprint of the master weights by half. [CPU]

For more detailed information, check Optimizer Fusion on CPU, Optimizer Fusion on GPU and Split SGD.
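
Both kinds of optimization are applied through ipex.optimize(). A minimal training-side sketch (the layer sizes and learning rate are placeholders): passing the optimizer along with dtype=torch.bfloat16 enables the fused optimizer path and, on CPU, SplitSGD for the BF16 master weights.

    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Linear(64, 10).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # dtype=torch.bfloat16 selects BF16 training; when an optimizer is
    # passed, ipex.optimize() returns a fused/split variant of it where
    # the extension supports one.
    model, optimizer = ipex.optimize(model, optimizer=optimizer,
                                     dtype=torch.bfloat16)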

Ahead of Time Compilation (AOT) [GPU]

AOT compilation is a helpful feature when you know at development or distribution time which device your application will target at execution time. With AOT compilation enabled, no additional compilation time is needed when running the application. It also benefits product quality: since just-in-time (JIT) compilation is skipped, no JIT bugs can be encountered at run time, and the final code executing on the target device can be tested as-is before delivery to end users. The disadvantage of this feature is a significantly larger distributed binary (e.g. from 500MB to 2.5GB for Intel® Extension for PyTorch*).

Memory Management [GPU]

Intel® Extension for PyTorch* uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronization overhead. Allocations are associated with a SYCL device. The allocator attempts to find the smallest cached block in the reserved block pool that fits the requested size. If it is unable to find an appropriate memory block among the already allocated areas, the allocator delegates to the device to allocate a new memory block.

For more detailed information, check Memory Management.
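
The allocator's behavior can be observed with the torch.xpu memory utilities; the sketch below assumes torch.xpu.memory_allocated(), torch.xpu.memory_reserved(), and torch.xpu.empty_cache() are available in your build, mirroring their torch.cuda counterparts.

    import torch
    import intel_extension_for_pytorch as ipex  # registers the "xpu" device

    x = torch.empty(1024, 1024, device="xpu")
    print(torch.xpu.memory_allocated())  # bytes currently held by tensors
    print(torch.xpu.memory_reserved())   # bytes held by the caching allocator

    del x
    # The freed block returns to the allocator's pool rather than the device;
    # empty_cache() releases unused cached blocks back to the device.
    torch.xpu.empty_cache()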