Performance Tuning Guide
========================

## Overview

Intel® Extension for PyTorch\* is a Python package that extends official PyTorch. It improves the out-of-the-box user experience of PyTorch on CPU while delivering good performance. To fully utilize the power of Intel® architecture and thus yield high performance, both PyTorch and Intel® Extension for PyTorch\* are powered by [oneAPI Deep Neural Network Library (oneDNN)](https://github.com/oneapi-src/oneDNN), an open-source cross-platform performance library of basic building blocks for deep learning applications. oneDNN is developed and optimized for Intel Architecture Processors, Intel Processor Graphics, and Xe architecture-based Graphics.

Although the default primitives of PyTorch and Intel® Extension for PyTorch\* are highly optimized, there are things users can do to improve performance. Most optimized configurations can be set automatically by the launcher script. This article introduces common methods recommended by Intel developers.

## Contents of this Document

* [Hardware Configuration](#hardware-configuration)
  * [Intel CPU Structure](#intel-cpu-structure)
  * [Non-Uniform Memory Access (NUMA)](#non-uniform-memory-access-numa)
* [Software Configuration](#software-configuration)
  * [Channels Last](#channels-last)
  * [Numactl](#numactl)
  * [OpenMP](#openmp)
    * [OMP_NUM_THREADS](#omp-num-threads)
    * [OMP_THREAD_LIMIT](#omp-thread-limit)
    * [GNU OpenMP](#gnu-openmp)
    * [Intel OpenMP](#intel-openmp)
  * [Memory Allocator](#memory-allocator)
    * [Jemalloc](#jemalloc)
    * [TCMalloc](#tcmalloc)
  * [Denormal Number](#denormal-number)
  * [OneDNN primitive cache](#onednn-primitive-cache)

## Hardware Configuration

This section briefly introduces the structure of Intel CPUs, as well as the concept of Non-Uniform Memory Access (NUMA).

### Intel CPU Structure

There are many families of Intel CPUs. We'll use the Intel® Xeon® processor Scalable family as an example to discuss how an Intel CPU works. Understanding this background is helpful for understanding the PyTorch optimization methodologies that Intel engineers recommend.

On the Intel® Xeon® Scalable Processors with Intel® C620 Series Chipsets (formerly Purley) platform, each chip provides up to 28 cores. Each core has a non-inclusive last-level cache and a 1MB L2 cache. The CPU features fast 2666 MHz DDR4 memory, six memory channels per CPU, Intel Ultra Path Interconnect (UPI) high speed point-to-point processor interconnect, and more. Figure 1 shows the microarchitecture of the Intel® Xeon® processor Scalable family chips. Each CPU chip consists of a number of cores, along with core-specific caches. 6 channels of DDR4 memory are connected to the chip directly. Meanwhile, chips communicate with each other through the Intel UPI interconnect, which features a transfer speed of up to 10.4 GT/s.
![Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture](../../../images/performance_tuning_guide/block_diagram_xeon_architecture.png)

Figure 1: Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture.
Usually, a CPU chip is called a socket. A typical two-socket configuration is illustrated in Figure 2. Two CPU sockets are installed on one motherboard. Each socket is connected to up to 6 channels of memory, which is called its local memory from the socket's perspective. Sockets are connected to each other via Intel UPI, so each socket can also access the memory attached to the other socket; this is usually called remote memory access. Local memory access is always faster than remote memory access. Meanwhile, cores on one socket share a space of high speed cache memory, which is much faster than communication via Intel UPI. Figure 3 shows an ASUS Z11PA-D8 Intel® Xeon® server motherboard, equipped with two sockets for Intel® Xeon® processor Scalable family CPUs.
![Typical two-socket configuration](../../../images/performance_tuning_guide/two_socket_config.png)

Figure 2: Typical two-socket configuration.

![ASUS Z11PA-D8 Intel® Xeon® server motherboard](https://dlcdnimgs.asus.com/websites/global/products/MCCApMgGOdr9WJxN/MB-Z11PAD8-overview-01-s.jpg)

Figure 3: An ASUS Z11PA-D8 Intel® Xeon® server motherboard. It contains two sockets for Intel® Xeon® processor Scalable family CPUs.
### Non-Uniform Memory Access (NUMA)

Getting more and more CPU cores in one socket is a good thing, because it brings more computation resources. However, it also brings memory access contention: a program can stall because the memory it needs is busy. To address this problem, Non-Uniform Memory Access (NUMA) was introduced. In contrast to Uniform Memory Access (UMA), where all memory is equally accessible to all cores, NUMA divides memory into multiple groups. A certain amount of memory is attached directly to one socket's integrated memory controller and becomes the local memory of that socket. As described in the previous section, local memory access is much faster than remote memory access.

Users can get CPU information with the `lscpu` command on Linux to learn how many cores and sockets are on the machine. NUMA information, such as how CPU cores are distributed, can also be retrieved. The following is an example of `lscpu` output on a machine with two Intel(R) Xeon(R) Platinum 8180M CPUs. 2 sockets were detected. Each socket has 28 physical cores onboard. Since Hyper-Threading is enabled, each core can run 2 threads, i.e. each socket has another 28 logical cores, so there are 112 CPU cores in total. When indexing CPU cores, physical cores are usually indexed before logical cores. In this case, the first 28 cores (0-27) are physical cores on the first NUMA node (socket), and the second 28 cores (28-55) are physical cores on the second NUMA node (socket). Logical cores are indexed afterward: 56-83 are the 28 logical cores on the first NUMA node, and 84-111 are the 28 logical cores on the second NUMA node. Typically, Intel® Extension for PyTorch\* workloads should avoid using logical cores to get good performance.

```
$ lscpu
...
CPU(s):              112
On-line CPU(s) list: 0-111
Thread(s) per core:  2
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
...
Model name:          Intel(R) Xeon(R) Platinum 8180M CPU @ 2.50GHz
...
NUMA node0 CPU(s):   0-27,56-83
NUMA node1 CPU(s):   28-55,84-111
...
```

## Software Configuration

This section introduces software configurations that help to boost performance.

### Channels Last

Take advantage of the **Channels Last** memory format for image processing tasks. Compared to the PyTorch default NCHW (`torch.contiguous_format`) memory format, NHWC (`torch.channels_last`) is more friendly to Intel platforms and thus generally yields better performance. A more detailed introduction can be found on the [Channels Last page](../features/nhwc.md). You can get sample code with ResNet50 on the [Example page](../examples.md).
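As a quick illustration, the following is a minimal sketch of running inference with channels last. The torchvision ResNet50 model, the random input, and the optional `ipex.optimize` call are illustrative choices for this sketch rather than requirements of the guide.

```python
import torch
import torchvision.models as models

# Illustrative model and input; any CNN working on 4D (NCHW) tensors applies.
model = models.resnet50().eval()
data = torch.rand(1, 3, 224, 224)

# Convert both the model weights and the input tensor to channels last (NHWC).
model = model.to(memory_format=torch.channels_last)
data = data.to(memory_format=torch.channels_last)

# Optionally let Intel® Extension for PyTorch* apply further optimizations:
# import intel_extension_for_pytorch as ipex
# model = ipex.optimize(model)

with torch.no_grad():
    output = model(data)
```

Complete runnable ResNet50 examples are available on the Example page linked above.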
### Numactl

Since NUMA largely influences memory access performance, this functionality should also be addressed on the software side. Over the course of Linux kernel development, more and more sophisticated implementations, optimizations, and strategies have been brought out. Version 2.5 of the Linux kernel already contained basic NUMA support, which was further improved in subsequent kernel releases. Version 3.8 of the Linux kernel brought a new NUMA foundation that allowed development of more efficient NUMA policies in later kernel releases. Version 3.13 of the Linux kernel brought numerous policies that aim at putting a process near its memory, together with the handling of cases such as having memory pages shared between processes, or the use of transparent huge pages. New sysctl settings allow NUMA balancing to be enabled or disabled, as well as the configuration of various NUMA memory balancing parameters.[1] The behavior of the Linux kernel thus differs according to kernel version. Newer Linux kernels may contain further optimizations of NUMA strategies, and thus deliver better performance. For some workloads, the NUMA strategy influences performance greatly.

Linux provides a tool, `numactl`, that allows user control of the NUMA policy for processes or shared memory. It runs processes with a specific NUMA scheduling or memory placement policy. As described in the previous section, cores on one socket share a high-speed cache, so it is a good idea to avoid cross-socket computation. From a memory access perspective, binding memory access locally is much faster than accessing remote memory. The following is an example of `numactl` usage to run a workload on the Nth socket and limit memory access to the local memory of the Nth socket. A more detailed description of the `numactl` command can be found [on the numactl man page](https://linux.die.net/man/8/numactl).

```
numactl --cpunodebind N --membind N python <script>
```
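As a sanity check, the binding can be verified from inside the Python process. The snippet below is a small sketch, not part of the original guide; it relies on `os.sched_getaffinity`, which is Linux-specific.

```python
import os

# CPUs the current process is allowed to run on (Linux only).
# Under `numactl --cpunodebind 0` on the machine from the lscpu example above,
# this should report only the cores of NUMA node 0, i.e. 0-27 and 56-83.
allowed_cpus = sorted(os.sched_getaffinity(0))
print(f"{len(allowed_cpus)} usable cores: {allowed_cpus}")
```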