# Practice Guide

## Overview

Intel® Extension for TensorFlow* is a Python package that extends the official TensorFlow in order to achieve improved performance. Although both official TensorFlow and the default configuration of Intel® Extension for TensorFlow* perform well, there are additional steps you can take to optimize performance on specific platforms. Most optimized configurations can be set automatically by the launcher script. This article covers common tips recommended by Intel developers.

## Table of Contents

- [Practice Guide](#practice-guide)
  - [Overview](#overview)
  - [Table of Contents](#table-of-contents)
  - [CPU Practice Guide](#cpu-practice-guide)
    - [Hardware Configuration](#hardware-configuration)
      - [Non-Uniform Memory Access (NUMA)](#non-uniform-memory-access-numa)
    - [Software Configuration](#software-configuration)
      - [Memory Layout Format](#memory-layout-format)
      - [Numactl](#numactl)
      - [OpenMP](#openmp)
        - [OMP_NUM_THREADS](#omp_num_threads)
        - [GNU OpenMP](#gnu-openmp)
        - [Intel OpenMP](#intel-openmp)
      - [Memory Allocator](#memory-allocator)
        - [TCMalloc](#tcmalloc)
  - [GPU Practice Guide](#gpu-practice-guide)

## CPU Practice Guide

### Hardware Configuration

This section briefly introduces the structure of Intel CPUs, as well as the concept of Non-Uniform Memory Access (NUMA), as background knowledge.

#### Non-Uniform Memory Access (NUMA)

More and more CPU cores are provided in one socket, which brings greater computation resources. However, it also causes memory access contention: programs can stall because memory is busy. To address this problem, `Non-Uniform Memory Access` (`NUMA`) was introduced. Compared to `Uniform Memory Access` (`UMA`), where all memory is connected to all cores equally, NUMA divides memory into multiple groups. A certain amount of memory is directly attached to one socket's integrated memory controller and becomes local memory of that socket, while the rest is remote memory to that socket. Local memory access is much faster than remote memory access.

You can get CPU information with the `lscpu` command on Linux to see how many cores and sockets are on the machine, as well as NUMA information such as how CPU cores are distributed. The following is an example of `lscpu` execution on a machine with two Intel® Xeon® Platinum 8180M CPUs.

Two sockets were detected. Each socket has 28 physical cores onboard. Since Hyper-Threading is enabled, each core can run 2 threads, so each socket has another 28 logical cores. Thus, a total of 112 CPU cores are available. When indexing CPU cores, physical cores are typically indexed before logical cores. In this case, the first 28 cores (0-27) are physical cores on the first NUMA socket (node), while the second 28 cores (28-55) are physical cores on the second `NUMA` socket (node). Logical cores are indexed afterward: 56-83 are the 28 logical cores on the first `NUMA` socket (node), and 84-111 are the 28 logical cores on the second `NUMA` socket (node). Typically, running `Intel® Extension for TensorFlow*` on logical cores can negatively impact performance and should therefore be avoided.

```
$ lscpu
...
CPU(s):              112
On-line CPU(s) list: 0-111
Thread(s) per core:  2
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
...
Model name:          Intel(R) Xeon(R) Platinum 8180M CPU @ 2.50GHz
...
NUMA node0 CPU(s):   0-27,56-83
NUMA node1 CPU(s):   28-55,84-111
...
```
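If you want to double-check which CPU IDs are hyper-thread siblings (and therefore logical cores to avoid), `lscpu` can also print a machine-readable core map. The snippet below is an illustrative sketch; the rows shown are examples of the output shape, not the full listing from a particular machine.

```
# One line per CPU ID, listing the physical core, socket, and NUMA node it
# belongs to. Two CPU IDs that share the same CORE value are hyper-thread
# siblings; on the machine above, CPU 0 and CPU 56 would share core 0.
$ lscpu --parse=CPU,CORE,SOCKET,NODE
# CPU,Core,Socket,Node
0,0,0,0
1,1,0,0
...
56,0,0,0
57,1,0,0
...
```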
### Software Configuration

This section introduces software configurations that help to boost performance.

#### Memory Layout Format

The default memory layout format of Intel® Extension for TensorFlow* is the NHWC format, the same as the official TensorFlow default. This format is friendly to most models, but some models may reach higher performance with the NCHW format by benefiting from the `oneDNN` block format. The environment variable `ITEX_LAYOUT_OPT` selects between the two memory formats mentioned above:

```
ITEX_LAYOUT_OPT=0
ITEX_LAYOUT_OPT=1
```

#### Numactl

Because NUMA strongly influences memory access performance, the Linux tool `numactl` is useful: it lets you control the NUMA policy for processes or shared memory, running a process with a specific NUMA scheduling or memory placement policy. As described in the previous section, cores in one socket share high-speed caches, so it is a good idea to avoid cross-socket computation. From a memory access perspective, binding a workload to local memory is much faster than letting it access remote memory.

The following example runs a workload on the Nth socket and limits memory access to the local memory of that socket. A more detailed description of the `numactl` command can be found on the [Numactl Linux Man Page](https://linux.die.net/man/8/numactl).

```
numactl --cpunodebind N --membind N python <script>
```
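As a concrete illustration of combining the settings above, the command below pins a run to socket 0 and its local memory. The script name `train.py` is a placeholder, and the assumption that `ITEX_LAYOUT_OPT=1` selects the oneDNN block layout should be verified against the Intel® Extension for TensorFlow* documentation for your version.

```
# Hypothetical example: pin a run to NUMA node 0's CPUs and local memory.
# The script name train.py is a placeholder. Note that --cpunodebind 0 allows
# all CPUs of node 0, including its logical (hyper-thread) cores; use
# --physcpubind with an explicit core list (e.g. 0-27) to stay on physical
# cores only. ITEX_LAYOUT_OPT=1 is assumed to select the oneDNN block layout;
# benchmark both 0 and 1 for your model.
ITEX_LAYOUT_OPT=1 numactl --cpunodebind 0 --membind 0 python train.py
```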