# Practice Guide

## Overview

Intel® Extension for TensorFlow* is a Python package that extends official TensorFlow to achieve higher performance. Although official TensorFlow and the default configuration of Intel® Extension for TensorFlow* perform well, there are still things users can do to optimize performance on specific platforms. Most of the optimized configurations can be set automatically by the launcher script. This article introduces common tips that Intel developers recommend.

## Table of Contents

- [Practice Guide](#practice-guide)
  - [Overview](#overview)
  - [Table of Contents](#table-of-contents)
  - [CPU Practice Guide](#cpu-practice-guide)
    - [Hardware Configuration](#hardware-configuration)
      - [Non-Uniform Memory Access (NUMA)](#non-uniform-memory-access-numa)
    - [Software Configuration](#software-configuration)
      - [Memory Layout format](#memory-layout-format)
      - [Numactl](#numactl)
      - [OpenMP](#openmp)
        - [OMP_NUM_THREADS](#omp_num_threads)
        - [GNU OpenMP](#gnu-openmp)
        - [Intel OpenMP](#intel-openmp)
      - [Memory Allocator](#memory-allocator)
        - [TCMalloc](#tcmalloc)
  - [GPU Practice Guide](#gpu-practice-guide)

## CPU Practice Guide

### Hardware Configuration

This section briefly introduces the structure of Intel CPUs, as well as the concept of Non-Uniform Memory Access (NUMA), as background knowledge.

#### Non-Uniform Memory Access (NUMA)

Having more and more CPU cores in one socket is good for users, because it brings more computation resources. However, it also brings memory access contention: a program may stall while memory is busy. To address this problem, `Non-Uniform Memory Access` (`NUMA`) was introduced. Compared to `Uniform Memory Access` (`UMA`), where all memory is connected to all cores equally, NUMA divides memory into multiple groups. A certain amount of memory is attached directly to one socket's integrated memory controller and becomes the local memory of that socket, while the remaining memory belongs to other sockets and is remote memory. Local memory access is much faster than remote memory access.

You can get CPU information with the `lscpu` command on Linux, including how many cores and sockets the machine has. NUMA information, such as how CPU cores are distributed, can also be retrieved. The following is an example of `lscpu` execution on a machine with two Intel® Xeon® Platinum 8180M CPUs. Two sockets were detected. Each socket has 28 physical cores onboard. Since Hyper-Threading is enabled, each core can run 2 threads, which means each socket has another 28 logical cores, for a total of 112 CPU cores. When indexing CPU cores, physical cores are usually indexed before logical cores. In this case, the first 28 cores (0-27) are physical cores on the first NUMA socket (node), and the next 28 cores (28-55) are physical cores on the second `NUMA` socket (node). Logical cores are indexed afterward: 56-83 are the 28 logical cores on the first `NUMA` socket (node), and 84-111 are the 28 logical cores on the second `NUMA` socket (node). Typically, avoid running `Intel® Extension for TensorFlow*` on logical cores if you want good performance.

```
$ lscpu
...
CPU(s):              112
On-line CPU(s) list: 0-111
Thread(s) per core:  2
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
...
Model name:          Intel(R) Xeon(R) Platinum 8180M CPU @ 2.50GHz
...
NUMA node0 CPU(s):   0-27,56-83
NUMA node1 CPU(s):   28-55,84-111
...
```
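If the mapping between logical CPU IDs, physical cores, and NUMA nodes is not obvious from the summary above, the commands below are one way to check it. This is a generic Linux inspection sketch, not specific to Intel® Extension for TensorFlow*; both commands are standard utilities (`lscpu` from util-linux and `numactl`).

```
# List each logical CPU with its NUMA node and physical core ID.
# Two entries that share the same CORE value are Hyper-Threading
# siblings on the same physical core.
lscpu --extended=CPU,NODE,CORE,ONLINE

# Show NUMA nodes, the CPUs attached to each node, and local memory sizes.
numactl --hardware
```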
### Software Configuration

This section introduces software configurations that help boost performance.

#### Memory Layout format

The default memory layout format of Intel® Extension for TensorFlow* is NHWC, which is the same as the official TensorFlow default. This format is generally friendly to most models, but some models may get higher performance with the NCHW format, which benefits from the `oneDNN` block format. The environment settings below select between the two memory formats mentioned above (plain NHWC and oneDNN block format, respectively):

```
ITEX_LAYOUT_OPT=0
ITEX_LAYOUT_OPT=1
```

#### Numactl

Since NUMA largely influences memory access performance, the Linux tool `numactl` allows users to control the NUMA policy for processes or shared memory. It runs processes with a specific NUMA scheduling or memory placement policy. As described in the previous section, cores share a high-speed cache within one socket, so it is a good idea to avoid cross-socket computation. From the memory access perspective, binding memory access to local memory is much faster than accessing remote memory. The following is an example of `numactl` usage to run a workload on the Nth socket and limit memory access to its local memory on the Nth socket. A more detailed description of the `numactl` command can be found in the [Numactl Linux Man Page](https://linux.die.net/man/8/numactl).

```
numactl --cpunodebind N --membind N python
```
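As a concrete sketch of how the tips above can be combined, the commands below pin a workload's compute and memory to NUMA node 0 and select the oneDNN block format before launching it. The script name `train.py` is a hypothetical placeholder for your own workload, and the layout setting should only be kept if it actually helps your model.

```
# Hypothetical launch: keep compute and memory on NUMA node 0,
# and request the oneDNN block format for the run.
export ITEX_LAYOUT_OPT=1
numactl --cpunodebind 0 --membind 0 python train.py
```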