Launch Script User Guide

Overview
Common Execution Mode
- Latency Mode
- Throughput Mode
Basic Settings
- Launch Log
Advanced Settings
Examples

Overview

As introduced in the Practice Guide, several factors influence performance, and setting configuration options properly contributes to a performance boost. However, no single configuration is optimal for all topologies, so users need to try different combinations. The launch script automates these configuration settings to free users from this complicated work. This guide covers common usage of the launch script and provides examples for many optimized configuration cases.

The configuration options mainly cover the following aspects:

NUMA Control: numactl specifies the NUMA scheduling and memory placement policy.
Number of instances: [Single instance (default) | Multiple instances]
Memory allocator: [TCMalloc | JeMalloc | default Malloc] If unspecified, the launcher chooses one for you.

Common Execution Mode

The launch script is provided as a module of Intel® Extension for TensorFlow*. Run the following command to use it. If no knob is given, your script will be executed using all physical cores.

python -m intel_extension_for_tensorflow.python.launch [knobs] <your_script> [your_script_args]
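
A launch can also be assembled from another Python program. The sketch below only builds the argument list that mirrors the command above and does not run anything. The helper name build_launch_command is hypothetical and is not part of the extension.

```python
import sys

# Module path taken from the command shown in this guide.
LAUNCH_MODULE = "intel_extension_for_tensorflow.python.launch"


def build_launch_command(script, knobs=(), script_args=()):
    """Return the argv list for: python -m <launcher> [knobs] <script> [args].

    Hypothetical helper for illustration; it only assembles strings.
    """
    return [sys.executable, "-m", LAUNCH_MODULE, *knobs, script, *script_args]


command = build_launch_command("infer_resnet50.py", knobs=["--latency_mode"])
print(" ".join(command))
```

On a machine where the extension is installed, the resulting list could be passed to subprocess.run(command, check=True).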

For better performance, --latency_mode or --throughput_mode is enabled in most cases. The launch script then automatically calculates the number of instances and the number of cores used by each instance, so no manual setting is required. If you want to customize your execution, see Advanced Settings.

Latency mode

With --latency_mode, each instance uses 4 cores and all physical cores are used. This knob is mutually exclusive with --throughput_mode.

python -m intel_extension_for_tensorflow.python.launch --latency_mode infer_resnet50.py
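
To make the arithmetic concrete, here is a minimal sketch of how 4-core instances could tile the physical cores. It is an illustration only, not the launcher's actual code. The function name and the handling of a remainder (a smaller last block) are assumptions.

```python
def split_by_cores_per_instance(num_physical_cores, cores_per_instance=4):
    """Group core IDs 0..N-1 into consecutive blocks of cores_per_instance.

    Assumption: if the core count is not a multiple of the block size,
    the last block is simply smaller. The real launcher may differ.
    """
    return [
        list(range(start, min(start + cores_per_instance, num_physical_cores)))
        for start in range(0, num_physical_cores, cores_per_instance)
    ]


instances = split_by_cores_per_instance(96)
print(len(instances), instances[0], instances[-1])
```

With 96 physical cores this yields 24 instances, from cores 0-3 up to cores 92-95, which matches the file names in the latency mode example later in this guide.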

Throughput mode

With --throughput_mode, each NUMA node runs one instance and all physical cores are used. This knob is mutually exclusive with --latency_mode.

python -m intel_extension_for_tensorflow.python.launch --throughput_mode infer_resnet50.py
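
The node-to-instance mapping can be sketched as follows. This is a simplified illustration that assumes every NUMA node has the same number of physical cores and that core IDs are numbered contiguously per node. Real topologies may differ, and the function name is hypothetical.

```python
def instances_per_numa_node(num_nodes, cores_per_node):
    """Return {instance_index: (first_core, last_core)}, one instance per node.

    Assumes equal-sized nodes with contiguous core numbering.
    """
    return {
        node: (node * cores_per_node, (node + 1) * cores_per_node - 1)
        for node in range(num_nodes)
    }


layout = instances_per_numa_node(num_nodes=8, cores_per_node=12)
for instance, (first, last) in layout.items():
    print(f"instance {instance}: cores {first}-{last}")
```

The values 8 and 12 correspond to the machine used for the throughput mode example in this guide, where instance 0 runs on cores 0-11 and instance 7 on cores 84-95.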

Basic Settings

Launch Log

The launch script can create log files under a designated log directory so that you can investigate a run afterward. By default, logging to files is disabled to avoid undesired log files. Enable it by setting the knob --log_path to the directory where the log files should be saved. Both absolute and relative paths are supported.

Two types of log files are generated. One file (<prefix>_timestamp_instances.log) contains the command and settings used when the script was launched. The other type (<prefix>_timestamp_instance_N_core#-core#....log) contains the stdout output of each instance.

For example:

run_20210712212258_instances.log
run_20210712212258_instance_0_cores_0-43.log
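
When post-processing many runs, it can help to recover the instance index and core range from a per-instance file name. The sketch below assumes the naming pattern of the example above (a single contiguous core range and a 14-digit timestamp). Names produced with other options, such as an explicit core list, may look different and would not match. The function name is hypothetical.

```python
import re

# Pattern derived from the example file name in this guide (an assumption,
# not a documented format guarantee).
PER_INSTANCE_LOG = re.compile(
    r"^(?P<prefix>.+)_(?P<timestamp>\d{14})"
    r"_instance_(?P<instance>\d+)_cores_(?P<first>\d+)-(?P<last>\d+)\.log$"
)


def parse_instance_log_name(file_name):
    """Return the parts of a per-instance log name, or None if it differs."""
    match = PER_INSTANCE_LOG.match(file_name)
    if match is None:
        return None
    return {
        "prefix": match["prefix"],
        "timestamp": match["timestamp"],
        "instance": int(match["instance"]),
        "cores": (int(match["first"]), int(match["last"])),
    }


parsed = parse_instance_log_name("run_20210712212258_instance_0_cores_0-43.log")
print(parsed)
```

The summary file (run_20210712212258_instances.log) deliberately does not match this pattern, so the function returns None for it.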

Advanced Settings

The following table lists all available knobs.

Knob	Type	Default Value	Description
`-m`, `--module`	-	None	Changes each process to interpret the launch script as a python module, executing with the same behavior as `python -m`.
`--no_python`	BOOLEAN	False	Useful when the script is not a Python script. Do not prepend your script with `python`, execute it directly.
`--latency_mode`	BOOLEAN	False	By default, each instance uses 4 cores and all physical cores are used.
`--throughput_mode`	BOOLEAN	False	By default, each NUMA node runs one instance and all physical cores are used.
`--log_path`	STRING	""	Directory in which to save log files. The default, '', disables logging to files.
`--log_file_prefix`	STRING	"run"	Log file prefix.

If you want to set the number of instances, the core allocation, or some environment variables yourself, use the knobs described below, all of which are mutually exclusive with --latency_mode and --throughput_mode.

Multi-instance

You may want to launch multiple instances for better performance, for example, when batch size is small. The following knobs will be helpful.

Knob	Type	Default Value	Description
`--ninstances`	INTEGER	-1	Number of instances.
`--instance_idx`	INTEGER	-1	Run only the instance with the specified index among multiple instances (indices start at 0). Useful when running each instance independently.
`--ncore_per_instance`	INTEGER	-1	Cores per instance.
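
The relationship between the number of instances, the instance index, and the resulting core range can be sketched as below. This is only an illustration of an even split, consistent with the example logs later in this guide (96 cores, 4 instances, 24 cores each). The function name is hypothetical, and how the real launcher treats core counts that do not divide evenly is not covered.

```python
def cores_for_instance(num_cores, ninstances, instance_idx):
    """Return (first_core, last_core) of one instance under an even split."""
    if num_cores % ninstances != 0:
        raise ValueError("this sketch only handles evenly divisible core counts")
    if not 0 <= instance_idx < ninstances:
        raise ValueError("instance_idx must be in [0, ninstances)")
    per_instance = num_cores // ninstances
    first = instance_idx * per_instance
    return first, first + per_instance - 1


first_range = cores_for_instance(96, ninstances=4, instance_idx=0)
second_range = cores_for_instance(96, ninstances=4, instance_idx=1)
print(first_range, second_range)
```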

NUMA Control

These knobs set the NUMA policy to make better use of your hardware resources.

Knob	Type	Default Value	Description
`--node_id`	INTEGER	-1	Run on the specified node (node index starts at index 0).
`--skip_cross_node_cores`	BOOLEAN	False	When specifying --ncore_per_instance, set --skip_cross_node_cores to skip any cross-node cores.
`--disable_numactl`	BOOLEAN	False	Disable numactl.
`--disable_taskset`	BOOLEAN	False	Disable taskset.
`--use_logical_core`	BOOLEAN	False	Whether to use logical cores.
`--core_list`	STRING	None	Specify the core list as `core_id, core_id, ....`.

Memory Allocator

The script supports three memory allocators, selected with the following knobs. If none is specified, the script checks which allocators are installed on the execution machine and selects one in the order TCMalloc, JeMalloc, default Malloc.

Knob	Type	Default Value	Description
`--enable_tcmalloc`	BOOLEAN	False	Enable tcmalloc allocator. Ensure TCMalloc is installed before use.
`--enable_jemalloc`	BOOLEAN	False	Enable jemalloc allocator. Ensure JeMalloc is installed before use.
`--use_default_allocator`	BOOLEAN	False	Use default memory allocator.
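
The selection order described above can be sketched as a small function. This is an illustration only: the allocator names are plain strings chosen for this example, and detecting what is actually installed (as well as what happens when a requested allocator is missing) is left out.

```python
# Preference order taken from this guide: TCMalloc first, then JeMalloc.
PREFERENCE = ("tcmalloc", "jemalloc")


def choose_allocator(installed, requested=None):
    """Pick an allocator name.

    installed: set of allocator names assumed to be found on the machine.
    requested: explicit user choice, which takes priority when given.
    """
    if requested is not None:
        return requested
    for name in PREFERENCE:
        if name in installed:
            return name
    return "default"


print(choose_allocator({"tcmalloc", "jemalloc"}))
print(choose_allocator({"jemalloc"}))
print(choose_allocator(set()))
```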

Environment Variables

The launch script respects environment variables that already exist at launch. If you prefer specific values, set them before executing the launch script. For example, the Intel OpenMP library uses the environment variable KMP_AFFINITY to control its behavior, and different settings yield different performance. By default, the launch script sets KMP_AFFINITY to "granularity=fine,verbose,compact,1,0" or "granularity=fine,verbose,compact," depending on whether hyper-threading is on or off. To try other values, use the export command on Linux to set KMP_AFFINITY before you run the launch script. In that case, the script keeps the existing value of KMP_AFFINITY instead of setting the default, and prints a message to stdout.
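
The "existing value wins" behavior can be sketched as follows. This is a minimal illustration that works on plain dictionaries instead of the real process environment. The function name, the message text, and the sample value "granularity=fine,scatter" are assumptions made for this example.

```python
def resolve_env(existing_env, defaults):
    """Merge defaults into a copy of existing_env; existing values are kept."""
    resolved = dict(existing_env)
    for name, value in defaults.items():
        if name in resolved:
            # Mirrors the idea of reporting that a preset value is respected.
            print(f"{name} is already set to {resolved[name]}; keeping it")
        else:
            resolved[name] = value
    return resolved


defaults = {"KMP_AFFINITY": "granularity=fine,verbose,compact,1,0"}

untouched = resolve_env({}, defaults)
preset = resolve_env({"KMP_AFFINITY": "granularity=fine,scatter"}, defaults)
print(untouched["KMP_AFFINITY"])
print(preset["KMP_AFFINITY"])
```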

The launcher also automatically sets some environment variables related to TensorFlow and Intel® Extension for TensorFlow*. By default, TF_NUM_INTEROP_THREADS is set to 1 and TF_NUM_INTRAOP_THREADS is set to the number of cores per instance. ITEX AMP and the Intel® Extension for TensorFlow* layout optimization are disabled. You can change these settings with the following knobs.

Knob	Type	Default Value	Description
`--enable_op_parallelism`	BOOLEAN	False	When set to `True`, it sets environment variable `ITEX_OMP_THREADPOOL=0`.
`--tf_num_intraop_threads`	STRING	None	Defaults to None, in which case environment variable `TF_NUM_INTRAOP_THREADS` is set to the number of cores per instance.
`--tf_num_interop_threads`	STRING	None	Defaults to None, in which case environment variable `TF_NUM_INTEROP_THREADS` is set to 1.
`--enable_itex_amp`	BOOLEAN	False	Set environment variable `ITEX_AUTO_MIXED_PRECISION=1`.
`--enable_itex_layout_opt`	BOOLEAN	False	Set environment variable `ITEX_LAYOUT_OPT` to `1` when enabled, and to `0` otherwise.
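
The default rule for the two TensorFlow threading variables can be summarized in a short sketch. It is an illustration based only on the table above, not the launcher's code. The function name is hypothetical.

```python
def tf_thread_env(ncore_per_instance,
                  tf_num_intraop_threads=None,
                  tf_num_interop_threads=None):
    """Return the two TF threading variables as strings.

    None means "knob not given": intra-op falls back to the cores per
    instance and inter-op falls back to 1, as described in this guide.
    """
    intraop = (ncore_per_instance if tf_num_intraop_threads is None
               else tf_num_intraop_threads)
    interop = 1 if tf_num_interop_threads is None else tf_num_interop_threads
    return {
        "TF_NUM_INTRAOP_THREADS": str(intraop),
        "TF_NUM_INTEROP_THREADS": str(interop),
    }


default_env = tf_thread_env(96)
custom_env = tf_thread_env(96, tf_num_interop_threads=2)
print(default_env)
print(custom_env)
```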

Examples

The examples in this guide use the script infer_resnet50.py.

Single instance for inference
Multiple instances for inference
Set environment variables for inference
- IX. TF_NUM_INTRAOP_THREADS
- X. TF_NUM_INTEROP_THREADS
Usage of TCMalloc/Jemalloc/Default memory allocator

Single instance for inference

I. Use all physical cores

python -m intel_extension_for_tensorflow.python.launch --log_path ./logs infer_resnet50.py

Check your log directory. Its structure is shown below.

.
├── infer_resnet50.py
└── logs
    ├── run_20221009103552_instance_0_cores_0-95.log
    └── run_20221009103552_instances.log

The file run_20221009103552_instances.log records the settings and the command used for this launch.

$ cat logs/run_20221009103552_instances.log
2022-10-09 10:35:53,136 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 10:35:53,136 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 10:35:53,136 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 10:35:53,136 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 10:35:53,136 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 10:35:53,136 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=96
2022-10-09 10:35:53,136 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 10:35:53,136 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 10:35:53,137 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009103552_instance_0_cores_0-95.log
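
To compare runs, the settings recorded in such a summary log can be collected into a dictionary. The sketch below assumes the line layout shown in the log above (an "INFO" record whose message is an upper-case NAME=VALUE pair). The function name is hypothetical, and other launcher versions may format lines differently.

```python
import re

# Matches INFO records whose message looks like NAME=VALUE with an
# upper-case name. Command lines such as "numactl ..." are skipped.
SETTING_LINE = re.compile(r" - INFO - (?P<name>[A-Z][A-Z0-9_]*)=(?P<value>.*)$")


def read_logged_settings(lines):
    """Collect NAME=VALUE settings from launcher log lines."""
    settings = {}
    for line in lines:
        match = SETTING_LINE.search(line.rstrip("\n"))
        if match:
            settings[match["name"]] = match["value"]
    return settings


sample = [
    "2022-10-09 10:35:53,136 - __main__ - INFO - OMP_NUM_THREADS=96",
    "2022-10-09 10:35:53,136 - __main__ - INFO - "
    "KMP_AFFINITY=granularity=fine,verbose,compact,1,0",
    "2022-10-09 10:35:53,137 - __main__ - INFO - numactl --localalloc -C 0-95 "
    "<VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee "
    "./logs/run_20221009103552_instance_0_cores_0-95.log",
]
settings = read_logged_settings(sample)
print(settings)
```

For a real file, the same function could be applied to the lines read from logs/run_20221009103552_instances.log.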

II. Use all cores including logical cores

python -m intel_extension_for_tensorflow.python.launch --use_logical_core --log_path ./logs infer_resnet50.py

Check your log directory. Its structure is shown below.

.
├── infer_resnet50.py
└── logs
    ├── run_20221009104740_instances.log
    └── run_20221009104740_instance_0_cores_0-191.log

The file run_20221009104740_instances.log records the settings and the command used for this launch.

$ cat logs/run_20221009104740_instances.log
2022-10-09 10:47:40,908 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 10:47:40,909 - __main__ - INFO - OMP_NUM_THREADS=192
2022-10-09 10:47:40,909 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 10:47:40,909 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 10:47:40,909 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 10:47:40,909 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=192
2022-10-09 10:47:40,909 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 10:47:40,909 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 10:47:40,909 - __main__ - INFO - numactl --localalloc -C 0-191 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009104740_instance_0_cores_0-191.log

III. Use physical cores on one node

python -m intel_extension_for_tensorflow.python.launch --node_id 1 --log_path ./logs infer_resnet50.py

Check your log directory. Its structure is shown below.

.
├── infer_resnet50.py
└── logs
    ├── run_20221009105044_instances.log
    └── run_20221009105044_instance_0_cores_12-23.log

The file run_20221009105044_instances.log records the settings and the command used for this launch.

$ cat logs/run_20221009105044_instances.log
2022-10-09 10:50:44,693 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 10:50:44,693 - __main__ - INFO - OMP_NUM_THREADS=12
2022-10-09 10:50:44,693 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 10:50:44,693 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 10:50:44,693 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 10:50:44,693 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=12
2022-10-09 10:50:44,693 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 10:50:44,693 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 10:50:44,694 - __main__ - INFO - numactl --localalloc -C 12-23 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105044_instance_0_cores_12-23.log

IV. Use your designated number of cores

python -m intel_extension_for_tensorflow.python.launch --ninstances 1 --ncore_per_instance 10 --log_path ./logs infer_resnet50.py

Check your log directory. Its structure is shown below.

.
├── infer_resnet50.py
└── logs
    ├── run_20221009105320_instances.log
    └── run_20221009105320_instance_0_cores_0-9.log

The file run_20221009105320_instances.log records the settings and the command used for this launch.

$ cat logs/run_20221009105320_instances.log
2022-10-09 10:53:21,089 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 10:53:21,089 - __main__ - INFO - OMP_NUM_THREADS=10
2022-10-09 10:53:21,089 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 10:53:21,089 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 10:53:21,089 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 10:53:21,089 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=10
2022-10-09 10:53:21,089 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 10:53:21,089 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 10:53:21,090 - __main__ - INFO - numactl --localalloc -C 0-9 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105320_instance_0_cores_0-9.log

Multiple instances for inference

V. Throughput mode

python -m intel_extension_for_tensorflow.python.launch --throughput_mode --log_path ./logs infer_resnet50.py

Check your log directory. Its structure is shown below.

.
├── infer_resnet50.py
└── logs
    ├── run_20221009105838_instances.log
    ├── run_20221009105838_instance_0_cores_0-11.log
    ├── run_20221009105838_instance_1_cores_12-23.log
    ├── run_20221009105838_instance_2_cores_24-35.log
    ├── run_20221009105838_instance_3_cores_36-47.log
    ├── run_20221009105838_instance_4_cores_48-59.log
    ├── run_20221009105838_instance_5_cores_60-71.log
    ├── run_20221009105838_instance_6_cores_72-83.log
    └── run_20221009105838_instance_7_cores_84-95.log

The file run_20221009105838_instances.log records the settings and the commands used for this launch.

$ cat logs/run_20221009105838_instances.log
2022-10-09 10:58:38,757 - __main__ - WARNING - --throughput_mode is exclusive to --ninstances, --ncore_per_instance, --node_id and --use_logical_core. They won't take effect even if they are set explicitly.
2022-10-09 10:58:38,772 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 10:58:38,772 - __main__ - INFO - OMP_NUM_THREADS=12
2022-10-09 10:58:38,772 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 10:58:38,772 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 10:58:38,772 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 10:58:38,772 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=12
2022-10-09 10:58:38,772 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 10:58:38,772 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 10:58:38,772 - __main__ - INFO - numactl --localalloc -C 0-11 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_0_cores_0-11.log
2022-10-09 10:58:38,784 - __main__ - INFO - numactl --localalloc -C 12-23 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_1_cores_12-23.log
2022-10-09 10:58:38,795 - __main__ - INFO - numactl --localalloc -C 24-35 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_2_cores_24-35.log
2022-10-09 10:58:38,806 - __main__ - INFO - numactl --localalloc -C 36-47 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_3_cores_36-47.log
2022-10-09 10:58:38,817 - __main__ - INFO - numactl --localalloc -C 48-59 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_4_cores_48-59.log
2022-10-09 10:58:38,828 - __main__ - INFO - numactl --localalloc -C 60-71 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_5_cores_60-71.log
2022-10-09 10:58:38,839 - __main__ - INFO - numactl --localalloc -C 72-83 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_6_cores_72-83.log
2022-10-09 10:58:38,850 - __main__ - INFO - numactl --localalloc -C 84-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_7_cores_84-95.log

VI. Latency mode

python -m intel_extension_for_tensorflow.python.launch --latency_mode --log_path ./logs infer_resnet50.py

Check your log directory. Its structure is shown below.

.
├── infer_resnet50.py
└── logs
    ├── run_20221009110327_instances.log
    ├── run_20221009110327_instance_0_cores_0-3.log
    ├── run_20221009110327_instance_1_cores_4-7.log
    ├── run_20221009110327_instance_2_cores_8-11.log
    ├── run_20221009110327_instance_3_cores_12-15.log
    ├── run_20221009110327_instance_4_cores_16-19.log
    ├── run_20221009110327_instance_5_cores_20-23.log
    ├── run_20221009110327_instance_6_cores_24-27.log
    ├── run_20221009110327_instance_7_cores_28-31.log
    ├── run_20221009110327_instance_8_cores_32-35.log
    ├── run_20221009110327_instance_9_cores_36-39.log
    ├── run_20221009110327_instance_10_cores_40-43.log
    ├── run_20221009110327_instance_11_cores_44-47.log
    ├── run_20221009110327_instance_12_cores_48-51.log
    ├── run_20221009110327_instance_13_cores_52-55.log
    ├── run_20221009110327_instance_14_cores_56-59.log
    ├── run_20221009110327_instance_15_cores_60-63.log
    ├── run_20221009110327_instance_16_cores_64-67.log
    ├── run_20221009110327_instance_17_cores_68-71.log
    ├── run_20221009110327_instance_18_cores_72-75.log
    ├── run_20221009110327_instance_19_cores_76-79.log
    ├── run_20221009110327_instance_20_cores_80-83.log
    ├── run_20221009110327_instance_21_cores_84-87.log
    ├── run_20221009110327_instance_22_cores_88-91.log
    └── run_20221009110327_instance_23_cores_92-95.log

The file run_20221009110327_instances.log records the settings and the commands used for this launch.

$ cat logs/run_20221009110327_instances.log
2022-10-09 11:03:27,198 - __main__ - WARNING - --latency_mode is exclusive to --ninstances, --ncore_per_instance, --node_id and --use_logical_core. They won't take effect even if they are set explicitly.
2022-10-09 11:03:27,215 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:03:27,215 - __main__ - INFO - OMP_NUM_THREADS=4
2022-10-09 11:03:27,215 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:03:27,215 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:03:27,215 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:03:27,215 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=4
2022-10-09 11:03:27,215 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:03:27,215 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:03:27,216 - __main__ - INFO - numactl --localalloc -C 0-3 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_0_cores_0-3.log
2022-10-09 11:03:27,229 - __main__ - INFO - numactl --localalloc -C 4-7 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_1_cores_4-7.log
2022-10-09 11:03:27,241 - __main__ - INFO - numactl --localalloc -C 8-11 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_2_cores_8-11.log
2022-10-09 11:03:27,254 - __main__ - INFO - numactl --localalloc -C 12-15 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_3_cores_12-15.log
2022-10-09 11:03:27,266 - __main__ - INFO - numactl --localalloc -C 16-19 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_4_cores_16-19.log
2022-10-09 11:03:27,278 - __main__ - INFO - numactl --localalloc -C 20-23 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_5_cores_20-23.log
2022-10-09 11:03:27,290 - __main__ - INFO - numactl --localalloc -C 24-27 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_6_cores_24-27.log
2022-10-09 11:03:27,302 - __main__ - INFO - numactl --localalloc -C 28-31 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_7_cores_28-31.log
2022-10-09 11:03:27,315 - __main__ - INFO - numactl --localalloc -C 32-35 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_8_cores_32-35.log
2022-10-09 11:03:27,327 - __main__ - INFO - numactl --localalloc -C 36-39 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_9_cores_36-39.log
2022-10-09 11:03:27,339 - __main__ - INFO - numactl --localalloc -C 40-43 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_10_cores_40-43.log
2022-10-09 11:03:27,351 - __main__ - INFO - numactl --localalloc -C 44-47 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_11_cores_44-47.log
2022-10-09 11:03:27,364 - __main__ - INFO - numactl --localalloc -C 48-51 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_12_cores_48-51.log
2022-10-09 11:03:27,376 - __main__ - INFO - numactl --localalloc -C 52-55 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_13_cores_52-55.log
2022-10-09 11:03:27,388 - __main__ - INFO - numactl --localalloc -C 56-59 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_14_cores_56-59.log
2022-10-09 11:03:27,400 - __main__ - INFO - numactl --localalloc -C 60-63 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_15_cores_60-63.log
2022-10-09 11:03:27,413 - __main__ - INFO - numactl --localalloc -C 64-67 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_16_cores_64-67.log
2022-10-09 11:03:27,425 - __main__ - INFO - numactl --localalloc -C 68-71 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_17_cores_68-71.log
2022-10-09 11:03:27,438 - __main__ - INFO - numactl --localalloc -C 72-75 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_18_cores_72-75.log
2022-10-09 11:03:27,452 - __main__ - INFO - numactl --localalloc -C 76-79 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_19_cores_76-79.log
2022-10-09 11:03:27,465 - __main__ - INFO - numactl --localalloc -C 80-83 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_20_cores_80-83.log
2022-10-09 11:03:27,480 - __main__ - INFO - numactl --localalloc -C 84-87 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_21_cores_84-87.log
2022-10-09 11:03:27,494 - __main__ - INFO - numactl --localalloc -C 88-91 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_22_cores_88-91.log
2022-10-09 11:03:27,509 - __main__ - INFO - numactl --localalloc -C 92-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_23_cores_92-95.log

VII. Your designated number of instances

python -m intel_extension_for_tensorflow.python.launch --ninstances 4 --log_path ./logs infer_resnet50.py

Check your log directory. Its structure is shown below.

.
├── infer_resnet50.py
└── logs
    ├── run_20221009110849_instances.log
    ├── run_20221009110849_instance_0_cores_0-23.log
    ├── run_20221009110849_instance_1_cores_24-47.log
    ├── run_20221009110849_instance_2_cores_48-71.log
    └── run_20221009110849_instance_3_cores_72-95.log

The file run_20221009110849_instances.log records the settings and the commands used for this launch.

$ cat logs/run_20221009110849_instances.log
2022-10-09 11:08:49,891 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:08:49,891 - __main__ - INFO - OMP_NUM_THREADS=24
2022-10-09 11:08:49,891 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:08:49,891 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:08:49,892 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:08:49,892 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=24
2022-10-09 11:08:49,892 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:08:49,892 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:08:49,892 - __main__ - INFO - numactl --localalloc -C 0-23 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110849_instance_0_cores_0-23.log
2022-10-09 11:08:49,908 - __main__ - INFO - numactl --localalloc -C 24-47 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110849_instance_1_cores_24-47.log
2022-10-09 11:08:49,930 - __main__ - INFO - numactl --localalloc -C 48-71 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110849_instance_2_cores_48-71.log
2022-10-09 11:08:49,951 - __main__ - INFO - numactl --localalloc -C 72-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110849_instance_3_cores_72-95.log

VIII. Your designated number of instances and instance index

By default, the launcher runs all instances given by --ninstances for multi-instance inference and training, as shown above. To run only one of them independently, specify its index with --instance_idx.

python -m intel_extension_for_tensorflow.python.launch --ninstances 4 --instance_idx 0 --log_path ./logs infer_resnet50.py

You can confirm the usage in the log file:

2022-10-09 11:10:34,586 - __main__ - INFO - assigning 24 cores for instance 0
2022-10-09 11:10:34,604 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:10:34,604 - __main__ - INFO - OMP_NUM_THREADS=24
2022-10-09 11:10:34,605 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:10:34,605 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:10:34,605 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:10:34,605 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=24
2022-10-09 11:10:34,605 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:10:34,605 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:10:34,605 - __main__ - INFO - numactl --localalloc -C 0-23 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009111034_instance_0_cores_0-23.log

python -m intel_extension_for_tensorflow.python.launch --ninstances 4 --instance_idx 1 --log_path ./logs infer_resnet50.py

You can confirm the usage in the log file:

2022-10-09 11:12:40,129 - __main__ - INFO - assigning 24 cores for instance 1
2022-10-09 11:12:40,144 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:12:40,144 - __main__ - INFO - OMP_NUM_THREADS=24
2022-10-09 11:12:40,144 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:12:40,144 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:12:40,144 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:12:40,144 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=24
2022-10-09 11:12:40,144 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:12:40,144 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:12:40,145 - __main__ - INFO - numactl --localalloc -C 24-47 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009111239_instance_0_cores_24-47.log

Set environment variables for inference

IX. Set environment variable TF_NUM_INTRAOP_THREADS

python -m intel_extension_for_tensorflow.python.launch --tf_num_intraop_threads 8 --log_path ./logs infer_resnet50.py

Check your log directory. Its structure is shown below.

.
├── infer_resnet50.py
└── logs
    ├── run_20221009111753_instances.log
    └── run_20221009111753_instance_0_cores_0-95.log

The file run_20221009111753_instances.log records the settings and the command used for this launch.

$ cat logs/run_20221009111753_instances.log
2022-10-09 11:17:53,947 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:17:53,947 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 11:17:53,947 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:17:53,947 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:17:53,948 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:17:53,948 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=8
2022-10-09 11:17:53,948 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:17:53,948 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:17:53,948 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009111753_instance_0_cores_0-95.log

X. Set environment variable TF_NUM_INTEROP_THREADS

python -m intel_extension_for_tensorflow.python.launch --tf_num_interop_threads 2 --log_path ./logs infer_resnet50.py

Check your log directory. Its structure is shown below.

.
├── infer_resnet50.py
└── logs
    ├── run_20221009111951_instances.log
    └── run_20221009111951_instance_0_cores_0-95.log

The file run_20221009111951_instances.log records the settings and the command used for this launch.

$ cat logs/run_20221009111951_instances.log
2022-10-09 11:19:51,404 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:19:51,405 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 11:19:51,405 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:19:51,405 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:19:51,405 - __main__ - INFO - TF_NUM_INTEROP_THREADS=2
2022-10-09 11:19:51,405 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=96
2022-10-09 11:19:51,405 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:19:51,405 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:19:51,405 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009111951_instance_0_cores_0-95.log

Usage of TCMalloc/Jemalloc/Default memory allocator

The memory allocator can influence performance. If you do not designate a memory allocator, the launch script searches for one in the order TCMalloc > Jemalloc > TensorFlow default memory allocator, and uses the first one found.

Jemalloc

Note: If you do not want the default setting, set MALLOC_CONF to your preferred value before running the launch script.

python -m intel_extension_for_tensorflow.python.launch --enable_jemalloc --log_path ./logs infer_resnet50.py

You can confirm the usage in the log file:

2022-10-09 11:27:20,549 - __main__ - INFO - Use JeMalloc memory allocator
2022-10-09 11:27:20,550 - __main__ - INFO - MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto
2022-10-09 11:27:20,550 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 11:27:20,550 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:27:20,550 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:27:20,550 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:27:20,550 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=96
2022-10-09 11:27:20,550 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:27:20,550 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:27:20,550 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009112720_instance_0_cores_0-95.log

TCMalloc

python -m intel_extension_for_tensorflow.python.launch --enable_tcmalloc --log_path ./logs infer_resnet50.py

You can confirm the usage in the log file:

2022-10-09 11:29:05,206 - __main__ - INFO - Use TCMalloc memory allocator
2022-10-09 11:29:05,207 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 11:29:05,207 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:29:05,207 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:29:05,207 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:29:05,207 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=96
2022-10-09 11:29:05,207 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:29:05,207 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:29:05,207 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009112905_instance_0_cores_0-95.log

Default memory allocator

python -m intel_extension_for_tensorflow.python.launch --use_default_allocator --log_path ./logs infer_resnet50.py

You can confirm the usage in the log file:

2022-10-09 11:29:56,911 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 11:29:56,911 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:29:56,911 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:29:56,911 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:29:56,911 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=96
2022-10-09 11:29:56,911 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:29:56,911 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:29:56,911 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009112956_instance_0_cores_0-95.log