Launch Script User Guide¶
Overview¶
As introduced in the Practice Guide, several factors influence performance. Setting configuration options properly contributes to a performance boost, but no single configuration is optimal for all topologies, so users need to try different combinations. A launch script is provided to automate these configuration settings and free users from this complicated work. This guide explains the common usage of the launch script and provides examples that cover many optimized configuration cases.
The configuration options mainly cover the following aspects:
NUMA Control: numactl specifies the NUMA scheduling and memory placement policy
Number of instances: [Single instance (default) | Multiple instances]
Memory allocator: [TCMalloc | JeMalloc | default Malloc] If unspecified, the launcher chooses one for you.
Common Execution Mode¶
The launch script is provided as a module of Intel® Extension for TensorFlow*. Run the following command to use it. If no knob is given, your script will be executed using all physical cores.
python -m intel_extension_for_tensorflow.python.launch [knobs] <your_script> [your_script_args]
In most cases, enabling --latency_mode or --throughput_mode gives better performance. The launch script then automatically calculates the number of instances and the number of cores used by each instance, so no manual setting is required. If you want to customize your execution, see Advanced Settings.
Latency mode¶
With --latency_mode, each instance uses 4 cores and all physical cores are used. This knob is mutually exclusive with --throughput_mode.
python -m intel_extension_for_tensorflow.python.launch --latency_mode infer_resnet50.py
Throughput mode¶
With --throughput_mode, one NUMA node corresponds to one instance and all physical cores are used. This knob is mutually exclusive with --latency_mode.
python -m intel_extension_for_tensorflow.python.launch --throughput_mode infer_resnet50.py
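The instance layout that the two modes produce can be reproduced with simple arithmetic. The sketch below is not the launcher's code: `plan_instances` is a hypothetical helper, and it assumes that cores are numbered contiguously and divide evenly, as on the 96-core, 8-node machine used in the examples later in this guide.

```python
def plan_instances(total_cores, cores_per_instance):
    """Split cores 0..total_cores-1 into contiguous, equally sized instances.

    Hypothetical helper for illustration; assumes an even split.
    """
    if total_cores % cores_per_instance != 0:
        raise ValueError("total_cores must be a multiple of cores_per_instance")
    ranges = []
    for start in range(0, total_cores, cores_per_instance):
        ranges.append((start, start + cores_per_instance - 1))
    return ranges


# Latency mode: 4 cores per instance.
latency = plan_instances(total_cores=96, cores_per_instance=4)

# Throughput mode: one instance per NUMA node (assumed here: 8 nodes x 12 cores).
throughput = plan_instances(total_cores=96, cores_per_instance=96 // 8)

print(len(latency), latency[0], latency[-1])
print(len(throughput), throughput[0], throughput[-1])
```

On this assumed topology, latency mode yields 24 instances and throughput mode yields 8, which matches the number of per-instance log files listed in the Examples section.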
Basic Settings¶
Launch Log¶
The launch script can create log files under a designated log directory so that you can investigate a run afterward. Logging to files is disabled by default to avoid undesired log files. To enable it, set the --log_path knob to the directory in which to save the log files. Both absolute and relative paths are supported.
Two types of log files are generated. One file (<prefix>_timestamp_instances.log) contains the command and information recorded when the script was launched. The other type of file (<prefix>_timestamp_instance_N_core#-core#....log) contains the stdout output of each instance.
For example:
run_20210712212258_instances.log
run_20210712212258_instance_0_cores_0-43.log
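To locate the files for a given run, it can help to build the expected names programmatically. The following is a minimal sketch that assumes the naming pattern inferred from the example file names above. `log_file_names` is a hypothetical helper and is not part of the launcher.

```python
from datetime import datetime


def log_file_names(prefix, when, core_ranges):
    """Return the summary log name followed by one name per instance.

    Hypothetical helper: the pattern is inferred from the example file
    names, not taken from the launcher's source.
    """
    stamp = when.strftime("%Y%m%d%H%M%S")
    names = [f"{prefix}_{stamp}_instances.log"]
    for idx, (first, last) in enumerate(core_ranges):
        names.append(f"{prefix}_{stamp}_instance_{idx}_cores_{first}-{last}.log")
    return names


names = log_file_names("run", datetime(2021, 7, 12, 21, 22, 58), [(0, 43)])
print(names)
```

The prefix corresponds to the --log_file_prefix knob, whose default value is "run".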
Advanced Settings¶
The following table lists all available knobs.
Knob | Type | Default Value | Description |
---|---|---|---|
-m, --module | - | None | Changes each process to interpret the launch script as a python module, executing with the same behavior as python -m. |
--no_python | BOOLEAN | False | Useful when the script is not a Python script. Do not prepend your script with python, execute it directly. |
--latency_mode | BOOLEAN | False | By default each instance uses 4 cores and all physical cores are used. |
--throughput_mode | BOOLEAN | False | By default one numa node corresponds to one instance and all physical cores are used. |
--log_path | STRING | "" | The log file path. Default path is '', which means disable logging to files. |
--log_file_prefix | STRING | "run" | Log file prefix. |
If you want to set the number of instances, the core allocation, or some environment variables yourself, use the knobs described below. They are all mutually exclusive with --latency_mode and --throughput_mode.
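When launches are scripted, conflicting knobs can be caught before the command runs. The sketch below assembles the launcher command line as an argument list and rejects conflicting combinations. `build_launch_command` is a hypothetical wrapper, and the list of conflicting manual knobs is taken from the warning message that appears in the example logs later in this guide. Note that the launcher itself only prints a warning and ignores the manual knobs; raising an error is a design choice of this sketch.

```python
import sys

LAUNCH_MODULE = "intel_extension_for_tensorflow.python.launch"
MODE_KNOBS = {"--latency_mode", "--throughput_mode"}
# Knobs that the launcher's warning message names as ignored when a mode is set.
MANUAL_KNOBS = {"--ninstances", "--ncore_per_instance", "--node_id", "--use_logical_core"}


def build_launch_command(script, knobs=(), script_args=()):
    """Assemble an argv list for the launcher, rejecting conflicting knobs."""
    flags = {k for k in knobs if k.startswith("--")}
    modes = flags & MODE_KNOBS
    if len(modes) > 1:
        raise ValueError("--latency_mode and --throughput_mode are mutually exclusive")
    if modes and flags & MANUAL_KNOBS:
        raise ValueError("mode knobs cannot be combined with manual instance/core knobs")
    return [sys.executable, "-m", LAUNCH_MODULE, *knobs, script, *script_args]


cmd = build_launch_command("infer_resnet50.py", ["--ninstances", "4", "--log_path", "./logs"])
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` unchanged.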
Multi-instance¶
You may want to launch multiple instances for better performance, for example, when batch size is small. The following knobs will be helpful.
Knob | Type | Default Value | Description |
---|---|---|---|
--ninstances | INTEGER | -1 | Number of instances. |
--instance_idx | INTEGER | -1 | Run only the instance with the specified index among multiple instances (the index starts at 0). Useful when running each instance independently. |
--ncore_per_instance | INTEGER | -1 | Cores per instance. |
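The way --ninstances and --instance_idx interact can be pictured with a short sketch. It assumes an even, contiguous split of the physical cores, which is what the example logs later in this guide show for 96 cores and 4 instances. `cores_for_instance` is a hypothetical helper and not the launcher's implementation.

```python
def cores_for_instance(total_cores, ninstances, instance_idx=-1):
    """Return (first, last) core ranges for all instances, or for only one.

    Hypothetical helper; instance_idx=-1 mirrors the knob's default value
    and means "all instances". Assumes total_cores divides evenly.
    """
    per_instance = total_cores // ninstances
    ranges = [(i * per_instance, (i + 1) * per_instance - 1) for i in range(ninstances)]
    if instance_idx == -1:
        return ranges
    if not 0 <= instance_idx < ninstances:
        raise IndexError("instance_idx must be in [0, ninstances)")
    return [ranges[instance_idx]]


print(cores_for_instance(96, 4))
print(cores_for_instance(96, 4, instance_idx=1))
```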
NUMA Control¶
These knobs set the NUMA policy to better utilize your hardware resources.
Knob | Type | Default Value | Description |
---|---|---|---|
--node_id | INTEGER | -1 | Run on the specified node (the node index starts at 0). |
--skip_cross_node_cores | BOOLEAN | False | When specifying --ncore_per_instance, set --skip_cross_node_cores to skip any cross-node cores. |
--disable_numactl | BOOLEAN | False | Disable numactl. |
--disable_taskset | BOOLEAN | False | Disable taskset. |
--use_logical_core | BOOLEAN | False | Whether to use logical cores. |
--core_list | STRING | None | Specify the core list as core_id, core_id, ... |
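The launch logs shown later in this guide print commands of the form `numactl --localalloc -C <cores> ... python -u <script>`. As a rough illustration of how a list of core ids maps onto the `-C` argument, the sketch below compresses the ids into ranges. Both functions are hypothetical helpers, and the output only mimics the shape of the logged commands.

```python
def format_core_list(core_ids):
    """Compress core ids into a range string such as "0-3,8,10-11"."""
    ids = sorted(set(core_ids))
    if not ids:
        raise ValueError("core list is empty")
    parts = []
    start = prev = ids[0]
    for core in ids[1:]:
        if core == prev + 1:
            prev = core
            continue
        # The run of consecutive ids ended; emit it and start a new one.
        parts.append(f"{start}-{prev}" if start != prev else str(start))
        start = prev = core
    parts.append(f"{start}-{prev}" if start != prev else str(start))
    return ",".join(parts)


def numactl_command(core_ids, script, python="python"):
    """Build a command shaped like the ones printed in the launch log."""
    return ["numactl", "--localalloc", "-C", format_core_list(core_ids), python, "-u", script]


print(numactl_command(range(12, 24), "infer_resnet50.py"))
```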
Memory Allocator¶
This script supports three memory allocator types, selected with the following knobs. If none is specified, the script checks which allocators are installed on the execution machine and selects one in the order TCMalloc, JeMalloc, default malloc.
Knob | Type | Default Value | Description |
---|---|---|---|
--enable_tcmalloc | BOOLEAN | False | Enable tcmalloc allocator. Ensure TCMalloc is installed before use. |
--enable_jemalloc | BOOLEAN | False | Enable jemalloc allocator. Ensure JeMalloc is installed before use. |
--use_default_allocator | BOOLEAN | False | Use default memory allocator. |
Environment Variables¶
The launch script respects environment variables that already exist at launch. If you prefer particular values, set them before executing the launch script. For example, the Intel OpenMP library uses the environment variable KMP_AFFINITY to control its behavior, and different settings yield different performance. By default, the launch script sets KMP_AFFINITY to “granularity=fine,verbose,compact,1,0” or “granularity=fine,verbose,compact,” depending on whether hyper-threading is on or off. If you want to try other values, use the export command on Linux to set KMP_AFFINITY before you run the launch script. In this case, the script does not set the default value but takes the existing value of KMP_AFFINITY and prints a message to stdout.
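The "respect an existing value, otherwise apply the default" behavior described above can be sketched in a few lines. This is an illustration of the rule only: `set_if_unset` is a hypothetical helper, the dictionaries stand in for a process environment, and the user value shown is just an example.

```python
DEFAULT_KMP_AFFINITY = "granularity=fine,verbose,compact,1,0"


def set_if_unset(env, name, default):
    """Set env[name] to default only when it is not already present.

    Returns the value that ends up in effect.
    """
    if name in env:
        print(f"{name} is already set to {env[name]!r}; keeping the existing value")
    else:
        env[name] = default
    return env[name]


# Nothing exported beforehand: the default is applied.
clean_env = {}
set_if_unset(clean_env, "KMP_AFFINITY", DEFAULT_KMP_AFFINITY)

# A value exported by the user before launch wins over the default.
user_env = {"KMP_AFFINITY": "granularity=fine,compact"}
set_if_unset(user_env, "KMP_AFFINITY", DEFAULT_KMP_AFFINITY)
```

When the launcher is started from Python, a dictionary prepared this way can be passed through the `env` argument of `subprocess.run`.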
The launcher also automatically sets some environment variables related to TensorFlow and Intel® Extension for TensorFlow*. By default, TF_NUM_INTEROP_THREADS is set to 1 and TF_NUM_INTRAOP_THREADS is set to the number of cores per instance. ITEX AMP and Intel® Extension for TensorFlow* layout optimization are disabled.
You can change them with the following knobs.
Knob | Type | Default Value | Description |
---|---|---|---|
--tf_num_intraop_threads | STRING | None | Defaults to None, in which case the environment variable TF_NUM_INTRAOP_THREADS is set to the number of cores per instance. |
--tf_num_interop_threads | STRING | None | Defaults to None, in which case the environment variable TF_NUM_INTEROP_THREADS is set to 1. |
--enable_itex_amp | BOOLEAN | False | Set environment variable ITEX_AUTO_MIXED_PRECISION=1. |
--enable_itex_layout_opt | BOOLEAN | False | Set environment variable ITEX_LAYOUT_OPT=0 or 1. |
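The defaulting rule for the two threading variables can be written down as a small function. This is a sketch of the documented behavior under the assumption that an explicit knob value is passed through unchanged. `tf_thread_env` is a hypothetical name and not part of the launcher.

```python
def tf_thread_env(ncore_per_instance, tf_num_intraop_threads=None, tf_num_interop_threads=None):
    """Resolve the two TensorFlow threading variables as strings.

    None mirrors the knobs' default value and selects the documented default:
    intra-op threads = cores per instance, inter-op threads = 1.
    """
    intraop = ncore_per_instance if tf_num_intraop_threads is None else tf_num_intraop_threads
    interop = 1 if tf_num_interop_threads is None else tf_num_interop_threads
    return {
        "TF_NUM_INTRAOP_THREADS": str(intraop),
        "TF_NUM_INTEROP_THREADS": str(interop),
    }


print(tf_thread_env(96))
print(tf_thread_env(96, tf_num_interop_threads="2"))
```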
Examples¶
The example script infer_resnet50.py is used throughout this guide.
Single instance for inference
Multiple instances for inference
Set environment variables for inference
Usage of TCMalloc/Jemalloc/Default memory allocator
Single instance for inference¶
I. Use all physical cores¶
python -m intel_extension_for_tensorflow.python.launch --log_path ./logs infer_resnet50.py
Check your log directory. Its structure is as follows.
.
├── infer_resnet50.py
└── logs
    ├── run_20221009103552_instance_0_cores_0-95.log
    └── run_20221009103552_instances.log
The run_20221009103552_instances.log file contains the information and command used for this launch.
$ cat logs/run_20221009103552_instances.log
2022-10-09 10:35:53,136 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 10:35:53,136 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 10:35:53,136 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 10:35:53,136 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 10:35:53,136 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 10:35:53,136 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=96
2022-10-09 10:35:53,136 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 10:35:53,136 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 10:35:53,137 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009103552_instance_0_cores_0-95.log
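Because the instances log records each setting as a NAME=value pair at INFO level, it is straightforward to collect them for later comparison between runs. The sketch below is one way to do so. The regular expression is an assumption based on the log format shown above, and `parse_env_settings` is a hypothetical helper.

```python
import re

# Matches lines such as:
# 2022-10-09 10:35:53,136 - __main__ - INFO - OMP_NUM_THREADS=96
ENV_LINE = re.compile(r" - INFO - ([A-Z][A-Z0-9_]*)=(.*)$")


def parse_env_settings(lines):
    """Collect NAME=value pairs reported at INFO level in a launch log."""
    settings = {}
    for line in lines:
        match = ENV_LINE.search(line.rstrip("\n"))
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


sample = [
    "2022-10-09 10:35:53,136 - __main__ - INFO - OMP_NUM_THREADS=96",
    "2022-10-09 10:35:53,136 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0",
    "2022-10-09 10:35:53,137 - __main__ - INFO - numactl --localalloc -C 0-95 python -u infer_resnet50.py",
]
print(parse_env_settings(sample))
```

An open file object can be passed directly in place of `sample`, since iterating over a text file yields its lines. The numactl command line is skipped because it does not start with an upper-case variable name.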
II. Use all cores including logical cores¶
python -m intel_extension_for_tensorflow.python.launch --use_logical_core --log_path ./logs infer_resnet50.py
Check your log directory. Its structure is as follows.
.
├── infer_resnet50.py
└── logs
    ├── run_20221009104740_instances.log
    └── run_20221009104740_instance_0_cores_0-191.log
The run_20221009104740_instances.log file contains the information and command used for this launch.
$ cat logs/run_20221009104740_instances.log
2022-10-09 10:47:40,908 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 10:47:40,909 - __main__ - INFO - OMP_NUM_THREADS=192
2022-10-09 10:47:40,909 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 10:47:40,909 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 10:47:40,909 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 10:47:40,909 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=192
2022-10-09 10:47:40,909 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 10:47:40,909 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 10:47:40,909 - __main__ - INFO - numactl --localalloc -C 0-191 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009104740_instance_0_cores_0-191.log
III. Use physical cores on one node¶
python -m intel_extension_for_tensorflow.python.launch --node_id 1 --log_path ./logs infer_resnet50.py
Check your log directory. Its structure is as follows.
.
├── infer_resnet50.py
└── logs
    ├── run_20221009105044_instances.log
    └── run_20221009105044_instance_0_cores_12-23.log
The run_20221009105044_instances.log file contains the information and command used for this launch.
$ cat logs/run_20221009105044_instances.log
2022-10-09 10:50:44,693 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 10:50:44,693 - __main__ - INFO - OMP_NUM_THREADS=12
2022-10-09 10:50:44,693 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 10:50:44,693 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 10:50:44,693 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 10:50:44,693 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=12
2022-10-09 10:50:44,693 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 10:50:44,693 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 10:50:44,694 - __main__ - INFO - numactl --localalloc -C 12-23 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105044_instance_0_cores_12-23.log
IV. Use your designated number of cores¶
python -m intel_extension_for_tensorflow.python.launch --ninstances 1 --ncore_per_instance 10 --log_path ./logs infer_resnet50.py
Check your log directory. Its structure is as follows.
.
├── infer_resnet50.py
└── logs
    ├── run_20221009105320_instances.log
    └── run_20221009105320_instance_0_cores_0-9.log
The run_20221009105320_instances.log file contains the information and command used for this launch.
$ cat logs/run_20221009105320_instances.log
2022-10-09 10:53:21,089 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 10:53:21,089 - __main__ - INFO - OMP_NUM_THREADS=10
2022-10-09 10:53:21,089 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 10:53:21,089 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 10:53:21,089 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 10:53:21,089 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=10
2022-10-09 10:53:21,089 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 10:53:21,089 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 10:53:21,090 - __main__ - INFO - numactl --localalloc -C 0-9 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105320_instance_0_cores_0-9.log
Multiple instances for inference¶
V. Throughput mode¶
python -m intel_extension_for_tensorflow.python.launch --throughput_mode --log_path ./logs infer_resnet50.py
Check your log directory. Its structure is as follows.
.
├── infer_resnet50.py
└── logs
    ├── run_20221009105838_instances.log
    ├── run_20221009105838_instance_0_cores_0-11.log
    ├── run_20221009105838_instance_1_cores_12-23.log
    ├── run_20221009105838_instance_2_cores_24-35.log
    ├── run_20221009105838_instance_3_cores_36-47.log
    ├── run_20221009105838_instance_4_cores_48-59.log
    ├── run_20221009105838_instance_5_cores_60-71.log
    ├── run_20221009105838_instance_6_cores_72-83.log
    └── run_20221009105838_instance_7_cores_84-95.log
The run_20221009105838_instances.log file contains the information and command used for this launch.
$ cat logs/run_20221009105838_instances.log
2022-10-09 10:58:38,757 - __main__ - WARNING - --throughput_mode is exclusive to --ninstances, --ncore_per_instance, --node_id and --use_logical_core. They won't take effect even if they are set explicitly.
2022-10-09 10:58:38,772 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 10:58:38,772 - __main__ - INFO - OMP_NUM_THREADS=12
2022-10-09 10:58:38,772 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 10:58:38,772 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 10:58:38,772 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 10:58:38,772 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=12
2022-10-09 10:58:38,772 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 10:58:38,772 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 10:58:38,772 - __main__ - INFO - numactl --localalloc -C 0-11 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_0_cores_0-11.log
2022-10-09 10:58:38,784 - __main__ - INFO - numactl --localalloc -C 12-23 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_1_cores_12-23.log
2022-10-09 10:58:38,795 - __main__ - INFO - numactl --localalloc -C 24-35 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_2_cores_24-35.log
2022-10-09 10:58:38,806 - __main__ - INFO - numactl --localalloc -C 36-47 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_3_cores_36-47.log
2022-10-09 10:58:38,817 - __main__ - INFO - numactl --localalloc -C 48-59 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_4_cores_48-59.log
2022-10-09 10:58:38,828 - __main__ - INFO - numactl --localalloc -C 60-71 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_5_cores_60-71.log
2022-10-09 10:58:38,839 - __main__ - INFO - numactl --localalloc -C 72-83 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_6_cores_72-83.log
2022-10-09 10:58:38,850 - __main__ - INFO - numactl --localalloc -C 84-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009105838_instance_7_cores_84-95.log
VI. Latency mode¶
python -m intel_extension_for_tensorflow.python.launch --latency_mode --log_path ./logs infer_resnet50.py
Check your log directory. Its structure is as follows.
.
├── infer_resnet50.py
└── logs
    ├── run_20221009110327_instances.log
    ├── run_20221009110327_instance_0_cores_0-3.log
    ├── run_20221009110327_instance_1_cores_4-7.log
    ├── run_20221009110327_instance_2_cores_8-11.log
    ├── run_20221009110327_instance_3_cores_12-15.log
    ├── run_20221009110327_instance_4_cores_16-19.log
    ├── run_20221009110327_instance_5_cores_20-23.log
    ├── run_20221009110327_instance_6_cores_24-27.log
    ├── run_20221009110327_instance_7_cores_28-31.log
    ├── run_20221009110327_instance_8_cores_32-35.log
    ├── run_20221009110327_instance_9_cores_36-39.log
    ├── run_20221009110327_instance_10_cores_40-43.log
    ├── run_20221009110327_instance_11_cores_44-47.log
    ├── run_20221009110327_instance_12_cores_48-51.log
    ├── run_20221009110327_instance_13_cores_52-55.log
    ├── run_20221009110327_instance_14_cores_56-59.log
    ├── run_20221009110327_instance_15_cores_60-63.log
    ├── run_20221009110327_instance_16_cores_64-67.log
    ├── run_20221009110327_instance_17_cores_68-71.log
    ├── run_20221009110327_instance_18_cores_72-75.log
    ├── run_20221009110327_instance_19_cores_76-79.log
    ├── run_20221009110327_instance_20_cores_80-83.log
    ├── run_20221009110327_instance_21_cores_84-87.log
    ├── run_20221009110327_instance_22_cores_88-91.log
    └── run_20221009110327_instance_23_cores_92-95.log
The run_20221009110327_instances.log file contains the information and command used for this launch.
$ cat logs/run_20221009110327_instances.log
2022-10-09 11:03:27,198 - __main__ - WARNING - --latency_mode is exclusive to --ninstances, --ncore_per_instance, --node_id and --use_logical_core. They won't take effect even if they are set explicitly.
2022-10-09 11:03:27,215 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:03:27,215 - __main__ - INFO - OMP_NUM_THREADS=4
2022-10-09 11:03:27,215 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:03:27,215 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:03:27,215 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:03:27,215 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=4
2022-10-09 11:03:27,215 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:03:27,215 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:03:27,216 - __main__ - INFO - numactl --localalloc -C 0-3 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_0_cores_0-3.log
2022-10-09 11:03:27,229 - __main__ - INFO - numactl --localalloc -C 4-7 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_1_cores_4-7.log
2022-10-09 11:03:27,241 - __main__ - INFO - numactl --localalloc -C 8-11 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_2_cores_8-11.log
2022-10-09 11:03:27,254 - __main__ - INFO - numactl --localalloc -C 12-15 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_3_cores_12-15.log
2022-10-09 11:03:27,266 - __main__ - INFO - numactl --localalloc -C 16-19 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_4_cores_16-19.log
2022-10-09 11:03:27,278 - __main__ - INFO - numactl --localalloc -C 20-23 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_5_cores_20-23.log
2022-10-09 11:03:27,290 - __main__ - INFO - numactl --localalloc -C 24-27 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_6_cores_24-27.log
2022-10-09 11:03:27,302 - __main__ - INFO - numactl --localalloc -C 28-31 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_7_cores_28-31.log
2022-10-09 11:03:27,315 - __main__ - INFO - numactl --localalloc -C 32-35 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_8_cores_32-35.log
2022-10-09 11:03:27,327 - __main__ - INFO - numactl --localalloc -C 36-39 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_9_cores_36-39.log
2022-10-09 11:03:27,339 - __main__ - INFO - numactl --localalloc -C 40-43 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_10_cores_40-43.log
2022-10-09 11:03:27,351 - __main__ - INFO - numactl --localalloc -C 44-47 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_11_cores_44-47.log
2022-10-09 11:03:27,364 - __main__ - INFO - numactl --localalloc -C 48-51 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_12_cores_48-51.log
2022-10-09 11:03:27,376 - __main__ - INFO - numactl --localalloc -C 52-55 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_13_cores_52-55.log
2022-10-09 11:03:27,388 - __main__ - INFO - numactl --localalloc -C 56-59 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_14_cores_56-59.log
2022-10-09 11:03:27,400 - __main__ - INFO - numactl --localalloc -C 60-63 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_15_cores_60-63.log
2022-10-09 11:03:27,413 - __main__ - INFO - numactl --localalloc -C 64-67 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_16_cores_64-67.log
2022-10-09 11:03:27,425 - __main__ - INFO - numactl --localalloc -C 68-71 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_17_cores_68-71.log
2022-10-09 11:03:27,438 - __main__ - INFO - numactl --localalloc -C 72-75 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_18_cores_72-75.log
2022-10-09 11:03:27,452 - __main__ - INFO - numactl --localalloc -C 76-79 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_19_cores_76-79.log
2022-10-09 11:03:27,465 - __main__ - INFO - numactl --localalloc -C 80-83 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_20_cores_80-83.log
2022-10-09 11:03:27,480 - __main__ - INFO - numactl --localalloc -C 84-87 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_21_cores_84-87.log
2022-10-09 11:03:27,494 - __main__ - INFO - numactl --localalloc -C 88-91 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_22_cores_88-91.log
2022-10-09 11:03:27,509 - __main__ - INFO - numactl --localalloc -C 92-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110327_instance_23_cores_92-95.log
VII. Your designated number of instances¶
python -m intel_extension_for_tensorflow.python.launch --ninstances 4 --log_path ./logs infer_resnet50.py
Check your log directory. Its structure is as follows.
.
├── infer_resnet50.py
└── logs
    ├── run_20221009110849_instances.log
    ├── run_20221009110849_instance_0_cores_0-23.log
    ├── run_20221009110849_instance_1_cores_24-47.log
    ├── run_20221009110849_instance_2_cores_48-71.log
    └── run_20221009110849_instance_3_cores_72-95.log
The run_20221009110849_instances.log file contains the information and command used for this launch.
$ cat logs/run_20221009110849_instances.log
2022-10-09 11:08:49,891 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:08:49,891 - __main__ - INFO - OMP_NUM_THREADS=24
2022-10-09 11:08:49,891 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:08:49,891 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:08:49,892 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:08:49,892 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=24
2022-10-09 11:08:49,892 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:08:49,892 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:08:49,892 - __main__ - INFO - numactl --localalloc -C 0-23 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110849_instance_0_cores_0-23.log
2022-10-09 11:08:49,908 - __main__ - INFO - numactl --localalloc -C 24-47 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110849_instance_1_cores_24-47.log
2022-10-09 11:08:49,930 - __main__ - INFO - numactl --localalloc -C 48-71 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110849_instance_2_cores_48-71.log
2022-10-09 11:08:49,951 - __main__ - INFO - numactl --localalloc -C 72-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009110849_instance_3_cores_72-95.log
VIII. Your designated number of instances and instance index¶
By default, the launcher runs all ninstances instances for multi-instance inference and training, as shown above. You can specify instance_idx to run only that instance, independently of the others.
python -m intel_extension_for_tensorflow.python.launch --ninstances 4 --instance_idx 0 --log_path ./logs infer_resnet50.py
You can confirm the usage in the log file:
2022-10-09 11:10:34,586 - __main__ - INFO - assigning 24 cores for instance 0
2022-10-09 11:10:34,604 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:10:34,604 - __main__ - INFO - OMP_NUM_THREADS=24
2022-10-09 11:10:34,605 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:10:34,605 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:10:34,605 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:10:34,605 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=24
2022-10-09 11:10:34,605 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:10:34,605 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:10:34,605 - __main__ - INFO - numactl --localalloc -C 0-23 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009111034_instance_0_cores_0-23.log
python -m intel_extension_for_tensorflow.python.launch --ninstances 4 --instance_idx 1 --log_path ./logs infer_resnet50.py
You can confirm the usage in the log file:
2022-10-09 11:12:40,129 - __main__ - INFO - assigning 24 cores for instance 1
2022-10-09 11:12:40,144 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:12:40,144 - __main__ - INFO - OMP_NUM_THREADS=24
2022-10-09 11:12:40,144 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:12:40,144 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:12:40,144 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:12:40,144 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=24
2022-10-09 11:12:40,144 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:12:40,144 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:12:40,145 - __main__ - INFO - numactl --localalloc -C 24-47 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009111239_instance_0_cores_24-47.log
Set environment variables for inference¶
IX. Set environment variable TF_NUM_INTRAOP_THREADS¶
python -m intel_extension_for_tensorflow.python.launch --tf_num_intraop_threads 8 --log_path ./logs infer_resnet50.py
Check your log directory. Its structure is as follows.
.
├── infer_resnet50.py
└── logs
    ├── run_20221009111753_instances.log
    └── run_20221009111753_instance_0_cores_0-95.log
The run_20221009111753_instances.log file contains the information and command used for this launch.
$ cat logs/run_20221009111753_instances.log
2022-10-09 11:17:53,947 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:17:53,947 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 11:17:53,947 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:17:53,947 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:17:53,948 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:17:53,948 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=8
2022-10-09 11:17:53,948 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:17:53,948 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:17:53,948 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009111753_instance_0_cores_0-95.log
X. Set environment variable TF_NUM_INTEROP_THREADS¶
python -m intel_extension_for_tensorflow.python.launch --tf_num_interop_threads 2 --log_path ./logs infer_resnet50.py
Check your log directory. Its structure is as follows.
.
├── infer_resnet50.py
└── logs
    ├── run_20221009111951_instances.log
    └── run_20221009111951_instance_0_cores_0-95.log
The run_20221009111951_instances.log file contains the information and command used for this launch.
$ cat logs/run_20221009111951_instances.log
2022-10-09 11:19:51,404 - __main__ - WARNING - Neither TCMalloc nor JeMalloc is found in $CONDA_PREFIX/lib or $VIRTUAL_ENV/lib or /.local/lib/ or /usr/local/lib/ or /usr/local/lib64/ or /usr/lib or /usr/lib64 or /home/sdp/.local/lib/ so the LD_PRELOAD environment variable will not be set. This may drop the performance.
2022-10-09 11:19:51,405 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 11:19:51,405 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:19:51,405 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:19:51,405 - __main__ - INFO - TF_NUM_INTEROP_THREADS=2
2022-10-09 11:19:51,405 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=96
2022-10-09 11:19:51,405 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:19:51,405 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:19:51,405 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009111951_instance_0_cores_0-95.log
Usage of TCMalloc/Jemalloc/Default memory allocator¶
The memory allocator can influence performance. If you do not designate a memory allocator, the launch script searches for one in the order TCMalloc > JeMalloc > TensorFlow default memory allocator and uses the first one it finds.
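The search order can be illustrated with a short sketch. It is not the launcher's implementation: the shared-library name patterns are assumptions, `pick_allocator` is a hypothetical helper, and the directories to search are left to the caller.

```python
import glob
import os
import tempfile

# Assumed shared-library name patterns; adjust them to your installation.
ALLOCATOR_PATTERNS = [("tcmalloc", "libtcmalloc.so*"), ("jemalloc", "libjemalloc.so*")]


def pick_allocator(lib_dirs):
    """Return (name, path) of the first allocator found, else ("default", None).

    TCMalloc is tried in every directory before JeMalloc is considered.
    """
    for name, pattern in ALLOCATOR_PATTERNS:
        for lib_dir in lib_dirs:
            matches = sorted(glob.glob(os.path.join(lib_dir, pattern)))
            if matches:
                return name, matches[0]
    return "default", None


# Demonstration with a temporary directory that contains only a JeMalloc stub.
with tempfile.TemporaryDirectory() as lib_dir:
    open(os.path.join(lib_dir, "libjemalloc.so"), "w").close()
    print(pick_allocator([lib_dir])[0])
```

As the warning in the example logs indicates, a library that is found is made active through the LD_PRELOAD environment variable. When none is found, LD_PRELOAD is left unset and the default allocator is used.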
Jemalloc¶
Note: If you do not want to use the default setting, set MALLOC_CONF to your preferred value before running the launch script.
python -m intel_extension_for_tensorflow.python.launch --enable_jemalloc --log_path ./logs infer_resnet50.py
You can confirm the usage in the log file:
2022-10-09 11:27:20,549 - __main__ - INFO - Use JeMalloc memory allocator
2022-10-09 11:27:20,550 - __main__ - INFO - MALLOC_CONF=oversize_threshold:1,background_thread:true,metadata_thp:auto
2022-10-09 11:27:20,550 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 11:27:20,550 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:27:20,550 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:27:20,550 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:27:20,550 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=96
2022-10-09 11:27:20,550 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:27:20,550 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:27:20,550 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009112720_instance_0_cores_0-95.log
TCMalloc¶
python -m intel_extension_for_tensorflow.python.launch --enable_tcmalloc --log_path ./logs infer_resnet50.py
You can confirm the usage in the log file:
2022-10-09 11:29:05,206 - __main__ - INFO - Use TCMalloc memory allocator
2022-10-09 11:29:05,207 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 11:29:05,207 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:29:05,207 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:29:05,207 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:29:05,207 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=96
2022-10-09 11:29:05,207 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:29:05,207 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:29:05,207 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009112905_instance_0_cores_0-95.log
Default memory allocator¶
python -m intel_extension_for_tensorflow.python.launch --use_default_allocator --log_path ./logs infer_resnet50.py
You can confirm the usage in the log file:
2022-10-09 11:29:56,911 - __main__ - INFO - OMP_NUM_THREADS=96
2022-10-09 11:29:56,911 - __main__ - INFO - KMP_AFFINITY=granularity=fine,verbose,compact,1,0
2022-10-09 11:29:56,911 - __main__ - INFO - KMP_BLOCKTIME=1
2022-10-09 11:29:56,911 - __main__ - INFO - TF_NUM_INTEROP_THREADS=1
2022-10-09 11:29:56,911 - __main__ - INFO - TF_NUM_INTRAOP_THREADS=96
2022-10-09 11:29:56,911 - __main__ - INFO - TF_ENABLE_ONEDNN_OPTS=1
2022-10-09 11:29:56,911 - __main__ - INFO - ITEX_LAYOUT_OPT=0
2022-10-09 11:29:56,911 - __main__ - INFO - numactl --localalloc -C 0-95 <VIRTUAL_ENV>/bin/python -u infer_resnet50.py 2>&1 | tee ./logs/run_20221009112956_instance_0_cores_0-95.log