Distributed Training Example with Intel® Optimization for Horovod*

Model Information

| Use Case | Framework  | Model Repo        | Branch/Commit/Tag | Optional Patch |
| -------- | ---------- | ----------------- | ----------------- | -------------- |
| Training | TensorFlow | Tensorflow-Models | v2.8.0            | itex.yaml, itex_dummy.yaml, hvd_support_light.patch or hvd_support.patch |

Dependency

pip install gin gin-config tensorflow-addons tensorflow-model-optimization tensorflow-datasets
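
Before launching a multi-process run, it can help to confirm that Horovod's TensorFlow bindings import and initialize cleanly. A minimal sanity-check sketch (assuming Intel® Optimization for Horovod* is installed and importable as horovod):

import horovod.tensorflow as hvd

# Initialize Horovod and report this process's rank out of the total ranks.
hvd.init()
print(f"rank {hvd.rank()} of {hvd.size()} initialized")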

Model examples preparation

Model Repo

WORKSPACE=xxxx # set your workspace folder
cd $WORKSPACE
git clone -b v2.8.0 https://github.com/tensorflow/models.git tensorflow-models
cd tensorflow-models
git apply path/to/hvd_support_light.patch  # or path/to/hvd_support.patch

hvd_support_light.patch is the minimal change set. It makes the following modifications (a sketch of how they fit together follows this list):

  • hvd.init(): initializes Horovod, including allocation of communication resources.

  • tf.config.experimental.set_memory_growth(): with memory growth enabled, runtime initialization does not allocate all of the device's memory up front.

  • tf.config.experimental.set_visible_devices(): sets the list of devices visible to each process, so each rank is pinned to its own device.

  • strategy_scope: removes the native tf.distribute strategy, since Horovod takes over distribution.

  • hvd.DistributedOptimizer(): wraps the optimizer so gradients are averaged across workers.

  • dataset.shard(): all workers run the same code but on different data; the dataset is split evenly across workers by rank index.
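
The following minimal sketch (not the patch's exact code) shows how these calls fit together in a Keras training script. The device type "XPU" assumes Intel GPUs exposed through Intel® Extension for TensorFlow*; substitute "GPU" for other backends.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # Horovod initialization, including resource allocation

# Enable on-demand memory allocation and pin one device per local rank.
devices = tf.config.experimental.list_physical_devices("XPU")
for dev in devices:
    tf.config.experimental.set_memory_growth(dev, True)
if devices:
    tf.config.experimental.set_visible_devices(devices[hvd.local_rank()], "XPU")

# Same code on every worker, different data: shard the dataset by rank,
# then batch. (A toy dataset stands in for the real input pipeline.)
dataset = tf.data.Dataset.range(1024)
dataset = dataset.shard(num_shards=hvd.size(), index=hvd.rank()).batch(32)

# Instead of a native tf.distribute strategy, wrap the optimizer so Horovod
# all-reduces gradients across workers.
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(learning_rate=0.01))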

hvd_support.patch additionally enables the LARS optimizer (Layer-wise Adaptive Rate Scaling), described in the LARS paper, for large-batch training.
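
For reference, a simplified sketch of the LARS trust-ratio idea (this illustrates the update rule from the paper, not the patch's implementation; the eeta and weight_decay constants here are illustrative):

import tensorflow as tf

# One LARS step for a single layer: scale the learning rate by the trust
# ratio ||w|| / (||g|| + weight_decay * ||w||), falling back to 1.0 when
# either norm is zero.
def lars_step(w, g, lr=0.1, eeta=0.001, weight_decay=1e-4):
    w_norm = tf.norm(w)
    g_norm = tf.norm(g)
    trust_ratio = tf.where(
        tf.logical_and(w_norm > 0.0, g_norm > 0.0),
        eeta * w_norm / (g_norm + weight_decay * w_norm),
        1.0,
    )
    return w - lr * trust_ratio * (g + weight_decay * w)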

Download Dataset

Download the ImageNet dataset from https://image-net.org/download-images.php

Note: ImageNet is available only for non-commercial research and/or educational purposes.


Execution

Set Model Parameters

Export these parameters in your script or environment:

export PYTHONPATH=${WORKSPACE}/tensorflow-models
MODEL_DIR=${WORKSPACE}/output
DATA_DIR=${WORKSPACE}/imagenet_data/imagenet

CONFIG_FILE=path/to/itex.yaml
NUMBER_OF_PROCESS=2
PROCESS_PER_NODE=2

  • Download itex.yaml or itex_dummy.yaml and set one of them as CONFIG_FILE; the model then runs with real data or dummy data, respectively. The default is itex.yaml.

  • Set NUMBER_OF_PROCESS and PROCESS_PER_NODE according to the number of Horovod ranks you need. The default is a 2-rank task.
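
To inspect which sections a config file overrides before launching, one can load it directly. A sketch, assuming PyYAML is installed; the top-level keys vary with the yaml file you downloaded:

import yaml

# Print the top-level sections of the chosen config file.
with open("itex.yaml") as f:  # or itex_dummy.yaml
    cfg = yaml.safe_load(f)
for key, value in cfg.items():
    print(key, "->", type(value).__name__)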

HVD command

if [ ! -d "$MODEL_DIR" ]; then
    mkdir -p "$MODEL_DIR"
else
    rm -rf "$MODEL_DIR" && mkdir -p "$MODEL_DIR"
fi

mpirun -np $NUMBER_OF_PROCESS -ppn $PROCESS_PER_NODE --prepend-rank \
    python ${PYTHONPATH}/official/vision/image_classification/classifier_trainer.py \
    --mode=train_and_eval \
    --model_type=resnet \
    --dataset=imagenet \
    --model_dir=$MODEL_DIR \
    --data_dir=$DATA_DIR \
    --config_file=$CONFIG_FILE

OUTPUT

Performance Data

[1] I0909 03:33:23.323099 140645511436096 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 0 and 100
[0] I0909 03:33:23.324534 140611700504384 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 0 and 100
[0] I0909 03:33:43.037004 140611700504384 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 100 and 200
[1] I0909 03:33:43.037142 140645511436096 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 100 and 200
[1] I0909 03:34:03.213994 140645511436096 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 200 and 300
[0] I0909 03:34:03.214127 140611700504384 keras_utils.py:145] TimeHistory: xxxx seconds, xxxx examples/second between steps 200 and 300
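
Each log line is tagged with its Horovod rank ([0], [1], ...) by the --prepend-rank flag. To estimate total throughput for a step window, the per-rank examples/second values can be summed. A sketch, assuming the mpirun output above was saved to a file named train.log (a hypothetical name):

import re
from collections import defaultdict

# Match "[rank] ... TimeHistory: <sec> seconds, <eps> examples/second
# between steps <start> and <end>" lines from the mpirun output.
pattern = re.compile(
    r"\[(\d+)\].*TimeHistory: \S+ seconds, (\S+) examples/second "
    r"between steps (\d+) and (\d+)")

totals = defaultdict(float)  # (start, end) -> summed examples/second
with open("train.log") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            _, examples_per_sec, start, end = match.groups()
            totals[(int(start), int(end))] += float(examples_per_sec)

for (start, end), total in sorted(totals.items()):
    print(f"steps {start}-{end}: {total:.1f} examples/second across ranks")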