Accelerate ResNet50 Training with XPUAutoShard on Intel GPU

Introduction

The XPUAutoShard feature of Intel® Extension for TensorFlow* automatically shards the input data across Intel® GPU devices. Currently, it supports placing the shards on multiple GPU tiles to maximize hardware utilization and improve performance.

This example shows ResNet50 training speedup with XPUAutoShard enabled.

Hardware Requirements

Verified Hardware Platforms:

  • Intel® Data Center GPU Max Series

Prerequisites

This example only applies to stock TensorFlow* >=2.12.0 and Intel® Extension for TensorFlow* >=1.2.0.
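
As an optional sanity check, you can confirm the installed versions before running the example. This is a minimal sketch, assuming both packages are installed in the active environment and expose __version__ in the usual way:

import tensorflow as tf
import intel_extension_for_tensorflow as itex

# This example expects TensorFlow >= 2.12.0 and Intel Extension for TensorFlow >= 1.2.0.
print("TensorFlow:", tf.__version__)
print("Intel Extension for TensorFlow:", itex.__version__)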

Prepare the Code

git clone https://github.com/tensorflow/models tf-models
cd tf-models
git checkout r2.12
git apply ../shard.patch

Prepare for GPU

Refer to the Prepare guide for GPU setup.

Install Other Required Packages

pip install -r official/requirements.txt

Enable Running Environment

Refer to the Running guide to enable the oneAPI running environment and the virtual running environment.

Setup PYTHONPATH

Modify /path/to/tf-models accordingly; here ~/tf-models is used as an example.

cd official/legacy/image_classification/resnet/
mkdir output
export PYTHONPATH=$PYTHONPATH:/path/to/tf-models:$PWD
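
Optionally, you can verify that the tf-models repository is now importable. A minimal check, assuming the PYTHONPATH above is set in the current shell:

# If this import fails, re-check the /path/to/tf-models entry in PYTHONPATH.
import official
print(official.__file__)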

Execute the Example with the Python API

Without XPUAutoShard

export TF_NUM_INTEROP_THREADS=<number of physical cores per socket>
export TF_NUM_INTRAOP_THREADS=<number of physical cores per socket>
export BS=256
python resnet_ctl_imagenet_main.py \
--num_gpus=1 \
--batch_size=$BS \
--train_epochs=1 \
--train_steps=30 \
--steps_per_loop=1 \
--log_steps=1 \
--skip_eval \
--use_synthetic_data=true \
--distribution_strategy=off \
--use_tf_while_loop=false \
--use_tf_function=true --enable_xla=false \
--enable_tensorboard=false --enable_checkpoint_and_export=false \
--data_format=channels_last --single_l2_loss_op=True \
--model_dir=output \
--dtype=bf16 2>&1 | tee resnet50.log

With XPUAutoShard

Python API

Intel® Extension for TensorFlow* provides Python APIs to enable the XPUAutoShard feature as follows:

import intel_extension_for_tensorflow as itex

# Configure sharding manually (auto_mode=False) for a single GPU card with 2 tiles.
config = itex.ShardingConfig()
config.auto_mode = False
device_gpu = config.devices.add()
device_gpu.device_type = "gpu"
device_gpu.device_num = 2      # number of devices (tiles) to shard across
device_gpu.batch_size = 256    # batch size on each device in each loop
device_gpu.stage_num = 10      # training loops per device in each iteration
graph_opts = itex.GraphOptions(sharding=itex.ON, sharding_config=config)
itex_cfg = itex.ConfigProto(graph_options=graph_opts)
# Apply the configuration before the model and dataset are built.
itex.set_config(itex_cfg)

Sharding Parameters Setting

In this example, the patch adds the above code to resnet_ctl_imagenet_main.py, so you can enable XPUAutoShard simply by adding --use_itex_sharding=True to the command line. You can optionally modify the following parameters in the ShardingConfig based on your needs.

  • device_num: 2 for Intel® Data Center GPU Max Series with 2 tiles
  • batch_size: the batch size on each device in each loop of each iteration
  • stage_num: the number of training loops on each device within each iteration, before the all-reduce and weight update on GPU devices; set it >=2 to improve scaling efficiency

The global batch size should be device_num * batch_size * stage_num. In this example, the default global batch size is 2 * 256 * 10 = 5120.
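
As a quick worked example of that relationship, the sketch below derives the global batch size passed via --batch_size from the three sharding parameters, using the defaults of this example:

# Sharding parameters from the list above (defaults in this example).
device_num = 2     # two tiles on Intel® Data Center GPU Max Series
batch_size = 256   # per-device batch size in each loop
stage_num = 10     # training loops per device in each iteration

# The --batch_size flag passed to resnet_ctl_imagenet_main.py is the global batch size.
global_batch_size = device_num * batch_size * stage_num
print(global_batch_size)  # 5120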

Further Settings

For further performance speedup, you can enable multi-stream by setting ITEX_ENABLE_MULTIPLE_STREAM=1, which creates multiple queues for each device.
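
If you prefer to set the variable from a launcher script instead of the shell, a minimal sketch is shown below; it assumes the variable must be set before TensorFlow and the extension initialize, so exporting it in the shell (as in the command below) remains the simplest option:

import os

# Set before importing TensorFlow / Intel Extension for TensorFlow so the
# value is visible when the extension initializes.
os.environ["ITEX_ENABLE_MULTIPLE_STREAM"] = "1"

import tensorflow as tf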

Executing Command

export TF_NUM_INTEROP_THREADS=<number of physical cores per socket>
export TF_NUM_INTRAOP_THREADS=<number of physical cores per socket>
export BS=5120
export ITEX_ENABLE_MULTIPLE_STREAM=1
python resnet_ctl_imagenet_main.py \
--num_gpus=1 \
--batch_size=$BS \
--train_epochs=1 \
--train_steps=30 \
--steps_per_loop=1 \
--log_steps=1 \
--skip_eval \
--use_synthetic_data=true \
--distribution_strategy=off \
--use_tf_while_loop=false \
--use_tf_function=true --enable_xla=false \
--enable_tensorboard=false --enable_checkpoint_and_export=false \
--data_format=channels_last --single_l2_loss_op=True \
--model_dir=output \
--dtype=bf16 \
--use_itex_sharding=true 2>&1 | tee resnet50_itex-shard.log

The following log line indicates that XPUAutoShard has been enabled successfully:
I itex/core/graph/tfg_optimizer_hook/tfg_optimizer_hook.cc:289] Run AutoShard pass successfully

Example Output

With successful execution, the script prints output similar to the following:

...
I0324 07:55:20.594147 140348344015936 keras_utils.py:145] TimeHistory: xxxxx seconds, xxxxx examples/second between steps 0 and 1
I0324 07:55:20.597360 140348344015936 controller.py:479] train | step:      1 | steps/sec:    xxxxx | output: {'train_accuracy': 0.0, 'train_loss': 12.634554}
I0324 07:55:22.161625 140348344015936 keras_utils.py:145] TimeHistory: xxxxx seconds, xxxxx examples/second between steps 1 and 2
I0324 07:55:22.163815 140348344015936 controller.py:479] train | step:      2 | steps/sec:    xxxxx | output: {'train_accuracy': 0.0, 'train_loss': 12.634554}
I0324 07:55:23.790632 140348344015936 keras_utils.py:145] TimeHistory: xxxxx seconds, xxxxx examples/second between steps 2 and 3
I0324 07:55:23.792936 140348344015936 controller.py:479] train | step:      3 | steps/sec:    xxxxx | output: {'train_accuracy': 1.0, 'train_loss': 9.103148}
I0324 07:55:25.416651 140348344015936 keras_utils.py:145] TimeHistory: xxxxx seconds, xxxxx examples/second between steps 3 and 4
I0324 07:55:25.419072 140348344015936 controller.py:479] train | step:      4 | steps/sec:    xxxxx | output: {'train_accuracy': 1.0, 'train_loss': 5.3359284}
I0324 07:55:27.025180 140348344015936 keras_utils.py:145] TimeHistory: xxxxx seconds, xxxxx examples/second between steps 4 and 5
I0324 07:55:27.027671 140348344015936 controller.py:479] train | step:      5 | steps/sec:    xxxxx | output: {'train_accuracy': 1.0, 'train_loss': 5.3343554}
...