Accelerate Mask R-CNN Training on Intel GPU

Introduction

Intel® Extension for TensorFlow* is compatible with stock TensorFlow*. This example shows how to train Mask R-CNN on Intel GPU. It provides single-tile training scripts and multi-tile training scripts that use Horovod.

Install Intel® Extension for TensorFlow* in an existing TensorFlow environment, and TensorFlow will run the training on Intel GPU.

Hardware Requirements

Verified Hardware Platforms:

  • Intel® Data Center GPU Max Series

Prerequisites

Model Code change

To get better performance, do not use the official repository unmodified. Instead, check out the commit below and apply the provided patch:

git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/TensorFlow2/Segmentation/MaskRCNN
git checkout c481324031ecf0f70f8939516c02e16cac60446d
git apply patch  # Copy the patch file into this MaskRCNN directory before applying it.

Prepare for GPU

Refer to Prepare.

Setup Running Environment

You can use ./pip_set_env.sh to set up the environment for GPU. It performs the following two steps: creating a virtual environment and installing the Python packages.

  • Create Virtual Environment

python -m venv env_itex
source env_itex/bin/activate

  • Install

pip install --upgrade pip
pip install --upgrade intel-extension-for-tensorflow[xpu]
pip install opencv-python-headless pybind11
pip install pycocotools
pip install -e "git+https://github.com/NVIDIA/dllogger#egg=dllogger"

Enable Running Environment

Enable the oneAPI running environment (required only for GPU) and activate the virtual environment.
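
As a minimal sketch of this step, the commands below source the oneAPI environment script, activate the env_itex virtual environment created above, and then ask TensorFlow which XPU devices it can see. The default location /opt/intel/oneapi is an assumption about where oneAPI was installed, and ITEX_IMPORTABLE is a hypothetical helper variable used only in this sketch; adjust ONEAPI_ROOT to match your system.

```shell
# Assumption: oneAPI lives under /opt/intel/oneapi unless ONEAPI_ROOT says otherwise.
ONEAPI_ROOT="${ONEAPI_ROOT:-/opt/intel/oneapi}"

# Load the oneAPI compiler, oneMKL, and runtime library paths into this shell.
if [ -f "$ONEAPI_ROOT/setvars.sh" ]; then
    . "$ONEAPI_ROOT/setvars.sh"
else
    echo "setvars.sh not found under $ONEAPI_ROOT; adjust ONEAPI_ROOT" >&2
fi

# Activate the virtual environment created in the setup step, if it is present.
if [ -f env_itex/bin/activate ]; then
    . env_itex/bin/activate
fi

# Quick check: report the devices TensorFlow can see (an Intel GPU is listed as XPU).
# ITEX_IMPORTABLE is a hypothetical helper variable for this sketch.
if python -c "import intel_extension_for_tensorflow" 2>/dev/null; then
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('XPU'))" \
        || echo "TensorFlow failed to list XPU devices" >&2
    ITEX_IMPORTABLE=1
else
    echo "intel_extension_for_tensorflow is not importable in this environment" >&2
    ITEX_IMPORTABLE=0
fi
```

If the device list is empty or the import fails, re-check the GPU driver and oneAPI installation before starting a training run.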

Prepare Dataset

The commands in this and the following sections assume that the current directory is examples/train_maskrcnn/DeepLearningExamples/TensorFlow2/Segmentation/MaskRCNN.

cd dataset
bash download_and_preprocess_coco.sh ./data

Execute the Example

This example provides single-tile training scripts and multi-tile training scripts with Horovod. The datatype can be float32 (fp32) or bfloat16 (bf16).

DATASET_DIR=./data
OUTPUT_DIR=/the/path/to/output_dir

  • Single tile with fp32

python main.py train \
--data_dir $DATASET_DIR \
--model_dir=$OUTPUT_DIR \
--train_batch_size 4 \
--seed=0 --use_synthetic_data \
--epochs 1 --steps_per_epoch 20 --log_every=1 --log_warmup_steps=1

  • Single tile with bf16 (requires the --amp flag)

python main.py train \
--data_dir $DATASET_DIR \
--model_dir=$OUTPUT_DIR \
--train_batch_size 4 \
--amp --seed=0 --use_synthetic_data \
--epochs 1 --steps_per_epoch 20 --log_every=1 --log_warmup_steps=1

  • Multi-tile with Horovod

Install intel-optimization-for-horovod.

pip install intel-optimization-for-horovod

The default datatype is fp32. Add the --amp flag to train with bf16.

mpirun -np 2 -prepend-rank -ppn 2 \
python main.py train \
--data_dir $DATASET_DIR \
--model_dir=$OUTPUT_DIR \
--train_batch_size 4 \
--seed=0 --use_synthetic_data \
--epochs 1 --steps_per_epoch 20 --log_every=1 --log_warmup_steps=1
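
The bf16 variant of the multi-tile run differs from the command above only by the --amp flag. The sketch below makes that switch explicit. PRECISION and DRY_RUN are hypothetical helper variables introduced for this illustration and are not options of main.py; with the default DRY_RUN=1 the command is only printed so it can be inspected before launching.

```shell
# PRECISION and DRY_RUN are hypothetical helper variables for this sketch only.
PRECISION="${PRECISION:-bf16}"
DRY_RUN="${DRY_RUN:-1}"
DATASET_DIR="${DATASET_DIR:-./data}"
OUTPUT_DIR="${OUTPUT_DIR:-/the/path/to/output_dir}"

# bf16 is selected with --amp; fp32 is the default and needs no extra flag.
if [ "$PRECISION" = "bf16" ]; then AMP_FLAG="--amp"; else AMP_FLAG=""; fi

# With DRY_RUN=1 the command is printed instead of executed; set DRY_RUN=0 to launch it.
if [ "$DRY_RUN" = "1" ]; then LAUNCH="echo"; else LAUNCH=""; fi

# $LAUNCH and $AMP_FLAG are intentionally unquoted so that empty values disappear.
$LAUNCH mpirun -np 2 -prepend-rank -ppn 2 \
    python main.py train \
    --data_dir "$DATASET_DIR" \
    --model_dir="$OUTPUT_DIR" \
    --train_batch_size 4 \
    $AMP_FLAG --seed=0 --use_synthetic_data \
    --epochs 1 --steps_per_epoch 20 --log_every=1 --log_warmup_steps=1
```

Running it with PRECISION=fp32 reproduces the fp32 command shown above.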

Note: Only distributed workloads need intel-optimization-for-horovod. Uninstall it before running a single-tile workload.

FAQ

  1. If you see the following error, enable the oneAPI running environment as described in Enable Running Environment.

tensorflow.python.framework.errors_impl.NotFoundError: libmkl_sycl.so.2: cannot open shared object file: No such file or directory
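
One way to confirm this diagnosis is to check whether any directory on LD_LIBRARY_PATH contains the library named in the error. The helper below is a sketch for illustration; find_mkl_sycl is a hypothetical function name, not part of oneAPI or Intel® Extension for TensorFlow*, and it inspects only LD_LIBRARY_PATH rather than every location the dynamic loader searches.

```shell
# find_mkl_sycl: hypothetical helper that prints the first libmkl_sycl.so* found
# in a colon-separated search path (defaults to LD_LIBRARY_PATH).
find_mkl_sycl() {
    search_path="${1:-${LD_LIBRARY_PATH:-}}"
    old_ifs="$IFS"
    IFS=':'
    # Unquoted on purpose: split the search path into directories at each ':'.
    for dir in $search_path; do
        for lib in "$dir"/libmkl_sycl.so*; do
            # An unmatched glob stays literal, so confirm the file really exists.
            if [ -e "$lib" ]; then
                IFS="$old_ifs"
                echo "$lib"
                return 0
            fi
        done
    done
    IFS="$old_ifs"
    return 1
}

if found="$(find_mkl_sycl)"; then
    echo "found: $found"
else
    echo "libmkl_sycl not found on LD_LIBRARY_PATH; enable the oneAPI environment first"
fi
```

If the library is reported as missing, source the oneAPI environment in the same shell and run the check again.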