# ResNet50 Training on Intel GPU

## Introduction

Intel® Extension for TensorFlow* is compatible with stock TensorFlow*. This example shows ResNet50 training.

## Hardware Requirements

Verified Hardware Platforms:

- Intel® Data Center GPU Max Series

## Prerequisites

### Model Code Change

The bf16 optimizations are provided in `resnet50.patch`, and Horovod and LARS support in `hvd_support.patch`. Clone the model code first, then apply the appropriate patch (see "Apply Patch" below):

```
git clone -b v2.14.0 https://github.com/tensorflow/models.git tensorflow-models
```

### Prepare for GPU (Skip this step for CPU)

Refer to [Prepare](../common_guide_running.html#Prepare)

### Setup Running Environment

* Setup for GPU

```bash
./pip_set_env.sh
```

### Enable Running Environment

Enable the oneAPI running environment (only for GPU) and the virtual running environment.

* For GPU, refer to [Running](../common_guide_running.html#Running)

### Apply Patch

#### Without Horovod

```
git apply path/to/configure/resnet50.patch
```

#### With Horovod

```
git apply path/to/hvd_configure/hvd_support.patch
```

#### Prepare ImageNet Dataset

**Using TFDS**

`classifier_trainer.py` supports ImageNet via [TensorFlow Datasets (TFDS)](https://www.tensorflow.org/datasets/overview). See the following [example snippet](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/scripts/download_and_prepare.py) for more information on how to use TFDS to download and prepare datasets, and specifically the [TFDS ImageNet readme](https://github.com/tensorflow/datasets/blob/master/docs/catalog/imagenet2012.html) for manual download instructions.

**Legacy TFRecords**

Download the ImageNet dataset and convert it to TFRecord format. The following [script](https://github.com/tensorflow/tpu/blob/master/tools/datasets/imagenet_to_gcs.py) and [README](https://github.com/tensorflow/tpu/tree/master/tools/datasets#imagenet_to_gcspy) provide a few options.
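After the conversion finishes, it is worth confirming the shard layout before starting a long training run. A minimal sketch, assuming the default output of the `imagenet_to_gcs.py` script linked above (1024 training shards and 128 validation shards, named like `train-00000-of-01024`); the helper name is our own, not part of the model repo:

```python
import glob
import os

def count_imagenet_shards(data_dir):
    """Count train/validation TFRecord shards in an ImageNet data directory."""
    train = glob.glob(os.path.join(data_dir, "train-*"))
    validation = glob.glob(os.path.join(data_dir, "validation-*"))
    return len(train), len(validation)

# With the default settings of imagenet_to_gcs.py, expect (1024, 128);
# other counts usually indicate an incomplete conversion.
```

Point `data_dir` at the directory you will later export as `DATA_DIR`.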
Note that the legacy ResNet runners, e.g. [resnet/resnet_ctl_imagenet_main.py](https://github.com/tensorflow/models/blob/v2.14.0/official/legacy/image_classification/resnet/resnet_ctl_imagenet_main.py), require TFRecords, whereas `classifier_trainer.py` can use either format by setting the builder to `'records'` or `'tfds'` in the configurations.

## Execution

### Set Model Parameters

There are several config YAML files in the `configure` and `hvd_configure` folders. Set one of them as `CONFIG_FILE`; the model will then run with `real data` or `dummy data` accordingly. For single-tile training, use a YAML file from the `configure` folder. For distributed training, use a YAML file from the `hvd_configure` folder: `itex_bf16_lars.yaml`/`itex_fp32_lars.yaml` for Horovod with real data, and `itex_dummy_bf16_lars.yaml`/`itex_dummy_fp32_lars.yaml` for Horovod with dummy data.

Export these parameters to your script or environment.

```
export PYTHONPATH=/the/path/to/tensorflow-models
MODEL_DIR=/the/path/to/output
DATA_DIR=/the/path/to/imagenet
CONFIG_FILE=path/to/itex_xx.yaml # itex_bf16.yaml/itex_fp32.yaml for accuracy, itex_dummy_bf16.yaml/itex_dummy_fp32.yaml for benchmark
```

### Command

```
if [ ! -d "$MODEL_DIR" ]; then
  mkdir -p $MODEL_DIR
else
  rm -rf $MODEL_DIR && mkdir -p $MODEL_DIR
fi

python ${PYTHONPATH}/official/legacy/image_classification/classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=$CONFIG_FILE
```
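The same launch logic (start from a clean `MODEL_DIR`, then invoke `classifier_trainer.py` with the flags shown above) can be sketched in Python, which is convenient when driving runs from a script. This is a hypothetical helper, not part of the model repo:

```python
import os
import shutil

def build_launch_cmd(pythonpath, model_dir, data_dir, config_file):
    """Recreate the output directory and return the classifier_trainer.py
    command line shown in the Command section."""
    # Mirror the shell guard: always start from an empty model_dir.
    shutil.rmtree(model_dir, ignore_errors=True)
    os.makedirs(model_dir)
    script = os.path.join(
        pythonpath, "official/legacy/image_classification/classifier_trainer.py")
    return [
        "python", script,
        "--mode=train_and_eval",
        "--model_type=resnet",
        "--dataset=imagenet",
        "--model_dir=" + model_dir,
        "--data_dir=" + data_dir,
        "--config_file=" + config_file,
    ]
```

The returned list can be passed to `subprocess.run` once `PYTHONPATH`, `MODEL_DIR`, `DATA_DIR`, and `CONFIG_FILE` are set as above.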
### Command with Horovod

Set `NUMBER_OF_PROCESS` and `PROCESS_PER_NODE` according to the number of Horovod ranks you need. The default below is a 2-rank task.

```
if [ ! -d "$MODEL_DIR" ]; then
  mkdir -p $MODEL_DIR
else
  rm -rf $MODEL_DIR && mkdir -p $MODEL_DIR
fi

NUMBER_OF_PROCESS=2
PROCESS_PER_NODE=2

mpirun -np $NUMBER_OF_PROCESS -ppn $PROCESS_PER_NODE --prepend-rank \
python ${PYTHONPATH}/official/legacy/image_classification/classifier_trainer.py \
  --mode=train_and_eval \
  --model_type=resnet \
  --dataset=imagenet \
  --model_dir=$MODEL_DIR \
  --data_dir=$DATA_DIR \
  --config_file=$CONFIG_FILE
```

## Example Output without hvd

```
I0203 02:48:01.006297 139660941027136 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 1900 and 2000
I0203 02:48:16.590331 139660941027136 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 2000 and 2100
I0203 02:48:32.178206 139660941027136 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 2100 and 2200
I0203 02:48:47.790128 139660941027136 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 2200 and 2300
I0203 02:49:03.408512 139660941027136 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 2300 and 2400
```

## Example Output with hvd

```
[0] I0817 00:09:07.602742 139898862851904 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 400 and 600
[1] I0817 00:09:07.603262 140612319840064 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 400 and 600
[0] I0817 00:10:07.917546 139898862851904 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 600 and 800
[1] I0817 00:10:07.917738 140612319840064 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 600 and 800
[0] I0817 00:11:08.277716 139898862851904 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 800 and 1000
[1] I0817 00:11:08.277811 140612319840064 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 800 and 1000
[0] I0817 00:12:08.555174 139898862851904 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 1000 and 1200
[1] I0817 00:12:08.555221 140612319840064 keras_utils.py:145] TimeHistory: xx seconds, xxxx examples/second between steps 1000 and 1200
```
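The `TimeHistory` lines above (with real numbers in place of the `xx`/`xxxx` placeholders) can be aggregated into an average throughput figure with a small script. A sketch, assuming the log format shown in the example output; the regex and helper name are our own:

```python
import re

# Matches keras_utils.py TimeHistory lines, e.g.
# "TimeHistory: 15.58 seconds, 3287.4 examples/second between steps 2000 and 2100"
TIMEHISTORY_RE = re.compile(
    r"TimeHistory: ([\d.]+) seconds, ([\d.]+) examples/second "
    r"between steps (\d+) and (\d+)")

def mean_throughput(log_text):
    """Average the examples/second readings across all TimeHistory lines."""
    rates = [float(m.group(2)) for m in TIMEHISTORY_RE.finditer(log_text)]
    return sum(rates) / len(rates) if rates else 0.0
```

For Horovod logs, the `--prepend-rank` prefix (`[0]`, `[1]`, ...) is simply part of the line and does not affect the match; note the per-rank readings are averaged together, so filter by rank prefix first if you want per-rank numbers.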