# Performance Data - [Overview](#overview) - [Models](#models) - [Training Workloads](#training-workloads) - [Inference Workloads](#inference-workloads) - [Training Accuracy Results](#training-accuracy-results) - [Training Accuracy on 1-node of 4x Intel Data Center GPU Max 1550](#training-accuracy-on-1-node-of-4x-intel-data-center-gpu-max-1550) - [Training Performance Results](#training-performance-results) - [Training Performance on 1-node of 4x Intel Data Center GPU Max 1550](#training-performance-on-1-node-of-4x-intel-data-center-gpu-max-1550) - [ResNet50v1-5 Training Performance Results](#resnet50v1-5-training-performance-results) - [BERT-Large Phase2 Training Performance Results](#bert-large-phase2-training-performance-results) - [Mask-RCNN Training Performance Results](#mask-rcnn-training-performance-results) - [Medical Image 3D U-Net Training Performance Results](#medical-image-3d-u-net-training-performance-results) - [Inference Performance Results](#inference-performance-results) - [Inference Performance on 1x Intel Data Center GPU Flex 170](#inference-performance-on-1x-intel-data-center-gpu-flex-170) - [ResNet50v1-5 Inference Performance Results](#resnet50v1-5-inference-performance-results) - [EfficientNet-B0 Inference Performance Results](#efficientnet-b0-inference-performance-results) - [EfficientNet-B3 Inference Performance Results](#efficientnet-b3-inference-performance-results) - [Mask-RCNN Inference Performance Results](#mask-rcnn-inference-performance-results) - [Stable Diffusion v1-4 Inference Performance Results](#stable-diffusion-v1-4-inference-performance-results) - [Configuration](#configuration) - [Software Configuration](#software-configuration) - [Software Configuration for Intel Max 1550 GPU](#software-configuration-for-intel-max-1550-gpu) - [Software Configuration for Intel Flex 170 GPU](#software-configuration-for-intel-flex-170-gpu) - [Hardware Configuration](#hardware-configuration) - [Hardware Configuration for Intel Max 1550 GPU](#hardware-configuration-for-intel-max-1550-gpu) - [Hardware Configuration for Intel Flex 170 GPU](#hardware-configuration-for-intel-flex-170-gpu) - [Additional Performance Data for Intel AI Data Center Products](#additional-performance-data-for-intel-ai-data-center-products) ## Overview This document demonstrates the training and inference performance as well as accuracy results on several popular AI workloads with Intel® Extension for TensorFlow\* benchmarked on Intel GPUs. You can easily reproduce these results following the guidlines in [examples](../../examples/README.html). ## Models The following tables provide the links where you can get the original code repository and step-by-step guide running on Intel GPUs for each model. ### Training Workloads |Model|Original Model Repo|ITEX Step-by-Step Guide| |-|-|-| |ResNet50v1.5|[TensorFlow-Models/ResNet50v1.5](https://github.com/tensorflow/models/tree/v2.14.0/official/legacy/image_classification/)|[Resnet50 train on Intel GPU](../../examples/train_resnet50/README.html)| |BERT-Large|[DeepLearningExamples/BERT](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/LanguageModeling/BERT/)|[Accelerate BERT-Large Pretraining on Intel GPU](../../examples/pretrain_bert/README.html) |Mask-RCNN|[DeepLearningExamples/Mask-RCNN](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN/)|[Accelerate Mask R-CNN Training on Intel GPU](../../examples/train_maskrcnn/README.html)| |3D-UNet|[DeepLearningExamples/3D-UNet](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Segmentation/UNet_3D_Medical/)|[Accelerate 3D-UNet Training for medical image segmentation on Intel GPU](../../examples/train_3d_unet/README.html)| ### Inference Workloads |Model|Original Model Repo|ITEX Step-by-Step Guide| |-|-|-| |ResNet50v1.5|[Intel-Reference-Models/ResNet50v1.5](https://github.com/IntelAI/models/tree/v3.1.0/models_v2/tensorflow/resnet50v1_5/inference/gpu/)|[ResNet50v1.5 Model Inference with Intel® Extention for TensorFlow\*](https://github.com/IntelAI/models/tree/v3.1.0/models_v2/tensorflow/resnet50v1_5/inference/gpu/)| |EfficientNet-B0|[Keras-Applications/EfficientNet](https://keras.io/api/applications/efficientnet/)|Use the exact same codes and instructions as in the orignal model repo| |EfficientNet-B3|[Keras-Applications/EfficientNet](https://keras.io/api/applications/efficientnet/)|Use the exact same codes and instructions as in the orignal model repo| |Mask-RCNN|[DeepLearningExamples/Mask-RCNN](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN/)|Use the exact same codes and instructions as in the orignal model repo| |Stable Diffusion v1-4|[KerasCV/Stable-Diffusion](https://github.com/keras-team/keras-cv/tree/master/keras_cv/src/models/stable_diffusion)|[Stable Diffusion Inference for Text2Image on Intel GPU](../../examples/stable_diffussion_inference/README.html)| ## Training Accuracy Results ### Training Accuracy on 1-node of 4x Intel Data Center GPU Max 1550 The following table shows the BERT-Large performance, training loss and time-to-train (TTT) results for both the pre-training and fine-tuning phases on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2-stack for each GPU). ||Pre-training Phase1|Pre-training Phase2|Fine-Tuning| |-|-|-|-| |**Dataset**|[Wikipedia](https://dumps.wikimedia.org/) and [BookCorpus](https://yknzhu.wixsite.com/mbweb/)|[Wikipedia](https://dumps.wikimedia.org/) and [BookCorpus](https://yknzhu.wixsite.com/mbweb/)|[SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) 1.1| |**Maximum Sequence Length**|128|512|384| |**Data Type**|BF16|BF16|BF16| |**Throughput (sequences/sec)**|3265.35|699.25|523.55| |**Time to Train (hours)**|39.32|20.40|0.67| |**Loss**|1.6047|1.3870|0.6867| ## Training Performance Results ### Training Performance on 1-node of 4x Intel Data Center GPU Max 1550 The following tables show the performance numbers for several popular training workloads on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2-stack for each GPU). For these workloads, we enable and benchmark both FP32 training and BF16 automatic mixed precision (AMP) training with 1-Stack of 1x Max 1550, 2-Stack of 1x Max 1550 as well as 4x Max 1550 (with 8 Stacks in total), to showcase the performance boost and scalability with Intel® Extension for TensorFlow\* and Intel® Optimization for Horovod\*. > **Note**: The training performance result on each workload below for `1x Max 1550 w/ 1-Stack` represents the minimum value of the performance results on 2 stacks of single GPU, with 2 instances initiated simultaneously, while each stack of the GPU executing the workload separately, without distributed training. #### ResNet50v1-5 Training Performance Results |GPUs|Ranks|Local Batch Size:
FP32, BF16|Training
Steps|Throughput w/
TF32 (images/sec)|Throughput w/
BF16 (images/sec)|Throughput Speedup
w/ AMP|Weak Scaling
w/ TF32|Weak Scaling
w/ BF16| |-|-|-|-|-|-|-|-|-| |1x Max 1550 w/ 1-Stack|1|256, 512|5000|918.96|1766.53|1.92x|1.00|1.00| |1x Max 1550 w/ 2-Stack|2|256, 512|5000|1762.76|3461.86|1.96x|1.92|1.96| |4x Max 1550|8|256, 256|5000|NA|12278.32|NA|NA|6.95| #### BERT-Large Phase2 Training Performance Results |GPUs|Ranks|Local
Batch Size
x Accumulation Steps|Training
Steps|Throughput
w/ TF32
(sequences/sec)|Throughput
w/ BF16
(sequences/sec)|Throughput Speedup
w/ AMP|Weak Scaling
w/ TF32|Weak Scaling
w/ BF16| |-|-|-|-|-|-|-|-|-| |1x Max 1550 w/ 1-Stack|1|32 x 30|20|36.22|93.22|2.57x|1.00|1.00| |1x Max 1550 w/ 2-Stack|2|32 x 30|20|74.40|182.57|2.45x|2.05|1.96| |4x Max 1550|8|32 x 30|20|NA|692.11|NA|NA|7.42| #### Mask-RCNN Training Performance Results |GPUs|Ranks|Local Batch Size|Training Steps|Throughput w/ BF16 (images/sec)|Weak Scaling w/ BF16| |-|-|-|-|-|-| |1x Max 1550 w/ 1-Stack|1|4|20|29.03|1.00| |1x Max 1550 w/ 2-Stack|2|4|20|55.51|1.91| #### Medical Image 3D U-Net Training Performance Results |GPUs|Ranks|Local Batch Size|Training Steps|Throughput w/ BF16 (samples/sec)|Weak Scaling w/ BF16| |-|-|-|-|-|-| |1x Max 1550 w/ 1-Stack|1|1|1000|12.81|1.00| |1x Max 1550 w/ 2-Stack|2|1|1000|23.56|1.84| |4x Max 1550|8|1|1000|87.07|6.80| ## Inference Performance Results ### Inference Performance on 1x Intel Data Center GPU Flex 170 The following tables show the performance numbers for several popular inference workloads on 1x Intel® Data Center GPU Flex 170 (150W PCIe, 1-stack for each GPU). >**Note**: Inference with online mode refers to running the workloads using 1 as the batch size, while inference with batch mode utilizes larger batch size. #### ResNet50v1-5 Inference Performance Results |GPUs|Dataset|Image Size|Mode|Batch Size|Data Type|Inference Steps|Throughput (images/sec)| |-|-|-|-|-|-|-|-| |1x Flex 170|Dummy|224x224|Online|1|INT8|5000|435.01| |1x Flex 170|Dummy|224x224|Batch|1024|INT8|5000|9842.75| #### EfficientNet-B0 Inference Performance Results |GPUs|Dataset|Image Size|Mode|Batch Size|Data Type|Inference Steps|Throughput (images/sec)| |-|-|-|-|-|-|-|-| |1x Flex 170|Dummy|224x224|Batch|64|FP16 (AMP)|50|3007.60| |1x Flex 170|Dummy|224x224|Batch|128|FP16 (AMP)|50|3587.29| #### EfficientNet-B3 Inference Performance Results |GPUs|Dataset|Image Size|Mode|Batch Size|Data Type|Inference Steps|Throughput (images/sec)| |-|-|-|-|-|-|-|-| |1x Flex 170|Dummy|300x300|Batch|64|FP16 (AMP)|50|928.56| |1x Flex 170|Dummy|300x300|Batch|128|FP16 (AMP)|50|968.83| #### Mask-RCNN Inference Performance Results |GPUs|Dataset|Mode|Batch Size|Data Type|Inference Steps|Throughput (images/sec)| |-|-|-|-|-|-|-| |1x Flex 170|COCO 2017|Online|1|FP16 (AMP)|5000|19.38| |1x Flex 170|COCO 2017|Batch|16|FP16 (AMP)|312|43.02| #### Stable Diffusion v1-4 Inference Performance Results |GPUs|Dataset|Output
Image Size|Mode|Batch Size|Data Type|Diffusion Steps|Throughput
(iterations/sec)|Throughput Speedup
w/ FP16| |-|-|-|-|-|-|-|-|-| |1x Flex 170|Text Prompt|512x512|Online|1|FP32|50|2.91|1.00x| |1x Flex 170|Text Prompt|512x512|Online|1|FP16 (pure)|50|6.53|2.24x| ## Configuration ### Software Configuration #### Software Configuration for Intel Max 1550 GPU |Software Component|Version| |-|-| |GPU Driver|[736.25](https://dgpu-docs.intel.com/releases/stable_736_25_20231031.html)| |Intel® oneAPI Base Toolkit|2024.0| |TensorFlow|v2.14.0| |Intel® Extension for TensorFlow\*|v2.14.0.1| |Intel® Optimization for Horovod\*|v0.28.1.2| #### Software Configuration for Intel Flex 170 GPU |Software Component|Version| |-|-| |GPU Driver|[736.25](https://dgpu-docs.intel.com/releases/stable_736_25_20231031.html)| |Intel® oneAPI Base Toolkit|2024.0| |TensorFlow|v2.14.0| |Intel® Extension for TensorFlow\*|v2.14.0.1| ### Hardware Configuration #### Hardware Configuration for Intel Max 1550 GPU |GPU System|4x Intel® Data Center GPU Max 1550| |-|-| |**Number of Nodes**|1| |**Xe®-Cores per GPU**|128 in total 2-Stack| |**Memory Size per GPU**|128 GB HBM2e in total 2-Stack| |**TDP per GPU**|600W| |**GPU ECC Setting**|OFF| |**Server Board**|Intel® Denali Pass D50DNP1SBB| |**OS**|SUSE Linux Enterprise Server 15 SP4| |**Kernel**|5.14.21-150400.24.69-default| |**CPU Model**|Intel® Xeon® Platinum 8480+ @ 2.00 GHz| |**Number of Sockets**|2| |**CPU Cores per Socket**|56| |**Hyper Threading**|ON| |**Turbo Boost**|ON| |**Automatic NUMA Balancing**|Enabled| |**CPU Frequency Governor**|Performance| |**TDP per CPU**|350W| |**Installed Memory**|1024GB (16x64GB 4800 MT/s DDR5)| |**NIC**|1x Intel® Ethernet Controller X710 for 10GBASE-T| |**Storage**|1x WD® WD_BLACK SN850X 2TB NVMe SSD| #### Hardware Configuration for Intel Flex 170 GPU |GPU System|1x Intel® Data Center GPU Flex 170| |-|-| |**Number of Nodes**|1| |**Xe®-Cores per GPU**|32| |**Memory Size per GPU**|16 GB GDDR6| |**TDP per GPU**|150W| |**GPU ECC Setting**|ON| |**Server Board**|Intel® Whitley| |**OS**|Ubuntu 22.04.3 LTS| |**Kernel**|5.15.0-57-generic| |**CPU Model**|Intel® Xeon® Gold 6336Y CPU @ 2.40GHz| |**Number of Sockets**|2| |**CPU Cores per Socket**|24| |**Hyper Threading**|ON| |**Turbo Boost**|ON| |**Automatic NUMA Balancing**|Enabled| |**CPU Frequency Governor**|Performance| |**TDP per CPU**|185W| |**Installed Memory**|128GB (8x16GB 3200 MT/s DDR4)| |**NIC**|2x Intel® Ethernet Controller X710 for 10GBASE-T,
1x Intel® 82574L Gigabit Ethernet Controller| |**Storage**|1x Intel® SSDSC2KG960G8,
1x Samsung® 870 EVO 1TB SSD| ## Additional Performance Data for Intel AI Data Center Products You can find the latest performance data on other Intel® AI Data Center Products such as 3rd, 4th, and 5th Gen Intel® Xeon® Scalable processors via [Performance Data for Intel® AI Data Center Products](https://www.intel.com/content/www/us/en/developer/topic-technology/artificial-intelligence/performance.html).