Performance Data
Overview
This document presents training and inference performance, as well as accuracy results, for several popular AI workloads benchmarked with Intel® Extension for TensorFlow* on Intel GPUs. You can reproduce these results by following the guidelines in examples.
Models
The following tables provide links to the original code repository and a step-by-step guide for running each model on Intel GPUs.
Training Workloads
Model | Original Model Repo | ITEX Step-by-Step Guide |
---|---|---|
ResNet50v1.5 | TensorFlow-Models/ResNet50v1.5 | Resnet50 train on Intel GPU |
BERT-Large | DeepLearningExamples/BERT | Accelerate BERT-Large Pretraining on Intel GPU |
Mask-RCNN | DeepLearningExamples/Mask-RCNN | Accelerate Mask R-CNN Training on Intel GPU |
3D-UNet | DeepLearningExamples/3D-UNet | Accelerate 3D-UNet Training for medical image segmentation on Intel GPU |
Inference Workloads
Model | Original Model Repo | ITEX Step-by-Step Guide |
---|---|---|
ResNet50v1.5 | Intel-Reference-Models/ResNet50v1.5 | ResNet50v1.5 Model Inference with Intel® Extension for TensorFlow* |
EfficientNet-B0 | Keras-Applications/EfficientNet | Use the exact same code and instructions as in the original model repo |
EfficientNet-B3 | Keras-Applications/EfficientNet | Use the exact same code and instructions as in the original model repo |
Mask-RCNN | DeepLearningExamples/Mask-RCNN | Use the exact same code and instructions as in the original model repo |
Stable Diffusion v1-4 | KerasCV/Stable-Diffusion | Stable Diffusion Inference for Text2Image on Intel GPU |
Training Accuracy Results
Training Accuracy on 1-node of 4x Intel Data Center GPU Max 1550
The following table shows the BERT-Large performance, training loss and time-to-train (TTT) results for both the pre-training and fine-tuning phases on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2-stack for each GPU).
Metric | Pre-training Phase1 | Pre-training Phase2 | Fine-Tuning |
---|---|---|---|
Dataset | Wikipedia and BookCorpus | Wikipedia and BookCorpus | SQuAD 1.1 |
Maximum Sequence Length | 128 | 512 | 384 |
Data Type | BF16 | BF16 | BF16 |
Throughput (sequences/sec) | 3265.35 | 699.25 | 523.55 |
Time to Train (hours) | 39.32 | 20.40 | 0.67 |
Loss | 1.6047 | 1.3870 | 0.6867 |
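Throughput and time-to-train are linked by a simple identity (total sequences processed equals throughput multiplied by elapsed time); as a quick sanity check, the fine-tuning column above implies roughly how much data was processed end to end:

```python
# Consistency check: total sequences processed = throughput * time-to-train.
# Values are taken from the fine-tuning column of the table above.
throughput_seq_per_sec = 523.55
ttt_hours = 0.67

total_sequences = throughput_seq_per_sec * ttt_hours * 3600
print(f"{total_sequences:,.0f} sequences")  # about 1.26 million
```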
Training Performance Results
Training Performance on 1-node of 4x Intel Data Center GPU Max 1550
The following tables show the performance numbers for several popular training workloads on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2 stacks per GPU). For these workloads, we enable and benchmark both FP32 training and BF16 automatic mixed precision (AMP) training on 1 stack of a single Max 1550, 2 stacks of a single Max 1550, and 4x Max 1550 (8 stacks in total), to showcase the performance boost and scalability delivered by Intel® Extension for TensorFlow* and Intel® Optimization for Horovod*.
Note: For each workload below, the 1x Max 1550 w/ 1-Stack result is the minimum of the results measured on the 2 stacks of a single GPU, with 2 instances launched simultaneously and each stack executing the workload independently, without distributed training.
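The ratio columns in the tables below (AMP speedup and weak scaling) are simple quotients of the reported throughputs. As an illustration, the ResNet50v1-5 single-GPU figures reproduce the table's ratios:

```python
# Derive the ratio columns from reported throughputs.
# Values taken from the ResNet50v1-5 training table below.
tf32_1stack, bf16_1stack = 918.96, 1766.53    # images/sec, 1-Stack
tf32_2stack, bf16_2stack = 1762.76, 3461.86   # images/sec, 2-Stack

amp_speedup_1stack = bf16_1stack / tf32_1stack   # BF16 vs TF32, same hardware
weak_scaling_bf16 = bf16_2stack / bf16_1stack    # 2-Stack vs 1-Stack baseline

print(f"{amp_speedup_1stack:.2f}x")  # 1.92x
print(f"{weak_scaling_bf16:.2f}")    # 1.96
```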
ResNet50v1-5 Training Performance Results
GPUs | Ranks | Local Batch Size (FP32, BF16) | Training Steps | Throughput w/ TF32 (images/sec) | Throughput w/ BF16 (images/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
---|---|---|---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 256, 512 | 5000 | 918.96 | 1766.53 | 1.92x | 1.00 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 256, 512 | 5000 | 1762.76 | 3461.86 | 1.96x | 1.92 | 1.96 |
4x Max 1550 | 8 | 256, 256 | 5000 | NA | 12278.32 | NA | NA | 6.95 |
BERT-Large Phase2 Training Performance Results
GPUs | Ranks | Local Batch Size x Accumulation Steps | Training Steps | Throughput w/ TF32 (sequences/sec) | Throughput w/ BF16 (sequences/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
---|---|---|---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 32 x 30 | 20 | 36.22 | 93.22 | 2.57x | 1.00 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 32 x 30 | 20 | 74.40 | 182.57 | 2.45x | 2.05 | 1.96 |
4x Max 1550 | 8 | 32 x 30 | 20 | NA | 692.11 | NA | NA | 7.42 |
Mask-RCNN Training Performance Results
GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (images/sec) | Weak Scaling w/ BF16 |
---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 4 | 20 | 29.03 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 4 | 20 | 55.51 | 1.91 |
Medical Image 3D U-Net Training Performance Results
GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (samples/sec) | Weak Scaling w/ BF16 |
---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 1 | 1000 | 12.81 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 1 | 1000 | 23.56 | 1.84 |
4x Max 1550 | 8 | 1 | 1000 | 87.07 | 6.80 |
Inference Performance Results
Inference Performance on 1x Intel Data Center GPU Flex 170
The following tables show the performance numbers for several popular inference workloads on 1x Intel® Data Center GPU Flex 170 (150W PCIe, 1-stack for each GPU).
Note: Inference in online mode runs the workload with a batch size of 1, while batch mode uses a larger batch size.
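The distinction matters because throughput is total images processed divided by elapsed time, so larger batches amortize per-step overhead. A minimal sketch (the elapsed times here are hypothetical placeholders, not measured results):

```python
# Throughput = images processed / elapsed time. The elapsed-time values
# below are hypothetical placeholders, not measured results.
def throughput(batch_size: int, steps: int, elapsed_sec: float) -> float:
    """Images per second over a benchmark run."""
    return batch_size * steps / elapsed_sec

online = throughput(batch_size=1, steps=5000, elapsed_sec=11.5)
batch = throughput(batch_size=1024, steps=5000, elapsed_sec=520.0)
print(f"online: {online:.0f} img/s, batch: {batch:.0f} img/s")
```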
ResNet50v1-5 Inference Performance Results
GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 224x224 | Online | 1 | INT8 | 5000 | 435.01 |
1x Flex 170 | Dummy | 224x224 | Batch | 1024 | INT8 | 5000 | 9842.75 |
EfficientNet-B0 Inference Performance Results
GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 224x224 | Batch | 64 | FP16 (AMP) | 50 | 3007.60 |
1x Flex 170 | Dummy | 224x224 | Batch | 128 | FP16 (AMP) | 50 | 3587.29 |
EfficientNet-B3 Inference Performance Results
GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 300x300 | Batch | 64 | FP16 (AMP) | 50 | 928.56 |
1x Flex 170 | Dummy | 300x300 | Batch | 128 | FP16 (AMP) | 50 | 968.83 |
Mask-RCNN Inference Performance Results
GPUs | Dataset | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|
1x Flex 170 | COCO 2017 | Online | 1 | FP16 (AMP) | 5000 | 19.38 |
1x Flex 170 | COCO 2017 | Batch | 16 | FP16 (AMP) | 312 | 43.02 |
Stable Diffusion v1-4 Inference Performance Results
GPUs | Dataset | Output Image Size | Mode | Batch Size | Data Type | Diffusion Steps | Throughput (iterations/sec) | Throughput Speedup w/ FP16 |
---|---|---|---|---|---|---|---|---|
1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP32 | 50 | 2.91 | 1.00x |
1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP16 (pure) | 50 | 6.53 | 2.24x |
Configuration
Software Configuration
Software Configuration for Intel Max 1550 GPU
Software Component | Version |
---|---|
GPU Driver | 736.25 |
Intel® oneAPI Base Toolkit | 2024.0 |
TensorFlow | v2.14.0 |
Intel® Extension for TensorFlow* | v2.14.0.1 |
Intel® Optimization for Horovod* | v0.28.1.2 |
Software Configuration for Intel Flex 170 GPU
Software Component | Version |
---|---|
GPU Driver | 736.25 |
Intel® oneAPI Base Toolkit | 2024.0 |
TensorFlow | v2.14.0 |
Intel® Extension for TensorFlow* | v2.14.0.1 |
Hardware Configuration
Hardware Configuration for Intel Max 1550 GPU
GPU System | 4x Intel® Data Center GPU Max 1550 |
---|---|
Number of Nodes | 1 |
Xe®-Cores per GPU | 128 (total across 2 stacks) |
Memory Size per GPU | 128 GB HBM2e (total across 2 stacks) |
TDP per GPU | 600W |
GPU ECC Setting | OFF |
Server Board | Intel® Denali Pass D50DNP1SBB |
OS | SUSE Linux Enterprise Server 15 SP4 |
Kernel | 5.14.21-150400.24.69-default |
CPU Model | Intel® Xeon® Platinum 8480+ @ 2.00 GHz |
Number of Sockets | 2 |
CPU Cores per Socket | 56 |
Hyper Threading | ON |
Turbo Boost | ON |
Automatic NUMA Balancing | Enabled |
CPU Frequency Governor | Performance |
TDP per CPU | 350W |
Installed Memory | 1024GB (16x64GB 4800 MT/s DDR5) |
NIC | 1x Intel® Ethernet Controller X710 for 10GBASE-T |
Storage | 1x WD® WD_BLACK SN850X 2TB NVMe SSD |
Hardware Configuration for Intel Flex 170 GPU
GPU System | 1x Intel® Data Center GPU Flex 170 |
---|---|
Number of Nodes | 1 |
Xe®-Cores per GPU | 32 |
Memory Size per GPU | 16 GB GDDR6 |
TDP per GPU | 150W |
GPU ECC Setting | ON |
Server Board | Intel® Whitley |
OS | Ubuntu 22.04.3 LTS |
Kernel | 5.15.0-57-generic |
CPU Model | Intel® Xeon® Gold 6336Y CPU @ 2.40GHz |
Number of Sockets | 2 |
CPU Cores per Socket | 24 |
Hyper Threading | ON |
Turbo Boost | ON |
Automatic NUMA Balancing | Enabled |
CPU Frequency Governor | Performance |
TDP per CPU | 185W |
Installed Memory | 128GB (8x16GB 3200 MT/s DDR4) |
NIC | 2x Intel® Ethernet Controller X710 for 10GBASE-T, 1x Intel® 82574L Gigabit Ethernet Controller |
Storage | 1x Intel® SSDSC2KG960G8, 1x Samsung® 870 EVO 1TB SSD |
Additional Performance Data for Intel AI Data Center Products
You can find the latest performance data for other Intel® AI Data Center Products, such as 3rd, 4th, and 5th Gen Intel® Xeon® Scalable processors, at Performance Data for Intel® AI Data Center Products.