Performance Data
Overview
This document presents training and inference performance, as well as accuracy results, for several popular AI workloads benchmarked with Intel® Extension for TensorFlow* on Intel GPUs. You can reproduce these results by following the guidelines in examples.
Models
The following tables provide links to the original code repository and a step-by-step guide for running each model on Intel GPUs.
Training Workloads
Model | Original Model Repo | ITEX Step-by-Step Guide |
---|---|---|
ResNet50v1.5 | TensorFlow-Models/ResNet50v1.5 | Resnet50 train on Intel GPU |
BERT-Large | DeepLearningExamples/BERT | Accelerate BERT-Large Pretraining on Intel GPU |
Mask-RCNN | DeepLearningExamples/Mask-RCNN | Accelerate Mask R-CNN Training on Intel GPU |
3D-UNet | DeepLearningExamples/3D-UNet | Accelerate 3D-UNet Training for medical image segmentation on Intel GPU |
Inference Workloads
Model | Original Model Repo | ITEX Step-by-Step Guide |
---|---|---|
ResNet50v1.5 | Intel-Reference-Models/ResNet50v1.5 | ResNet50v1.5 Model Inference with Intel® Extension for TensorFlow* |
EfficientNet-B0 | Keras-Applications/EfficientNet | Use the exact same code and instructions as in the original model repo |
EfficientNet-B3 | Keras-Applications/EfficientNet | Use the exact same code and instructions as in the original model repo |
Mask-RCNN | DeepLearningExamples/Mask-RCNN | Use the exact same code and instructions as in the original model repo |
Stable Diffusion v1-4 | KerasCV/Stable-Diffusion | Stable Diffusion Inference for Text2Image on Intel GPU |
Training Accuracy Results
Training Accuracy on 1-node of 4x Intel Data Center GPU Max 1550
The following table shows the BERT-Large performance, training loss and time-to-train (TTT) results for both the pre-training and fine-tuning phases on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2-stack for each GPU).
Metric | Pre-training Phase1 | Pre-training Phase2 | Fine-Tuning |
---|---|---|---|
Dataset | Wikipedia and BookCorpus | Wikipedia and BookCorpus | SQuAD 1.1 |
Maximum Sequence Length | 128 | 512 | 384 |
Data Type | BF16 | BF16 | BF16 |
Throughput (sequences/sec) | 3265.35 | 699.25 | 523.55 |
Time to Train (hours) | 39.32 | 20.40 | 0.67 |
Loss | 1.6047 | 1.3870 | 0.6867 |
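Throughput and time-to-train are linked by a simple identity (total sequences processed equals throughput multiplied by elapsed time); as a quick sanity check, the fine-tuning column above implies roughly how much data was processed end to end:

```python
# Consistency check: total sequences processed = throughput * time-to-train.
# Values are taken from the fine-tuning column of the table above.
throughput_seq_per_sec = 523.55
ttt_hours = 0.67

total_sequences = throughput_seq_per_sec * ttt_hours * 3600
print(f"{total_sequences:,.0f} sequences")  # about 1.26 million
```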
Training Performance Results
Training Performance on 1-node of 4x Intel Data Center GPU Max 1550
The following tables show the performance numbers for several popular training workloads on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2 stacks per GPU). For these workloads, we enable and benchmark both FP32 training and BF16 automatic mixed precision (AMP) training on 1 stack of a single Max 1550, 2 stacks of a single Max 1550, and 4x Max 1550 (8 stacks in total), to showcase the performance boost and scalability delivered by Intel® Extension for TensorFlow* and Intel® Optimization for Horovod*.
Note: For each workload below, the 1x Max 1550 w/ 1-Stack result is the minimum of the results measured on the 2 stacks of a single GPU, with 2 instances launched simultaneously and each stack executing the workload independently, without distributed training.
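The ratio columns in the tables below (AMP speedup and weak scaling) are simple quotients of the reported throughputs. As an illustration, the ResNet50v1-5 single-GPU figures reproduce the table's ratios:

```python
# Derive the ratio columns from reported throughputs.
# Values taken from the ResNet50v1-5 training table below.
tf32_1stack, bf16_1stack = 918.96, 1766.53    # images/sec, 1-Stack
tf32_2stack, bf16_2stack = 1762.76, 3461.86   # images/sec, 2-Stack

amp_speedup_1stack = bf16_1stack / tf32_1stack   # BF16 vs TF32, same hardware
weak_scaling_bf16 = bf16_2stack / bf16_1stack    # 2-Stack vs 1-Stack baseline

print(f"{amp_speedup_1stack:.2f}x")  # 1.92x
print(f"{weak_scaling_bf16:.2f}")    # 1.96
```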
ResNet50v1-5 Training Performance Results
GPUs | Ranks | Local Batch Size (FP32, BF16) | Training Steps | Throughput w/ TF32 (images/sec) | Throughput w/ BF16 (images/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
---|---|---|---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 256, 512 | 5000 | 918.96 | 1766.53 | 1.92x | 1.00 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 256, 512 | 5000 | 1762.76 | 3461.86 | 1.96x | 1.92 | 1.96 |
4x Max 1550 | 8 | 256, 256 | 5000 | NA | 12278.32 | NA | NA | 6.95 |
BERT-Large Phase2 Training Performance Results
GPUs | Ranks | Local Batch Size x Accumulation Steps | Training Steps | Throughput w/ TF32 (sequences/sec) | Throughput w/ BF16 (sequences/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
---|---|---|---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 32 x 30 | 20 | 36.22 | 93.22 | 2.57x | 1.00 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 32 x 30 | 20 | 74.40 | 182.57 | 2.45x | 2.05 | 1.96 |
4x Max 1550 | 8 | 32 x 30 | 20 | NA | 692.11 | NA | NA | 7.42 |
Mask-RCNN Training Performance Results
GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (images/sec) | Weak Scaling w/ BF16 |
---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 4 | 20 | 29.03 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 4 | 20 | 55.51 | 1.91 |
Medical Image 3D U-Net Training Performance Results
GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (samples/sec) | Weak Scaling w/ BF16 |
---|---|---|---|---|---|
1x Max 1550 w/ 1-Stack | 1 | 1 | 1000 | 12.81 | 1.00 |
1x Max 1550 w/ 2-Stack | 2 | 1 | 1000 | 23.56 | 1.84 |
4x Max 1550 | 8 | 1 | 1000 | 87.07 | 6.80 |
Inference Performance Results
Inference Performance on 1x Intel Data Center GPU Flex 170
The following tables show the performance numbers for several popular inference workloads on 1x Intel® Data Center GPU Flex 170 (150W PCIe, 1-stack for each GPU).
Note: Inference in online mode runs the workload with a batch size of 1, while batch mode uses a larger batch size.
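The distinction matters because throughput is total images processed divided by elapsed time, so larger batches amortize per-step overhead. A minimal sketch (the elapsed times here are hypothetical placeholders, not measured results):

```python
# Throughput = images processed / elapsed time. The elapsed-time values
# below are hypothetical placeholders, not measured results.
def throughput(batch_size: int, steps: int, elapsed_sec: float) -> float:
    """Images per second over a benchmark run."""
    return batch_size * steps / elapsed_sec

online = throughput(batch_size=1, steps=5000, elapsed_sec=11.5)
batch = throughput(batch_size=1024, steps=5000, elapsed_sec=520.0)
print(f"online: {online:.0f} img/s, batch: {batch:.0f} img/s")
```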
ResNet50v1-5 Inference Performance Results
GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 224x224 | Online | 1 | INT8 | 5000 | 435.01 |
1x Flex 170 | Dummy | 224x224 | Batch | 1024 | INT8 | 5000 | 9842.75 |
EfficientNet-B0 Inference Performance Results
GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 224x224 | Batch | 64 | FP16 (AMP) | 50 | 3007.60 |
1x Flex 170 | Dummy | 224x224 | Batch | 128 | FP16 (AMP) | 50 | 3587.29 |
EfficientNet-B3 Inference Performance Results
GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|---|
1x Flex 170 | Dummy | 300x300 | Batch | 64 | FP16 (AMP) | 50 | 928.56 |
1x Flex 170 | Dummy | 300x300 | Batch | 128 | FP16 (AMP) | 50 | 968.83 |
Mask-RCNN Inference Performance Results
GPUs | Dataset | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
---|---|---|---|---|---|---|
1x Flex 170 | COCO 2017 | Online | 1 | FP16 (AMP) | 5000 | 19.38 |
1x Flex 170 | COCO 2017 | Batch | 16 | FP16 (AMP) | 312 | 43.02 |
Stable Diffusion v1-4 Inference Performance Results
GPUs | Dataset | Output Image Size | Mode | Batch Size | Data Type | Diffusion Steps | Throughput (iterations/sec) | Throughput Speedup w/ FP16 |
---|---|---|---|---|---|---|---|---|
1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP32 | 50 | 2.91 | 1.00x |
1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP16 (pure) | 50 | 6.53 | 2.24x |
Configuration
Software Configuration
Software Configuration for Intel Max 1550 GPU
Software Component | Version |
---|---|
GPU Driver | 736.25 |
Intel® oneAPI Base Toolkit | 2024.0 |
TensorFlow | v2.14.0 |
Intel® Extension for TensorFlow* | v2.14.0.1 |
Intel® Optimization for Horovod* | v0.28.1.2 |
Software Configuration for Intel Flex 170 GPU
Software Component | Version |
---|---|
GPU Driver | 736.25 |
Intel® oneAPI Base Toolkit | 2024.0 |
TensorFlow | v2.14.0 |
Intel® Extension for TensorFlow* | v2.14.0.1 |
Hardware Configuration
Hardware Configuration for Intel Max 1550 GPU
GPU System | 4x Intel® Data Center GPU Max 1550 |
---|---|
Number of Nodes | 1 |
Xe®-Cores per GPU | 128 (total across 2 stacks) |
Memory Size per GPU | 128 GB HBM2e (total across 2 stacks) |
TDP per GPU | 600W |
GPU ECC Setting | OFF |
Server Board | Intel® Denali Pass D50DNP1SBB |
OS | SUSE Linux Enterprise Server 15 SP4 |
Kernel | 5.14.21-150400.24.69-default |
CPU Model | Intel® Xeon® Platinum 8480+ @ 2.00 GHz |
Number of Sockets | 2 |
CPU Cores per Socket | 56 |
Hyper Threading | ON |
Turbo Boost | ON |
Automatic NUMA Balancing | Enabled |
CPU Frequency Governor | Performance |
TDP per CPU | 350W |
Installed Memory | 1024GB (16x64GB 4800 MT/s DDR5) |
NIC | 1x Intel® Ethernet Controller X710 for 10GBASE-T |
Storage | 1x WD® WD_BLACK SN850X 2TB NVMe SSD |
Hardware Configuration for Intel Flex 170 GPU
GPU System | 1x Intel® Data Center GPU Flex 170 |
---|---|
Number of Nodes | 1 |
Xe®-Cores per GPU | 32 |
Memory Size per GPU | 16 GB GDDR6 |
TDP per GPU | 150W |
GPU ECC Setting | ON |
Server Board | Intel® Whitley |
OS | Ubuntu 22.04.3 LTS |
Kernel | 5.15.0-57-generic |
CPU Model | Intel® Xeon® Gold 6336Y CPU @ 2.40GHz |
Number of Sockets | 2 |
CPU Cores per Socket | 24 |
Hyper Threading | ON |
Turbo Boost | ON |
Automatic NUMA Balancing | Enabled |
CPU Frequency Governor | Performance |
TDP per CPU | 185W |
Installed Memory | 128GB (8x16GB 3200 MT/s DDR4) |
NIC | 2x Intel® Ethernet Controller X710 for 10GBASE-T, 1x Intel® 82574L Gigabit Ethernet Controller |
Storage | 1x Intel® SSDSC2KG960G8, 1x Samsung® 870 EVO 1TB SSD |
Additional Performance Data for Intel AI Data Center Products
You can find the latest performance data for other Intel® AI Data Center Products, such as 3rd, 4th, and 5th Gen Intel® Xeon® Scalable processors, at Performance Data for Intel® AI Data Center Products.