Performance
Overview
This page shows the performance gains achieved with Intel® Extension for PyTorch* on several popular model topologies.
Performance Data for Intel® AI Data Center Products
Find the latest performance data for 4th gen Intel® Xeon® Scalable processors and 3rd gen Intel® Xeon® processors, including detailed hardware and software configurations, in the Intel® Developer Zone article.
INT8 with v1.11
Performance Numbers
Hardware | Workload1 | Precision | Throughput Inference2 Batch Size | Throughput Inference2 Boost Ratio | Realtime Inference3 Batch Size | Realtime Inference3 Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters
---|---|---|---|---|---|---|---|---|---|---
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | INT8 | 80 | 1.83x | 1 | 1.44x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | SSD-ResNet34 | INT8 | 80 | 2.16x | 1 | 1.83x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ResNext 32x16d | INT8 | 80 | 1.81x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | VGG-11 | INT8 | 80 | 1.75x | 1 | 1.19x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ShuffleNetv2_x1.0 | INT8 | 80 | 2.07x | 1 | 1.47x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP
 | BERT-Large | INT8 | 80 | 2.78x | 1 | 2.04x | NLP | Squad | max_seq_len=384, Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts
 | Bert-Base | INT8 | 80 | 2.05x | 1 | 1.96x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts
 | DistilBERT-Base | INT8 | 80 | 2.12x | 1 | 1.57x | NLP | Squad | max_seq_len=384, Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with a single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
Note: Performance numbers with stock PyTorch are measured with its most performant configuration.
Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.
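The measurement setup described in footnotes 2 and 3 and the cache-capacity note can be sketched as a shell fragment. This is a sketch, not the original benchmark harness: the core counts come from the hardware configuration on this page, and actual per-instance pinning would use a tool such as `numactl` or `taskset`.

```shell
# oneDNN caches JIT-generated primitives; the numbers above were
# measured with the cache capacity raised to 1024.
export DNNL_PRIMITIVE_CACHE_CAPACITY=1024

# Throughput inference: one instance per socket (40 physical cores per
# socket on the Platinum 8380).
# Realtime inference: 4 cores per instance, so each socket can host
# 40 / 4 = 10 instances.
CORES_PER_SOCKET=40
CORES_PER_INSTANCE=4
echo "realtime instances per socket: $((CORES_PER_SOCKET / CORES_PER_INSTANCE))"
```

For example, the first 4-core realtime instance could be pinned with `taskset -c 0-3 python <inference script>` (script name elided; see the Model Zoo referenced in footnote 1).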
Accuracy
Workload | Metric | FP32 | INT8 | INT8/FP32 |
---|---|---|---|---|
BERT-Base (text classification) | f1 | 0.81 | 0.81 | 99.79% |
BERT-Large | f1 | 93.16 | 93.02 | 99.85% |
DistilBERT-Base | f1 | 86.84 | 86.13 | 99.19% |
ResNet50 | Top1 | 76.15 | 75.98 | 99.78% |
ResNext 32x16d | Top1 | 84.17 | 84.05 | 99.86% |
SSD-ResNet34 | mAP | 0.200 | 0.199 | 99.48% |
VGG-11 | Top1 | 69.04 | 67.96 | 98.44% |
ShuffleNetv2_x1.0 | Top1 | 69.36 | 67.92 | 97.93%1 |
1. ShuffleNet INT8 accuracy is expected to improve, without a performance trade-off, via a histogram-based calibration algorithm.
Configuration
Software Version
Software | Version |
---|---|
PyTorch | v1.11.0 |
Intel® Extension for PyTorch* | v1.11.0 |
Hardware Configuration
3rd Generation Intel® Xeon® Scalable Processors | |
---|---|
CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz |
Number of nodes | 1 |
Number of sockets | 2 |
Cores/Socket | 40 |
Threads/Core | 2 |
uCode | 0xd0002a0 |
Hyper-Threading | ON |
TurboBoost | ON |
BIOS version | 04.12.02 |
Number of DDR Memory slots | 16 |
Capacity of DDR memory per slot | 16GB |
DDR frequency | 3200 |
Total Memory/Node (DDR+DCPMM) | 256GB |
Host OS | CentOS Linux release 8.4.2105 |
Host Kernel | 4.18.0-305.10.2.el8_4.x86_64 |
Docker OS | Ubuntu 18.04.5 LTS |
Spectre-Meltdown Mitigation | Mitigated |
FP32 with v1.11.200 on an AWS EC2 C6i.2xlarge instance
Performance Numbers
Hardware | Workload1 | Precision | Throughput Inference2 Batch Size | Throughput Inference2 Boost Ratio | Realtime Inference3 Batch Size | Realtime Inference3 Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters
---|---|---|---|---|---|---|---|---|---|---
AWS EC2 C6i.2xlarge | ResNet50 | Float32 | 64 | 1.24x | 1 | 1.31x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ResNext 32x16d | Float32 | 64 | 1.07x | 1 | 1.05x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | VGG-11 | Float32 | 64 | 1.15x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ShuffleNetv2_x1.0 | Float32 | 64 | 1.12x | 1 | 1.30x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP
 | MobileNet v2 | Float32 | 64 | 1.08x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP
 | BERT-Large | Float32 | 64 | 1.05x | 1 | 1.03x | NLP | Squad | max_seq_len=384, Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 64
 | Bert-Base | Float32 | 64 | 1.08x | 1 | 1.09x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 128
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with a single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
Note: Performance numbers with stock PyTorch are measured with its most performant configuration.
Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.
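The `auto_kernel_selection` knob recommended for the BERT workloads above is exposed through `ipex.optimize`. A minimal, hedged sketch follows; the toy `Linear` model stands in for BERT, the import is guarded because the extension may not be installed, and the exact keyword should be verified against your installed version:

```python
# Sketch: enabling auto_kernel_selection when optimizing a model with
# Intel Extension for PyTorch*. Not the original benchmark code.
status = "skipped"
try:
    import torch
    import intel_extension_for_pytorch as ipex

    # Toy stand-in for a BERT model (hypothetical shapes).
    model = torch.nn.Linear(384, 384).eval()
    # Recommended in the table above when seq_len exceeds 64 (BERT-Large)
    # or 128 (Bert-Base).
    model = ipex.optimize(model, dtype=torch.float32,
                          auto_kernel_selection=True)
    status = "optimized"
except ImportError:
    status = "intel_extension_for_pytorch (or torch) not installed"
print(status)
```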
Configuration
Software Version
Software | Version |
---|---|
PyTorch | v1.11.0 |
Intel® Extension for PyTorch* | v1.11.200 |
FP32 and BFloat16 with v1.10
Performance Numbers
Hardware | Workload1 | Precision | Throughput Inference2 Batch Size | Throughput Inference2 Boost Ratio | Realtime Inference3 Batch Size | Realtime Inference3 Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters
---|---|---|---|---|---|---|---|---|---|---
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | Float32 | 80 | 1.39x | 1 | 1.35x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | SSD-ResNet34 | Float32 | 160 | 1.55x | 1 | 1.06x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ResNext 32x16d | Float32 | 80 | 1.08x | 1 | 1.08x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | Faster R-CNN ResNet50 FPN | Float32 | 80 | 1.71x | 1 | 1.07x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | VGG-11 | Float32 | 160 | 1.20x | 1 | 1.13x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ShuffleNetv2_x1.0 | Float32 | 160 | 1.32x | 1 | 1.20x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP
 | MobileNet v2 | Float32 | 160 | 1.48x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP
 | DLRM | Float32 | 80 | 1.11x | 1 | - | Recommendation | Terabyte | - | Default memory allocator; Intel(R) OpenMP; inference scripts
 | BERT-Large | Float32 | 80 | 1.14x | 1 | 1.02x | NLP | Squad | max_seq_len=384, Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 64
 | Bert-Base | Float32 | 160 | 1.10x | 1 | 1.33x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 128
Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Large | BFloat16 | 56 | 1.67x | 1 | 1.45x | NLP | Squad | max_seq_len=384, Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts
 | Bert-Base | BFloat16 | 112 | 1.77x | 1 | 1.18x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with a single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
Note: Performance numbers with stock PyTorch are measured with its most performant configuration.
Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.
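For the BFloat16 rows measured on the Platinum 8380H, BF16 inference with the extension typically combines `ipex.optimize(dtype=torch.bfloat16)` with CPU autocast. A hedged sketch (guarded import; the toy model is a stand-in for BERT, not the benchmarked workload):

```python
# Sketch: BFloat16 inference path with Intel Extension for PyTorch*.
# Not the original benchmark code; shapes are illustrative.
status = "skipped"
try:
    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Linear(128, 128).eval()
    # Convert/prepare the model for BF16 execution.
    model = ipex.optimize(model, dtype=torch.bfloat16)
    # CPU autocast runs eligible ops in bfloat16.
    with torch.no_grad(), torch.cpu.amp.autocast():
        out = model(torch.randn(1, 128))
    status = f"ran, output dtype: {out.dtype}"
except ImportError:
    status = "intel_extension_for_pytorch (or torch) not installed"
print(status)
```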
Configuration
Software Version
Software | Version |
---|---|
PyTorch | v1.10.1 |
Intel® Extension for PyTorch* | v1.10.100 |
Hardware Configuration
3rd Generation Intel® Xeon® Scalable Processors | Products formerly Cooper Lake | |
---|---|---|
CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz |
Number of nodes | 1 | 1 |
Number of sockets | 2 | 2 |
Cores/Socket | 40 | 28 |
Threads/Core | 2 | 2 |
uCode | 0xd0002a0 | 0x700001c |
Hyper-Threading | ON | ON |
TurboBoost | ON | ON |
BIOS version | 04.12.02 | WLYDCRB1.SYS.0016.P29.2006080250 |
Number of DDR Memory slots | 16 | 12 |
Capacity of DDR memory per slot | 16GB | 64GB |
DDR frequency | 3200 | 3200 |
Total Memory/Node (DDR+DCPMM) | 256GB | 768GB |
Host OS | CentOS Linux release 8.4.2105 | Ubuntu 18.04.4 LTS |
Host Kernel | 4.18.0-305.10.2.el8_4.x86_64 | 4.15.0-76-generic |
Docker OS | Ubuntu 18.04.5 LTS | Ubuntu 18.04.5 LTS |
Spectre-Meltdown Mitigation | Mitigated | Mitigated |