Performance
Overview
This page shows the performance gains achieved with Intel® Extension for PyTorch* on several popular model topologies.
Performance Data for Intel® AI Data Center Products
Find the latest performance data for 4th gen Intel® Xeon® Scalable processors and 3rd gen Intel® Xeon® processors, including detailed hardware and software configurations, in the Intel® Developer Zone article.
INT8 with v1.11
Performance Numbers
Hardware | Workload1 | Precision | Throughput Inference2 Batch Size | Throughput Inference2 Boost Ratio | Realtime Inference3 Batch Size | Realtime Inference3 Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters
---|---|---|---|---|---|---|---|---|---|---
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | INT8 | 80 | 1.83x | 1 | 1.44x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | SSD-ResNet34 | INT8 | 80 | 2.16x | 1 | 1.83x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ResNext 32x16d | INT8 | 80 | 1.81x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | VGG-11 | INT8 | 80 | 1.75x | 1 | 1.19x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ShuffleNetv2_x1.0 | INT8 | 80 | 2.07x | 1 | 1.47x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP
 | BERT-Large | INT8 | 80 | 2.78x | 1 | 2.04x | NLP | Squad | max_seq_len=384, Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts
 | Bert-Base | INT8 | 80 | 2.05x | 1 | 1.96x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts
 | DistilBERT-Base | INT8 | 80 | 2.12x | 1 | 1.57x | NLP | Squad | max_seq_len=384, Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with a single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
Note: Performance numbers with stock PyTorch are measured with its most performant configuration.
Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.
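The measurement setup described in footnotes 2 and 3 and the cache-capacity note can be sketched as a shell fragment. This is a sketch, not the original benchmark harness: the core counts come from the hardware configuration on this page, and actual per-instance pinning would use a tool such as `numactl` or `taskset`.

```shell
# oneDNN caches JIT-generated primitives; the numbers above were
# measured with the cache capacity raised to 1024.
export DNNL_PRIMITIVE_CACHE_CAPACITY=1024

# Throughput inference: one instance per socket (40 physical cores per
# socket on the Platinum 8380).
# Realtime inference: 4 cores per instance, so each socket can host
# 40 / 4 = 10 instances.
CORES_PER_SOCKET=40
CORES_PER_INSTANCE=4
echo "realtime instances per socket: $((CORES_PER_SOCKET / CORES_PER_INSTANCE))"
```

For example, the first 4-core realtime instance could be pinned with `taskset -c 0-3 python <inference script>` (script name elided; see the Model Zoo referenced in footnote 1).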
Accuracy
Workload | Metric | FP32 | INT8 | INT8/FP32 |
---|---|---|---|---|
BERT-Base (text classification) | f1 | 0.81 | 0.81 | 99.79% |
BERT-Large | f1 | 93.16 | 93.02 | 99.85% |
DistilBERT-Base | f1 | 86.84 | 86.13 | 99.19% |
ResNet50 | Top1 | 76.15 | 75.98 | 99.78% |
ResNext 32x16d | Top1 | 84.17 | 84.05 | 99.86% |
SSD-ResNet34 | mAP | 0.200 | 0.199 | 99.48% |
VGG-11 | Top1 | 69.04 | 67.96 | 98.44% |
ShuffleNetv2_x1.0 | Top1 | 69.36 | 67.92 | 97.93%1 |
1. ShuffleNet INT8 accuracy is expected to improve, without a performance trade-off, via a histogram-based calibration algorithm.
Configuration
Software Version
Software | Version |
---|---|
PyTorch | v1.11.0 |
Intel® Extension for PyTorch* | v1.11.0 |
Hardware Configuration
3rd Generation Intel® Xeon® Scalable Processors | |
---|---|
CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz |
Number of nodes | 1 |
Number of sockets | 2 |
Cores/Socket | 40 |
Threads/Core | 2 |
uCode | 0xd0002a0 |
Hyper-Threading | ON |
TurboBoost | ON |
BIOS version | 04.12.02 |
Number of DDR Memory slots | 16 |
Capacity of DDR memory per slot | 16GB |
DDR frequency | 3200 |
Total Memory/Node (DDR+DCPMM) | 256GB |
Host OS | CentOS Linux release 8.4.2105 |
Host Kernel | 4.18.0-305.10.2.el8_4.x86_64 |
Docker OS | Ubuntu 18.04.5 LTS |
Spectre-Meltdown Mitigation | Mitigated |
FP32 with v1.11.200 on an AWS EC2 C6i.2xlarge instance
Performance Numbers
Hardware | Workload1 | Precision | Throughput Inference2 Batch Size | Throughput Inference2 Boost Ratio | Realtime Inference3 Batch Size | Realtime Inference3 Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters
---|---|---|---|---|---|---|---|---|---|---
AWS EC2 C6i.2xlarge | ResNet50 | Float32 | 64 | 1.24x | 1 | 1.31x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ResNext 32x16d | Float32 | 64 | 1.07x | 1 | 1.05x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | VGG-11 | Float32 | 64 | 1.15x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ShuffleNetv2_x1.0 | Float32 | 64 | 1.12x | 1 | 1.30x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP
 | MobileNet v2 | Float32 | 64 | 1.08x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP
 | BERT-Large | Float32 | 64 | 1.05x | 1 | 1.03x | NLP | Squad | max_seq_len=384, Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 64
 | Bert-Base | Float32 | 64 | 1.08x | 1 | 1.09x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 128
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with a single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
Note: Performance numbers with stock PyTorch are measured with its most performant configuration.
Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.
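The `auto_kernel_selection` knob recommended for the BERT workloads above is exposed through `ipex.optimize`. A minimal, hedged sketch follows; the toy `Linear` model stands in for BERT, the import is guarded because the extension may not be installed, and the exact keyword should be verified against your installed version:

```python
# Sketch: enabling auto_kernel_selection when optimizing a model with
# Intel Extension for PyTorch*. Not the original benchmark code.
status = "skipped"
try:
    import torch
    import intel_extension_for_pytorch as ipex

    # Toy stand-in for a BERT model (hypothetical shapes).
    model = torch.nn.Linear(384, 384).eval()
    # Recommended in the table above when seq_len exceeds 64 (BERT-Large)
    # or 128 (Bert-Base).
    model = ipex.optimize(model, dtype=torch.float32,
                          auto_kernel_selection=True)
    status = "optimized"
except ImportError:
    status = "intel_extension_for_pytorch (or torch) not installed"
print(status)
```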
Configuration
Software Version
Software | Version |
---|---|
PyTorch | v1.11.0 |
Intel® Extension for PyTorch* | v1.11.200 |
FP32 and BFloat16 with v1.10
Performance Numbers
Hardware | Workload1 | Precision | Throughput Inference2 Batch Size | Throughput Inference2 Boost Ratio | Realtime Inference3 Batch Size | Realtime Inference3 Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters
---|---|---|---|---|---|---|---|---|---|---
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | Float32 | 80 | 1.39x | 1 | 1.35x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | SSD-ResNet34 | Float32 | 160 | 1.55x | 1 | 1.06x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ResNext 32x16d | Float32 | 80 | 1.08x | 1 | 1.08x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | Faster R-CNN ResNet50 FPN | Float32 | 80 | 1.71x | 1 | 1.07x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | VGG-11 | Float32 | 160 | 1.20x | 1 | 1.13x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts
 | ShuffleNetv2_x1.0 | Float32 | 160 | 1.32x | 1 | 1.20x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP
 | MobileNet v2 | Float32 | 160 | 1.48x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP
 | DLRM | Float32 | 80 | 1.11x | 1 | - | Recommendation | Terabyte | - | Default memory allocator; Intel(R) OpenMP; inference scripts
 | BERT-Large | Float32 | 80 | 1.14x | 1 | 1.02x | NLP | Squad | max_seq_len=384, Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 64
 | Bert-Base | Float32 | 160 | 1.10x | 1 | 1.33x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 128
Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Large | BFloat16 | 56 | 1.67x | 1 | 1.45x | NLP | Squad | max_seq_len=384, Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts
 | Bert-Base | BFloat16 | 112 | 1.77x | 1 | 1.18x | NLP | MRPC | max_seq_len=128, Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with a single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
Note: Performance numbers with stock PyTorch are measured with its most performant configuration.
Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.
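For the BFloat16 rows measured on the Platinum 8380H, BF16 inference with the extension typically combines `ipex.optimize(dtype=torch.bfloat16)` with CPU autocast. A hedged sketch (guarded import; the toy model is a stand-in for BERT, not the benchmarked workload):

```python
# Sketch: BFloat16 inference path with Intel Extension for PyTorch*.
# Not the original benchmark code; shapes are illustrative.
status = "skipped"
try:
    import torch
    import intel_extension_for_pytorch as ipex

    model = torch.nn.Linear(128, 128).eval()
    # Convert/prepare the model for BF16 execution.
    model = ipex.optimize(model, dtype=torch.bfloat16)
    # CPU autocast runs eligible ops in bfloat16.
    with torch.no_grad(), torch.cpu.amp.autocast():
        out = model(torch.randn(1, 128))
    status = f"ran, output dtype: {out.dtype}"
except ImportError:
    status = "intel_extension_for_pytorch (or torch) not installed"
print(status)
```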
Configuration
Software Version
Software | Version |
---|---|
PyTorch | v1.10.1 |
Intel® Extension for PyTorch* | v1.10.100 |
Hardware Configuration
3rd Generation Intel® Xeon® Scalable Processors | Products formerly Cooper Lake | |
---|---|---|
CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz |
Number of nodes | 1 | 1 |
Number of sockets | 2 | 2 |
Cores/Socket | 40 | 28 |
Threads/Core | 2 | 2 |
uCode | 0xd0002a0 | 0x700001c |
Hyper-Threading | ON | ON |
TurboBoost | ON | ON |
BIOS version | 04.12.02 | WLYDCRB1.SYS.0016.P29.2006080250 |
Number of DDR Memory slots | 16 | 12 |
Capacity of DDR memory per slot | 16GB | 64GB |
DDR frequency | 3200 | 3200 |
Total Memory/Node (DDR+DCPMM) | 256GB | 768GB |
Host OS | CentOS Linux release 8.4.2105 | Ubuntu 18.04.4 LTS |
Host Kernel | 4.18.0-305.10.2.el8_4.x86_64 | 4.15.0-76-generic |
Docker OS | Ubuntu 18.04.5 LTS | Ubuntu 18.04.5 LTS |
Spectre-Meltdown Mitigation | Mitigated | Mitigated |