Performance
===========

## Overview

This page shows the performance boost from Intel® Extension for PyTorch\* on several popular topologies.

## Performance Data for Intel® AI Data Center Products

Find the latest performance data for 4th gen Intel® Xeon® Scalable processors and 3rd gen Intel® Xeon® processors, including detailed hardware and software configurations, in the [Intel® Developer Zone article](https://www.intel.com/content/www/us/en/developer/topic-technology/artificial-intelligence/performance.html).

## INT8 with v1.11

### Performance Numbers

| Hardware | Workload¹ | Precision | Throughput Inference² Batch Size | Throughput Inference² Boost Ratio | Realtime Inference³ Batch Size | Realtime Inference³ Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | INT8 | 80 | 1.83x | 1 | 1.44x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | SSD-ResNet34 | INT8 | 80 | 2.16x | 1 | 1.83x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNext 32x16d | INT8 | 80 | 1.81x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | VGG-11 | INT8 | 80 | 1.75x | 1 | 1.19x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ShuffleNetv2_x1.0 | INT8 | 80 | 2.07x | 1 | 1.47x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | BERT-Large | INT8 | 80 | 2.78x | 1 | 2.04x | NLP | Squad | max_seq_len=384; Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Bert-Base | INT8 | 80 | 2.05x | 1 | 1.96x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | DistilBERT-Base | INT8 | 80 | 2.12x | 1 | 1.57x | NLP | Squad | max_seq_len=384; Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |

| Workload | Metric | FP32 | INT8 | INT8/FP32 |
| :-: | :-: | :-: | :-: | :-: |
| BERT-base_text_classification | f1 | 0.81 | 0.81 | 99.79% |
| BERT-Large | f1 | 93.16 | 93.02 | 99.85% |
| Distilbert-base | f1 | 86.84 | 86.13 | 99.19% |
| ResNet50 | Top1 | 76.15 | 75.98 | 99.78% |
| ResNext 32x16d | Top1 | 84.17 | 84.05 | 99.86% |
| SSD-ResNet34 | mAP | 0.200 | 0.199 | 99.48% |
| VGG11 | Top1 | 69.04 | 67.96 | 98.44% |
| Shufflenetv2_x1.0 | Top1 | 69.36 | 67.92 | 97.93%¹ |

¹ ShuffleNet INT8 accuracy is expected to improve without a performance trade-off via the histogram calibration algorithm.
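The INT8/FP32 column in the accuracy table is the INT8 score expressed as a percentage of the FP32 score. The minimal sketch below recomputes it for three rows from the rounded values shown in the table; `accuracy_retention` is a hypothetical helper name used only for illustration. Recomputing from the rounded scores can differ from the published ratio in the last digit for some rows (for example Distilbert-base), presumably because the published ratios were derived from unrounded measurements.

```python
def accuracy_retention(fp32: float, int8: float) -> float:
    """Return the INT8 score as a percentage of the FP32 score."""
    return 100.0 * int8 / fp32


# (FP32, INT8) Top1 scores copied from the accuracy table.
rows = {
    "ResNet50": (76.15, 75.98),
    "ResNext 32x16d": (84.17, 84.05),
    "VGG11": (69.04, 67.96),
}

for name, (fp32, int8) in rows.items():
    print(f"{name}: {accuracy_retention(fp32, int8):.2f}%")
```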
### Configuration

#### Software Version

| Software | Version |
| :-: | :-: |
| PyTorch | [v1.11.0](https://pytorch.org/get-started/locally/) |
| Intel® Extension for PyTorch\* | [v1.11.0](https://github.com/intel/intel-extension-for-pytorch/releases) |

#### Hardware Configuration

| | 3rd Generation Intel® Xeon® Scalable Processors |
| :-: | :-: |
| CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz |
| Number of nodes | 1 |
| Number of sockets | 2 |
| Cores/Socket | 40 |
| Threads/Core | 2 |
| uCode | 0xd0002a0 |
| Hyper-Threading | ON |
| TurboBoost | ON |
| BIOS version | 04.12.02 |
| Number of DDR Memory slots | 16 |
| Capacity of DDR memory per slot | 16GB |
| DDR frequency | 3200 |
| Total Memory/Node (DDR+DCPMM) | 256GB |
| Host OS | CentOS Linux release 8.4.2105 |
| Host Kernel | 4.18.0-305.10.2.el8\_4.x86\_64 |
| Docker OS | Ubuntu 18.04.5 LTS |
| [Spectre-Meltdown Mitigation](https://github.com/speed47/spectre-meltdown-checker) | Mitigated |

## FP32 with v1.11.200 on an AWS EC2 C6i.2xlarge instance

### Performance Numbers

| Hardware | Workload¹ | Precision | Throughput Inference² Batch Size | Throughput Inference² Boost Ratio | Realtime Inference³ Batch Size | Realtime Inference³ Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| AWS EC2 C6i.2xlarge | ResNet50 | Float32 | 64 | 1.24x | 1 | 1.31x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| AWS EC2 C6i.2xlarge | ResNext 32x16d | Float32 | 64 | 1.07x | 1 | 1.05x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| AWS EC2 C6i.2xlarge | VGG-11 | Float32 | 64 | 1.15x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| AWS EC2 C6i.2xlarge | ShuffleNetv2_x1.0 | Float32 | 64 | 1.12x | 1 | 1.30x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| AWS EC2 C6i.2xlarge | MobileNet v2 | Float32 | 64 | 1.08x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| AWS EC2 C6i.2xlarge | BERT-Large | Float32 | 64 | 1.05x | 1 | 1.03x | NLP | Squad | max_seq_len=384; Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 64 |
| AWS EC2 C6i.2xlarge | Bert-Base | Float32 | 64 | 1.08x | 1 | 1.09x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 128 |
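The Tunable Parameters column recommends turning `auto_kernel_selection` ON once the sequence length exceeds a per-model threshold. The sketch below encodes that rule. The helper names are hypothetical, and the call `ipex.optimize(..., auto_kernel_selection=...)` assumes an installed version of the extension that accepts this keyword, so check the API documentation for your release. When the extension is not installed, the model is returned unchanged.

```python
# Thresholds copied from the "Tunable Parameters" column above.
SEQ_LEN_THRESHOLDS = {"BERT-Large": 64, "Bert-Base": 128}


def wants_auto_kernel_selection(workload: str, seq_len: int) -> bool:
    """Return True when the table recommends auto_kernel_selection=ON."""
    threshold = SEQ_LEN_THRESHOLDS.get(workload)
    return threshold is not None and seq_len > threshold


def optimize_for_inference(model, workload: str, seq_len: int):
    """Hypothetical wrapper: apply ipex.optimize if the extension is available."""
    try:
        import intel_extension_for_pytorch as ipex
    except ImportError:
        # Extension not installed: fall back to the unmodified model.
        return model
    # Assumption: this ipex version accepts the auto_kernel_selection keyword.
    return ipex.optimize(
        model.eval(),
        auto_kernel_selection=wants_auto_kernel_selection(workload, seq_len),
    )
```

With the table's own settings, `wants_auto_kernel_selection("BERT-Large", 384)` is true, while `wants_auto_kernel_selection("Bert-Base", 128)` is false because 128 does not exceed the threshold.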

| Hardware | Workload¹ | Precision | Throughput Inference² Batch Size | Throughput Inference² Boost Ratio | Realtime Inference³ Batch Size | Realtime Inference³ Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | Float32 | 80 | 1.39x | 1 | 1.35x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | SSD-ResNet34 | Float32 | 160 | 1.55x | 1 | 1.06x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNext 32x16d | Float32 | 80 | 1.08x | 1 | 1.08x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Faster R-CNN ResNet50 FPN | Float32 | 80 | 1.71x | 1 | 1.07x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | VGG-11 | Float32 | 160 | 1.20x | 1 | 1.13x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ShuffleNetv2_x1.0 | Float32 | 160 | 1.32x | 1 | 1.20x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | MobileNet v2 | Float32 | 160 | 1.48x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | DLRM | Float32 | 80 | 1.11x | 1 | - | Recommendation | Terabyte | - | Default memory allocator; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | BERT-Large | Float32 | 80 | 1.14x | 1 | 1.02x | NLP | Squad | max_seq_len=384; Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 64 |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Bert-Base | Float32 | 160 | 1.10x | 1 | 1.33x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; setting auto_kernel_selection to ON is recommended when seq_len exceeds 128 |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Large | BFloat16 | 56 | 1.67x | 1 | 1.45x | NLP | Squad | max_seq_len=384; Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | Bert-Base | BFloat16 | 112 | 1.77x | 1 | 1.18x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts |

¹ Model Zoo for Intel® Architecture

² Throughput inference runs with a single instance per socket.

³ Realtime inference runs with multiple instances, 4 cores per instance.
*Note:* Performance numbers with stock PyTorch are measured with its most performant configuration.

*Note:* Environment variable *DNNL_PRIMITIVE_CACHE_CAPACITY* is set to *1024*.

### Configuration

#### Software Version

| Software | Version |
| :-: | :-: |
| PyTorch | [v1.10.1](https://pytorch.org/get-started/locally/) |
| Intel® Extension for PyTorch\* | [v1.10.100](https://github.com/intel/intel-extension-for-pytorch/releases) |

#### Hardware Configuration

| | 3rd Generation Intel® Xeon® Scalable Processors | Products formerly Cooper Lake |
| :-: | :-: | :-: |
| CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz |
| Number of nodes | 1 | 1 |
| Number of sockets | 2 | 2 |
| Cores/Socket | 40 | 28 |
| Threads/Core | 2 | 2 |
| uCode | 0xd0002a0 | 0x700001c |
| Hyper-Threading | ON | ON |
| TurboBoost | ON | ON |
| BIOS version | 04.12.02 | WLYDCRB1.SYS.0016.P29.2006080250 |
| Number of DDR Memory slots | 16 | 12 |
| Capacity of DDR memory per slot | 16GB | 64GB |
| DDR frequency | 3200 | 3200 |
| Total Memory/Node (DDR+DCPMM) | 256GB | 768GB |
| Host OS | CentOS Linux release 8.4.2105 | Ubuntu 18.04.4 LTS |
| Host Kernel | 4.18.0-305.10.2.el8\_4.x86\_64 | 4.15.0-76-generic |
| Docker OS | Ubuntu 18.04.5 LTS | Ubuntu 18.04.5 LTS |
| [Spectre-Meltdown Mitigation](https://github.com/speed47/spectre-meltdown-checker) | Mitigated | Mitigated |
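Footnotes 2 and 3 describe how instances are laid out: one instance per socket for throughput runs, and 4 cores per instance for realtime runs. The sketch below works out that layout for the 2-socket, 40-cores-per-socket Platinum 8380 system in the hardware table. Several details are assumptions made only for illustration and are not stated on this page: the `plan_instances` and `Instance` names, contiguous physical core numbering per socket, pinning with `numactl`, the `OMP_NUM_THREADS` setting, and the script name `run_inference.py`. The `DNNL_PRIMITIVE_CACHE_CAPACITY` value comes from the note above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Instance:
    socket: int   # socket (NUMA node) the instance is bound to
    cores: range  # physical core IDs assigned to the instance

    def env(self) -> Dict[str, str]:
        """Environment variables for this instance (illustrative)."""
        return {
            "OMP_NUM_THREADS": str(len(self.cores)),
            "DNNL_PRIMITIVE_CACHE_CAPACITY": "1024",
        }

    def command(self, script: str) -> List[str]:
        """Build a numactl command line; assumes numactl is the pinning tool."""
        core_list = f"{self.cores.start}-{self.cores.stop - 1}"
        return ["numactl", "-C", core_list, "-m", str(self.socket), "python", script]


def plan_instances(sockets: int, cores_per_socket: int, cores_per_instance: int) -> List[Instance]:
    """Split each socket's cores into equally sized instances."""
    if cores_per_socket % cores_per_instance != 0:
        raise ValueError("cores_per_instance must divide cores_per_socket")
    plan = []
    for socket in range(sockets):
        # Assumption: socket s owns cores [s * cores_per_socket, (s + 1) * cores_per_socket).
        first = socket * cores_per_socket
        for start in range(first, first + cores_per_socket, cores_per_instance):
            plan.append(Instance(socket, range(start, start + cores_per_instance)))
    return plan


throughput_plan = plan_instances(sockets=2, cores_per_socket=40, cores_per_instance=40)
realtime_plan = plan_instances(sockets=2, cores_per_socket=40, cores_per_instance=4)
print(len(throughput_plan), "throughput instances;", len(realtime_plan), "realtime instances")
print(realtime_plan[10].command("run_inference.py"))
```

Under these assumptions the throughput layout yields 2 instances of 40 cores each, and the realtime layout yields 20 instances of 4 cores each. Verify the actual core numbering on your machine (for example with `lscpu`) before pinning.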