Performance
Overview
This page shows the performance boost of Intel® Extension for PyTorch* on several popular topologies.
Performance Data for Intel® AI Data Center Products
Find the latest performance data for 4th gen Intel® Xeon® Scalable processors and 3rd gen Intel® Xeon® processors, including detailed hardware and software configurations, in the Intel® Developer Zone article.
LLM Performance
We benchmarked LLaMA2 7B, LLaMA2 13B, and GPT-J 6B with test input token lengths of 256 and 1024. The tests were carried out on AWS M7i and M6i instances. The CPUs of M6i instances are 3rd Gen Intel® Xeon® Processors, which lack the AMX instructions that accelerate BF16 computation, so we benchmarked with FP32 precision instead of BF16 on M6i instances.
Comparing LLM inference performance on these two instance types, M7i, with 4th Gen Xeon® processors, has a remarkable performance advantage over M6i, with 3rd Gen Xeon® processors.
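For reference, below is a minimal sketch of the kind of setup being compared, assuming a Hugging Face LLaMA2 checkpoint; the actual benchmark harness is not part of this page, and the model id and generation parameters here are illustrative.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; 13B and GPT-J 6B are analogous
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# BF16 on AMX-capable CPUs (M7i, 4th Gen Xeon); FP32 on M6i (3rd Gen Xeon, no AMX).
dtype = torch.bfloat16  # set to torch.float32 on CPUs without AMX
model = ipex.optimize(model, dtype=dtype)

prompt = "A short test prompt"  # the benchmark used 256- and 1024-token inputs
inputs = tokenizer(prompt, return_tensors="pt")
with torch.inference_mode(), torch.cpu.amp.autocast(enabled=dtype == torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```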
M7i performance boost ratio over M6i for non-quantized (BF16 or FP32) models:
Model | Speedup | Throughput |
---|---|---|
LLaMA2 7B | 2.47x | 2.62x |
LLaMA2 13B | 2.57x | 2.62x |
GPT-J 6B | 2.58x | 2.85x |
M7i performance boost ratio over M6i for INT8 quantized models:
Model | Speedup | Throughput |
---|---|---|
LLaMA2 7B | 1.27x | 1.38x |
LLaMA2 13B | 1.27x | 1.27x |
GPT-J 6B | 1.29x | 1.36x |
We can also conclude that a larger batch size improves the serving capacity of the model at the cost of longer response latency for individual sessions. The following table shows that, for the INT8 quantized LLaMA2 7B model on M7i instances, batch_size=8 increases total throughput by 6.47x compared with batch_size=1, while P90 token latency grows by 1.26x.
Batch size | Decoder latency (ms) | Total tokens per sec |
---|---|---|
1 | 39 | 26.32 |
8 | 49 | 170.21 |
Ratio | 1.26x | 6.47x |
Note: Measured by Intel on 17 Aug 2023; M7i.16xlarge and M6i.16xlarge instances in us-west-2; OS: Ubuntu 22.04 LTS, kernel 6.2.0-1009-aws; SW: PyTorch* 2.1 and Intel® Extension for PyTorch* 2.1/llm_feature_branch.
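An illustrative throughput measurement for the batch-size trade-off above, assuming `model`, `tokenizer`, and `prompt` are set up as in the previous sketch; this is not the harness used for the published numbers.

```python
import time

import torch

def tokens_per_sec(model, tokenizer, prompt, batch_size, new_tokens=128):
    # Repeat one prompt to form a batch and time end-to-end generation.
    inputs = tokenizer([prompt] * batch_size, return_tensors="pt")
    with torch.inference_mode():
        start = time.time()
        model.generate(
            **inputs,
            max_new_tokens=new_tokens,
            min_new_tokens=new_tokens,  # avoid early EOS skewing the timing
        )
        elapsed = time.time() - start
    return batch_size * new_tokens / elapsed

tp1 = tokens_per_sec(model, tokenizer, prompt, batch_size=1)
tp8 = tokens_per_sec(model, tokenizer, prompt, batch_size=8)
print(f"throughput gain at batch 8: {tp8 / tp1:.2f}x")  # ~6.47x in the table above
```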
INT8 with v1.11
Performance Numbers
Hardware | Workload¹ | Precision | Throughput Inference² Batch Size | Throughput Inference² Boost Ratio | Realtime Inference³ Batch Size | Realtime Inference³ Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
---|---|---|---|---|---|---|---|---|---|---|
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | INT8 | 80 | 1.83x | 1 | 1.44x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | SSD-ResNet34 | INT8 | 80 | 2.16x | 1 | 1.83x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNext 32x16d | INT8 | 80 | 1.81x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | VGG-11 | INT8 | 80 | 1.75x | 1 | 1.19x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ShuffleNetv2_x1.0 | INT8 | 80 | 2.07x | 1 | 1.47x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | BERT-Large | INT8 | 80 | 2.78x | 1 | 2.04x | NLP | Squad | max_seq_len=384; Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Bert-Base | INT8 | 80 | 2.05x | 1 | 1.96x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | DistilBERT-Base | INT8 | 80 | 2.12x | 1 | 1.57x | NLP | Squad | max_seq_len=384; Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
Note: Performance numbers with stock PyTorch are measured with its most performant configuration.
Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.
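For reference, here is a minimal static INT8 quantization sketch. It uses the prepare/convert API documented in later Intel® Extension for PyTorch* releases; the v1.11 quantization API differed in detail, and the published numbers come from the Model Zoo scripts referenced above.

```python
import os

# Match the note above: oneDNN primitive cache capacity of 1024.
os.environ["DNNL_PRIMITIVE_CACHE_CAPACITY"] = "1024"

import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

model = models.resnet50(pretrained=True).eval()
example_input = torch.rand(1, 3, 224, 224)  # matches the input shape in the table

qconfig = ipex.quantization.default_static_qconfig
prepared = prepare(model, qconfig, example_inputs=example_input, inplace=False)

# Calibrate on a handful of representative batches (random data here is only
# a placeholder; real calibration uses samples from the target dataset).
for _ in range(10):
    prepared(torch.rand(1, 3, 224, 224))

quantized = convert(prepared)
with torch.no_grad():
    traced = torch.jit.trace(quantized, example_input)
    traced = torch.jit.freeze(traced)
    out = traced(example_input)
```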
Accuracy
Workload | Metric | FP32 | INT8 | INT8/FP32 |
---|---|---|---|---|
BERT-base_text_classification | f1 | 0.81 | 0.81 | 99.79% |
BERT-Large | f1 | 93.16 | 93.02 | 99.85% |
DistilBERT-Base | f1 | 86.84 | 86.13 | 99.19% |
ResNet50 | Top1 | 76.15 | 75.98 | 99.78% |
ResNext 32x16d | Top1 | 84.17 | 84.05 | 99.86% |
SSD-ResNet34 | mAP | 0.200 | 0.199 | 99.48% |
VGG-11 | Top1 | 69.04 | 67.96 | 98.44% |
ShuffleNetv2_x1.0 | Top1 | 69.36 | 67.92 | 97.93%¹ |
1. ShuffleNet INT8 accuracy is expected to improve, without a performance trade-off, via a histogram-based calibration algorithm.
Configuration
Software Version
Software | Version |
---|---|
PyTorch | v1.11.0 |
Intel® Extension for PyTorch* | v1.11.0 |
Hardware Configuration
3rd Generation Intel® Xeon® Scalable Processors | |
---|---|
CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz |
Number of nodes | 1 |
Number of sockets | 2 |
Cores/Socket | 40 |
Threads/Core | 2 |
uCode | 0xd0002a0 |
Hyper-Threading | ON |
TurboBoost | ON |
BIOS version | 04.12.02 |
Number of DDR Memory slots | 16 |
Capacity of DDR memory per slot | 16GB |
DDR frequency | 3200 |
Total Memory/Node (DDR+DCPMM) | 256GB |
Host OS | CentOS Linux release 8.4.2105 |
Host Kernel | 4.18.0-305.10.2.el8_4.x86_64 |
Docker OS | Ubuntu 18.04.5 LTS |
Spectre-Meltdown Mitigation | Mitigated |
FP32 with v1.11.200 on an AWS EC2 C6i.2xlarge instance
Performance Numbers
Hardware | Workload¹ | Precision | Throughput Inference² Batch Size | Throughput Inference² Boost Ratio | Realtime Inference³ Batch Size | Realtime Inference³ Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
---|---|---|---|---|---|---|---|---|---|---|
AWS EC2 C6i.2xlarge | ResNet50 | Float32 | 64 | 1.24x | 1 | 1.31x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
AWS EC2 C6i.2xlarge | ResNext 32x16d | Float32 | 64 | 1.07x | 1 | 1.05x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
AWS EC2 C6i.2xlarge | VGG-11 | Float32 | 64 | 1.15x | 1 | 1.21x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
AWS EC2 C6i.2xlarge | ShuffleNetv2_x1.0 | Float32 | 64 | 1.12x | 1 | 1.30x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
AWS EC2 C6i.2xlarge | MobileNet v2 | Float32 | 64 | 1.08x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
AWS EC2 C6i.2xlarge | BERT-Large | Float32 | 64 | 1.05x | 1 | 1.03x | NLP | Squad | max_seq_len=384; Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; recommended to set auto_kernel_selection to ON when seq_len exceeds 64 |
AWS EC2 C6i.2xlarge | Bert-Base | Float32 | 64 | 1.08x | 1 | 1.09x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; recommended to set auto_kernel_selection to ON when seq_len exceeds 128 |
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
Note: Performance numbers with stock PyTorch are measured with its most performant configuration.
Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.
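Several BERT rows above recommend turning auto_kernel_selection on for long sequences. A minimal sketch of that setting follows; the model and random input here are illustrative, not the benchmark configuration.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import BertModel

model = BertModel.from_pretrained("bert-large-uncased").eval()

# seq_len=384 exceeds 64, so the table recommends auto_kernel_selection=True.
model = ipex.optimize(model, dtype=torch.float32, auto_kernel_selection=True)

input_ids = torch.randint(0, 30522, (1, 384))  # random token ids as a placeholder
with torch.no_grad():
    output = model(input_ids)
```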
Configuration
Software Version
Software | Version |
---|---|
PyTorch | v1.11.0 |
Intel® Extension for PyTorch* | v1.11.200 |
FP32 and BFloat16 with v1.10
Performance Numbers
Hardware | Workload¹ | Precision | Throughput Inference² Batch Size | Throughput Inference² Boost Ratio | Realtime Inference³ Batch Size | Realtime Inference³ Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
---|---|---|---|---|---|---|---|---|---|---|
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | Float32 | 80 | 1.39x | 1 | 1.35x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | SSD-ResNet34 | Float32 | 160 | 1.55x | 1 | 1.06x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNext 32x16d | Float32 | 80 | 1.08x | 1 | 1.08x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Faster R-CNN ResNet50 FPN | Float32 | 80 | 1.71x | 1 | 1.07x | Computer Vision | COCO | Input shape [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | VGG-11 | Float32 | 160 | 1.20x | 1 | 1.13x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ShuffleNetv2_x1.0 | Float32 | 160 | 1.32x | 1 | 1.20x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | MobileNet v2 | Float32 | 160 | 1.48x | 1 | 1.12x | Computer Vision | ImageNet | Input shape [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | DLRM | Float32 | 80 | 1.11x | 1 | - | Recommendation | Terabyte | - | Default memory allocator; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | BERT-Large | Float32 | 80 | 1.14x | 1 | 1.02x | NLP | Squad | max_seq_len=384; Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; recommended to set auto_kernel_selection to ON when seq_len exceeds 64 |
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Bert-Base | Float32 | 160 | 1.10x | 1 | 1.33x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; recommended to set auto_kernel_selection to ON when seq_len exceeds 128 |
Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Large | BFloat16 | 56 | 1.67x | 1 | 1.45x | NLP | Squad | max_seq_len=384; Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | Bert-Base | BFloat16 | 112 | 1.77x | 1 | 1.18x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts |
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
Note: Performance numbers with stock PyTorch are measured with its most performant configuration.
Note: Environment variable DNNL_PRIMITIVE_CACHE_CAPACITY is set to 1024.
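For the BFloat16 rows on the 8380H, here is a minimal sketch of the usual v1.10-era flow: ipex.optimize with a BF16 dtype plus CPU autocast. The model setup is illustrative, not the Model Zoo script used for the published numbers.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased").eval()
model = ipex.optimize(model, dtype=torch.bfloat16)

input_ids = torch.randint(0, 30522, (1, 128))  # max_seq_len=128 as in the table
with torch.no_grad(), torch.cpu.amp.autocast():
    output = model(input_ids)
```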
Configuration
Software Version
Software | Version |
---|---|
PyTorch | v1.10.1 |
Intel® Extension for PyTorch* | v1.10.100 |
Hardware Configuration
3rd Generation Intel® Xeon® Scalable Processors | Products formerly Cooper Lake | |
---|---|---|
CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz |
Number of nodes | 1 | 1 |
Number of sockets | 2 | 2 |
Cores/Socket | 40 | 28 |
Threads/Core | 2 | 2 |
uCode | 0xd0002a0 | 0x700001c |
Hyper-Threading | ON | ON |
TurboBoost | ON | ON |
BIOS version | 04.12.02 | WLYDCRB1.SYS.0016.P29.2006080250 |
Number of DDR Memory slots | 16 | 12 |
Capacity of DDR memory per slot | 16GB | 64GB |
DDR frequency | 3200 | 3200 |
Total Memory/Node (DDR+DCPMM) | 256GB | 768GB |
Host OS | CentOS Linux release 8.4.2105 | Ubuntu 18.04.4 LTS |
Host Kernel | 4.18.0-305.10.2.el8_4.x86_64 | 4.15.0-76-generic |
Docker OS | Ubuntu 18.04.5 LTS | Ubuntu 18.04.5 LTS |
Spectre-Meltdown Mitigation | Mitigated | Mitigated |