Performance
===========

## Overview

This page shows the performance boost achieved with Intel® Extension for PyTorch\* on several popular topologies.

## Performance Data for Intel® AI Data Center Products

Find the latest performance data for 4th gen Intel® Xeon® Scalable processors and 3rd gen Intel® Xeon® processors, including detailed hardware and software configurations, in the [Intel® Developer Zone article](https://www.intel.com/content/www/us/en/developer/topic-technology/artificial-intelligence/performance.html).

## LLM Performance

We benchmarked LLaMA2 7B, LLaMA2 13B, and GPT-J 6B with test input token lengths of 256 and 1024. The tests were carried out on AWS M7i and M6i instances. M6i instances use 3rd Gen Intel® Xeon® processors, which lack the AMX instructions that accelerate BF16 computation, so we benchmarked with FP32 precision instead of BF16 on M6i instances.

![LLaMA2 7B Results](../../images/performance/m7i_m6i_comp_llama7b.png)

![LLaMA2 13B Results](../../images/performance/m7i_m6i_comp_llama13b.png)

![GPT-J 6B Results](../../images/performance/m7i_m6i_comp_gptj6b.png)

Comparing LLM inference performance on the two instance types based on the results above, M7i, with 4th Gen Xeon® processors, has a remarkable performance advantage over M6i, with 3rd Gen Xeon® processors.

M7i performance boost ratio over M6i for non-quantized (BF16 or FP32) models:

|            | Speedup | Throughput |
|:----------:|:-------:|:----------:|
| LLaMA2 7B  | 2.47x   | 2.62x      |
| LLaMA2 13B | 2.57x   | 2.62x      |
| GPT-J 6B   | 2.58x   | 2.85x      |

M7i performance boost ratio over M6i for INT8 quantized models:

|            | Speedup | Throughput |
|:----------:|:-------:|:----------:|
| LLaMA2 7B  | 1.27x   | 1.38x      |
| LLaMA2 13B | 1.27x   | 1.27x      |
| GPT-J 6B   | 1.29x   | 1.36x      |

We can also conclude that **a larger batch size increases the serving capacity of the model at the cost of longer response latency for individual sessions**. The following table shows that, for the INT8 quantized LLaMA2 7B model on M7i instances, batch_size=8 increases total throughput by 6.47x compared with batch_size=1, while P90 token latency becomes 1.26x longer.

| Batch size  | Decoder latency (P90) | Total tokens per sec |
|:-----------:|:---------------------:|:--------------------:|
| 1           | 39                    | 26.32                |
| 8           | 49                    | 170.21               |
| ***Ratio*** | 1.26x                 | 6.47x                |

*Note:* Measured by Intel on 17th Aug 2023; M7i.16xLarge and M6i.16xLarge instances in us-west-2; OS: Ubuntu 22.04 LTS, kernel 6.2.0-1009-aws; SW: PyTorch\* 2.1 and Intel® Extension for PyTorch\* 2.1/llm_feature_branch.
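For readers who want to set up a comparable BF16 run, the sketch below uses the generic `ipex.optimize` entry point with a Hugging Face causal-LM. It is a minimal sketch only: the model ID, prompt, and generation length are placeholders, and the benchmarks above used a dedicated llm_feature_branch build whose LLM-specific APIs may differ.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID; the benchmarks above used LLaMA2 7B/13B and GPT-J 6B.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# Apply Intel Extension for PyTorch optimizations for BF16 inference
# (BF16 benefits from AMX on 4th Gen Xeon / M7i; M6i was measured in FP32).
model = ipex.optimize(model, dtype=torch.bfloat16)

# Placeholder prompt; the benchmarks used input lengths of 256 and 1024 tokens.
inputs = tokenizer("A sample prompt for benchmarking.", return_tensors="pt")

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    # Larger batch sizes raise total throughput at the cost of per-session
    # latency, as the batch-size table above illustrates.
    output = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.batch_decode(output, skip_special_tokens=True))
```

On CPUs without AMX, such as the M6i instances above, the same flow can be run in the default FP32 precision by omitting the BF16 arguments.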
## INT8 with v1.11

### Performance Numbers

| Hardware | Workload¹ | Precision | Throughput Inference² Batch Size | Throughput Inference² Boost Ratio | Realtime Inference³ Batch Size | Realtime Inference³ Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | INT8 | 80 | 1.83x | 1 | 1.44x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | SSD-ResNet34 | INT8 | 80 | 2.16x | 1 | 1.83x | Computer Vision | COCO | [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | ResNext 32x16d | INT8 | 80 | 1.81x | 1 | 1.21x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | VGG-11 | INT8 | 80 | 1.75x | 1 | 1.19x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | ShuffleNetv2_x1.0 | INT8 | 80 | 2.07x | 1 | 1.47x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| | BERT-Large | INT8 | 80 | 2.78x | 1 | 2.04x | NLP | Squad | max_seq_len=384; Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
| | BERT-Base | INT8 | 80 | 2.05x | 1 | 1.96x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts |
| | DistilBERT-Base | INT8 | 80 | 2.12x | 1 | 1.57x | NLP | Squad | max_seq_len=384; Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with a single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
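The INT8 rows above correspond to statically quantized models. As a rough, non-authoritative illustration, the sketch below uses the static quantization flow from more recent Intel® Extension for PyTorch\* releases (the v1.11 calibration API differs); ResNet50 and a random calibration loader are placeholders.

```python
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert
import torchvision.models as models

# ResNet50 as in the table above; input shape [3, 224, 224].
model = models.resnet50(pretrained=True).eval()
example_input = torch.rand(1, 3, 224, 224)

# Default static quantization configuration (min/max observers).
qconfig = ipex.quantization.default_static_qconfig
prepared = prepare(model, qconfig, example_inputs=example_input, inplace=False)

# Calibration: run a handful of representative batches (random data as a placeholder).
with torch.no_grad():
    for _ in range(10):
        prepared(torch.rand(1, 3, 224, 224))

# Convert to INT8, then trace and freeze a TorchScript graph for deployment.
with torch.no_grad():
    quantized = convert(prepared)
    traced = torch.jit.freeze(torch.jit.trace(quantized, example_input))
    output = traced(example_input)
```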
*Note:* Performance numbers with stock PyTorch are measured with its most performant configuration.

*Note:* Environment variable *DNNL_PRIMITIVE_CACHE_CAPACITY* is set to *1024*.

### Accuracy
| Workload | Metric | FP32 | INT8 | INT8/FP32 |
| :-: | :-: | :-: | :-: | :-: |
| BERT-base_text_classification | f1 | 0.81 | 0.81 | 99.79% |
| BERT-Large | f1 | 93.16 | 93.02 | 99.85% |
| Distilbert-base | f1 | 86.84 | 86.13 | 99.19% |
| ResNet50 | Top1 | 76.15 | 75.98 | 99.78% |
| ResNext 32x16d | Top1 | 84.17 | 84.05 | 99.86% |
| SSD-ResNet34 | mAP | 0.200 | 0.199 | 99.48% |
| VGG11 | Top1 | 69.04 | 67.96 | 98.44% |
| Shufflenetv2_x1.0 | Top1 | 69.36 | 67.92 | 97.93%¹ |
1. ShuffleNet INT8 accuracy is expected to improve, without a performance trade-off, by using a histogram calibration algorithm.
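Footnote 1 refers to calibrating activations with a histogram-based observer rather than the default min/max observer. A possible qconfig built from stock PyTorch observers, usable with the quantization sketch above, is shown below; the observer settings are illustrative assumptions, not the exact recipe behind these numbers.

```python
import torch
from torch.ao.quantization import HistogramObserver, PerChannelMinMaxObserver, QConfig

# Histogram-based activation calibration; per-channel symmetric INT8 weights.
histogram_qconfig = QConfig(
    activation=HistogramObserver.with_args(reduce_range=False),
    weight=PerChannelMinMaxObserver.with_args(
        dtype=torch.qint8, qscheme=torch.per_channel_symmetric
    ),
)
# Pass histogram_qconfig to prepare(...) in place of default_static_qconfig.
```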
### Configuration

#### Software Version

| Software | Version |
| :-: | :-: |
| PyTorch | [v1.11.0](https://pytorch.org/get-started/locally/) |
| Intel® Extension for PyTorch\* | [v1.11.0](https://github.com/intel/intel-extension-for-pytorch/releases) |

#### Hardware Configuration

| | 3rd Generation Intel® Xeon® Scalable Processors |
| :-: | :-: |
| CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz |
| Number of nodes | 1 |
| Number of sockets | 2 |
| Cores/Socket | 40 |
| Threads/Core | 2 |
| uCode | 0xd0002a0 |
| Hyper-Threading | ON |
| TurboBoost | ON |
| BIOS version | 04.12.02 |
| Number of DDR Memory slots | 16 |
| Capacity of DDR memory per slot | 16GB |
| DDR frequency | 3200 |
| Total Memory/Node (DDR+DCPMM) | 256GB |
| Host OS | CentOS Linux release 8.4.2105 |
| Host Kernel | 4.18.0-305.10.2.el8\_4.x86\_64 |
| Docker OS | Ubuntu 18.04.5 LTS |
| [Spectre-Meltdown Mitigation](https://github.com/speed47/spectre-meltdown-checker) | Mitigated |

## FP32 with v1.11.200 on an AWS EC2 C6i.2xlarge instance

### Performance Numbers
| Hardware | Workload¹ | Precision | Throughput Inference² Batch Size | Throughput Inference² Boost Ratio | Realtime Inference³ Batch Size | Realtime Inference³ Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| AWS EC2 C6i.2xlarge | ResNet50 | Float32 | 64 | 1.24x | 1 | 1.31x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | ResNext 32x16d | Float32 | 64 | 1.07x | 1 | 1.05x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | VGG-11 | Float32 | 64 | 1.15x | 1 | 1.21x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | ShuffleNetv2_x1.0 | Float32 | 64 | 1.12x | 1 | 1.30x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| | MobileNet v2 | Float32 | 64 | 1.08x | 1 | 1.12x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| | BERT-Large | Float32 | 64 | 1.05x | 1 | 1.03x | NLP | Squad | max_seq_len=384; Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; recommend setting auto_kernel_selection to ON when seq_len exceeds 64 |
| | BERT-Base | Float32 | 64 | 1.08x | 1 | 1.09x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; recommend setting auto_kernel_selection to ON when seq_len exceeds 128 |
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with a single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
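The `auto_kernel_selection` setting referenced in the Tunable Parameters column is an argument of `ipex.optimize` in recent releases. A minimal sketch for an FP32 question-answering workload follows; the checkpoint and input text are placeholders, and the environment variable mirrors the note below.

```python
import os

# Matches the environment variable noted for these measurements.
os.environ["DNNL_PRIMITIVE_CACHE_CAPACITY"] = "1024"

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Placeholder checkpoint standing in for the BERT-Large SQuAD workload above.
model_id = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id).eval()

# auto_kernel_selection is recommended ON when seq_len exceeds 64 (see the table above).
model = ipex.optimize(model, dtype=torch.float32, auto_kernel_selection=True)

inputs = tokenizer(
    "What does the extension accelerate?",
    "Intel Extension for PyTorch accelerates inference on Intel CPUs.",
    max_length=384,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)
```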
*Note:* Performance numbers with stock PyTorch are measured with its most performant configuration.

*Note:* Environment variable *DNNL_PRIMITIVE_CACHE_CAPACITY* is set to *1024*.

### Configuration

#### Software Version

| Software | Version |
| :-: | :-: |
| PyTorch | [v1.11.0](https://pytorch.org/get-started/locally/) |
| Intel® Extension for PyTorch\* | [v1.11.200](https://github.com/intel/intel-extension-for-pytorch/releases) |

## FP32 and BFloat16 with v1.10

### Performance Numbers
| Hardware | Workload¹ | Precision | Throughput Inference² Batch Size | Throughput Inference² Boost Ratio | Realtime Inference³ Batch Size | Realtime Inference³ Boost Ratio | Model Type | Dataset | Input Data Shape | Tunable Parameters |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | ResNet50 | Float32 | 80 | 1.39x | 1 | 1.35x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | SSD-ResNet34 | Float32 | 160 | 1.55x | 1 | 1.06x | Computer Vision | COCO | [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | ResNext 32x16d | Float32 | 80 | 1.08x | 1 | 1.08x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | Faster R-CNN ResNet50 FPN | Float32 | 80 | 1.71x | 1 | 1.07x | Computer Vision | COCO | [3, 1200, 1200] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | VGG-11 | Float32 | 160 | 1.20x | 1 | 1.13x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | ShuffleNetv2_x1.0 | Float32 | 160 | 1.32x | 1 | 1.20x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| | MobileNet v2 | Float32 | 160 | 1.48x | 1 | 1.12x | Computer Vision | ImageNet | [3, 224, 224] | Default memory allocator; Intel(R) OpenMP |
| | DLRM | Float32 | 80 | 1.11x | 1 | - | Recommendation | Terabyte | - | Default memory allocator; Intel(R) OpenMP; inference scripts |
| | BERT-Large | Float32 | 80 | 1.14x | 1 | 1.02x | NLP | Squad | max_seq_len=384; Task: Question Answering | Default memory allocator; Intel(R) OpenMP; inference scripts; recommend setting auto_kernel_selection to ON when seq_len exceeds 64 |
| | BERT-Base | Float32 | 160 | 1.10x | 1 | 1.33x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts; recommend setting auto_kernel_selection to ON when seq_len exceeds 128 |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Large | BFloat16 | 56 | 1.67x | 1 | 1.45x | NLP | Squad | max_seq_len=384; Task: Question Answering | Jemalloc; Intel(R) OpenMP; inference scripts |
| | BERT-Base | BFloat16 | 112 | 1.77x | 1 | 1.18x | NLP | MRPC | max_seq_len=128; Task: Text Classification | Jemalloc; Intel(R) OpenMP; inference scripts |
1. Model Zoo for Intel® Architecture
2. Throughput inference runs with a single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.
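The BFloat16 rows above (on the Platinum 8380H, which supports native BF16) use the same `ipex.optimize` entry point with a BF16 dtype plus CPU autocast. A minimal sketch follows, with a placeholder BERT-Base checkpoint standing in for the MRPC text-classification workload:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; the table above measures BERT-Base on MRPC (max_seq_len=128).
model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

# BFloat16 optimization; effective on CPUs with native BF16 support.
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer(
    "The first sentence of the pair.",
    "The second sentence of the pair.",
    max_length=128,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    logits = model(**inputs).logits
```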
*Note:* Performance numbers with stock PyTorch are measured with its most performant configuration.

*Note:* Environment variable *DNNL_PRIMITIVE_CACHE_CAPACITY* is set to *1024*.

### Configuration

#### Software Version

| Software | Version |
| :-: | :-: |
| PyTorch | [v1.10.1](https://pytorch.org/get-started/locally/) |
| Intel® Extension for PyTorch\* | [v1.10.100](https://github.com/intel/intel-extension-for-pytorch/releases) |

#### Hardware Configuration

| | 3rd Generation Intel® Xeon® Scalable Processors | Products formerly Cooper Lake |
| :-: | :-: | :-: |
| CPU | Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz |
| Number of nodes | 1 | 1 |
| Number of sockets | 2 | 2 |
| Cores/Socket | 40 | 28 |
| Threads/Core | 2 | 2 |
| uCode | 0xd0002a0 | 0x700001c |
| Hyper-Threading | ON | ON |
| TurboBoost | ON | ON |
| BIOS version | 04.12.02 | WLYDCRB1.SYS.0016.P29.2006080250 |
| Number of DDR Memory slots | 16 | 12 |
| Capacity of DDR memory per slot | 16GB | 64GB |
| DDR frequency | 3200 | 3200 |
| Total Memory/Node (DDR+DCPMM) | 256GB | 768GB |
| Host OS | CentOS Linux release 8.4.2105 | Ubuntu 18.04.4 LTS |
| Host Kernel | 4.18.0-305.10.2.el8\_4.x86\_64 | 4.15.0-76-generic |
| Docker OS | Ubuntu 18.04.5 LTS | Ubuntu 18.04.5 LTS |
| [Spectre-Meltdown Mitigation](https://github.com/speed47/spectre-meltdown-checker) | Mitigated | Mitigated |