Validated Models

Intel® Neural Compressor has validated examples covering multiple compression techniques. Links to the typical examples can be found in the example tables, and the performance/accuracy results are available here.

  1. Validated Quantization Examples

    1.1. TensorFlow Models with Intel TensorFlow 2.13.0

    1.2. PyTorch Models with Torch 2.0.1+cpu in PTQ Mode

    1.3. PyTorch Models with Torch 2.0.1+cpu in QAT Mode

    1.4. PyTorch Models with Intel® Extension for PyTorch* 2.0.1+cpu

    1.5. PyTorch Models with Torch 2.0.1+cpu in WOQ Mode

    1.6. ONNX Models with ONNX Runtime 1.15.1

    1.7. ONNX Models with ONNX Runtime 1.15.0 in WOQ Mode

  2. Validated Pruning Examples

  3. Validated Knowledge Distillation Examples

  4. Validated ONNX QDQ INT8 Models on Multiple Hardware through ONNX Runtime

Validated Quantization Examples

System summary: Test by Intel on 09/01/2023. 1-node, 1x Intel(R) Xeon(R) Platinum 8480+ @3.8GHz, 56 cores/socket, HT On, Turbo On, Total Memory 256GB (16x16GB DDR5 4800 MT/s [4800 MT/s]), BIOS 3A14.TEL2P1, microcode 0x2b0001b0,
CentOS Stream 8, gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), DL Models, Frameworks: TensorFlow/ONNXRT/PyTorch, Datatype: FP32/INT8/BF16.
Most models are benchmarked using 1 socket, 4 cores/instance, 14 instances, and batch size 1.
Some large models are measured using 1 socket, 56 cores/instance, 1 instance, and batch size 1.
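The table headers in this section abbreviate these two setups as `1s4c14ins1bs` and `1s56c1ins1bs`. As a minimal sketch, assuming the labels follow the pattern sockets/cores per instance/instances/batch size (a reading inferred from the two sentences above, not a documented format), they can be decoded like this. The names `BenchmarkConfig` and `parse_label` are hypothetical helpers for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkConfig:
    sockets: int
    cores_per_instance: int
    instances: int
    batch_size: int

    @property
    def total_cores(self) -> int:
        # Cores occupied when every instance runs concurrently.
        return self.cores_per_instance * self.instances


# Assumed label pattern, inferred from "1s4c14ins1bs" and "1s56c1ins1bs".
_LABEL = re.compile(r"(\d+)s(\d+)c(\d+)ins(\d+)bs")


def parse_label(label: str) -> BenchmarkConfig:
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized benchmark label: {label!r}")
    return BenchmarkConfig(*(int(group) for group in match.groups()))


small = parse_label("1s4c14ins1bs")
large = parse_label("1s56c1ins1bs")
print(small, small.total_cores)
print(large, large.total_cores)
```

Under this reading, both setups occupy all 56 cores of the single socket: 14 instances of 4 cores each, or 1 instance of 56 cores.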

Performance varies by use, configuration and other factors.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

TensorFlow Models with Intel TensorFlow 2.13.0

Model | Example | INT8 Accuracy | FP32 Accuracy | Accuracy Ratio [(INT8-FP32)/FP32] | INT8 Throughput (samples/sec, 1s4c14ins1bs) | FP32 Throughput (samples/sec, 1s4c14ins1bs) | Performance Ratio [INT8/FP32]
ResNet50 v1.0 pb 74.12% 74.27% -0.21% 2914.42 621.91 4.69x
ResNet50 v1.5 pb 76.23% 76.46% -0.31% 2160.07 545.47 3.96x
ResNet101 pb 77.50% 76.45% 1.37% 1508.97 428.02 3.53x
Inception V1 pb 70.44% 69.74% 1.01% 3290.75 1229.78 2.68x
Inception V2 pb 74.38% 73.97% 0.57% 2404.57 1048.49 2.29x
Inception V3 pb 76.71% 76.75% -0.05% 1669.09 500.95 3.33x
Inception V4 pb 80.18% 80.27% -0.11% 1073.14 245.13 4.38x
Inception ResNet V2 pb 80.34% 80.40% -0.07% 374.52 172.06 2.18x
MobileNet V1 pb 71.78% 70.96% 1.16% 5478.88 1756.33 3.12x
MobileNet V2 pb 72.52% 71.76% 1.07% 4133.01 1748.06 2.36x
VGG16 pb 72.64% 70.89% 2.47% 1534.50 236.62 6.49x
VGG19 pb 72.69% 71.01% 2.37% 1377.40 197.77 6.96x
ResNetV2 50 pb 70.39% 69.64% 1.07% 1125.32 656.38 1.71x
ResNetV2 101 pb 72.62% 71.87% 1.04% 709.50 367.00 1.93x
ResNetV2 152 pb 73.11% 72.37% 1.03% 497.24 265.34 1.87x
Densenet 121 pb 73.59% 72.89% 0.97% 557.67 456.61 1.22x
Densenet 161 pb 76.35% 76.29% 0.08% 353.18 235.35 1.50x
Densenet 169 pb 74.34% 74.65% -0.41% 435.44 385.73 1.13x
EfficientNet B0 ckpt 76.15% 76.76% -0.79% 786.55 723.69 1.09x
SSD ResNet50 V1 pb 37.88% 38.00% -0.31% 130.09 30.78 4.23x
SSD MobileNet V1 pb 22.98% 23.13% -0.64% 1291.02 683.50 1.89x
SSD ResNet50 v1 ckpt 37.89% 38.00% -0.30% 127.30 27.63 4.61x
SSD MobileNet v1 ckpt 22.96% 23.13% -0.72% 1295.23 453.76 2.85x
SSD ResNet34 pb 21.70% 22.09% -1.76% 242.91 14.03 17.31x
Faster R-CNN Inception ResNet V2 pb 37.47% 38.31% -2.18% 5.44 3.02 1.80x
Faster R-CNN Inception ResNet V2 SavedModel 37.79% 38.31% -1.34% 5.43 3.00 1.81x
Faster R-CNN ResNet101 pb 30.32% 30.39% -0.23% 166.37 23.54 7.07x
Faster R-CNN ResNet101 SavedModel 30.33% 30.39% -0.20% 151.54 18.58 8.16x
Faster R-CNN ResNet50 pb 26.64% 26.59% 0.21% 173.33 28.58 6.07x
YOLOv3 pb 82.13% 82.35% -0.28% 230.69 88.35 2.61x
BERT large SQuAD pb 92.36 92.99 -0.67% 59.76 17.71 3.37x
BERT large SQuAD (ONNX Model Zoo) pb 92.26 92.98 -0.78% 41.65 16.14 2.58x
BERT base MRPC ckpt 87.01% 86.52% 0.57% 416.57 177.06 2.35x
Transformer LT pb 25.68 25.86 -0.67% 41.19 21.94 1.88x
Transformer lt MLPerf pb 27.27 27.17 0.39% 9.77 4.51 2.17x
Wide Deep large DS pb 77.75% 77.67% 0.10% 75552.26 50803.82 1.49x
Model | Example | INT8 Accuracy | FP32 Accuracy | Accuracy Ratio [(INT8-FP32)/FP32] | INT8 Throughput (samples/sec, 1s56c1ins1bs) | FP32 Throughput (samples/sec, 1s56c1ins1bs) | Performance Ratio [INT8/FP32]
Mask R-CNN Inception V2 pb 28.60% 28.73% -0.44% 41.96 25.66 1.64x
Mask R-CNN Inception V2 ckpt 28.60% 28.73% -0.44% 41.56 24.35 1.71x
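The two ratio columns follow the formulas given in the header: accuracy ratio is (INT8-FP32)/FP32 and performance ratio is INT8/FP32. A minimal sketch that recomputes both for the ResNet101 row is shown below; the function names are hypothetical. Because the published ratios may have been derived from unrounded measurements, recomputing other rows from the rounded figures can differ in the last digit.

```python
def accuracy_ratio(int8_acc: float, fp32_acc: float) -> float:
    """Relative accuracy change of INT8 versus FP32: (INT8 - FP32) / FP32."""
    return (int8_acc - fp32_acc) / fp32_acc


def performance_ratio(int8_throughput: float, fp32_throughput: float) -> float:
    """Throughput speedup of INT8 over FP32: INT8 / FP32."""
    return int8_throughput / fp32_throughput


# ResNet101 pb row: accuracy 77.50% vs 76.45%, throughput 1508.97 vs 428.02 samples/sec.
acc = accuracy_ratio(77.50, 76.45)
perf = performance_ratio(1508.97, 428.02)

print(f"accuracy ratio:    {acc:.2%}")
print(f"performance ratio: {perf:.2f}x")
```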

PyTorch Models with Torch 2.0.1+cpu in PTQ Mode

Model | Example | INT8 Accuracy | FP32 Accuracy | Accuracy Ratio [(INT8-FP32)/FP32] | INT8 Throughput (samples/sec, 1s4c14ins1bs) | FP32 Throughput (samples/sec, 1s4c14ins1bs) | Performance Ratio [INT8/FP32]
ResNet18 static 69.61% 69.76% -0.22% 1673.05 653.13 2.56x
ResNet50 static 75.92% 76.15% -0.30% 1170.62 329.70 3.55x
Inception V3 static 69.47% 69.52% -0.07% 977.08 335.55 2.91x
ResNeSt50 static 80.80% 81.04% -0.30% 404.51 40.04 10.10x
ResNeXt101_32x8d static 78.94% 79.31% -0.46% 562.16 109.77 5.12x
Efficientnet_b0 static 76.89% 77.67% -1.01% 696.79 667.27 1.04x
Efficientnet_b3 static 77.82% 78.54% -0.93% 508.85 397.32 1.28x
Efficientnet_b7 static 73.55% 73.92% -0.50% 234.87 149.65 1.57x
Peleenet static 71.85% 72.10% -0.35% 858.18 588.33 1.46x
SE_ResNeXt50_32x4d static 79.03% 79.08% -0.07% 739.61 283.60 2.61x
YOLO V3 static 55.09% 54.93% 0.31% 161.92 60.48 2.68x
SSD ResNet34 static 19.52 19.63 -0.58% 141.26 11.78 11.99x
Roberta base MRPC static 92.69% 93.59% -0.96% 404.62 174.02 2.33x
CamemBERT base MRPC static 88.93% 89.28% -0.39% 395.08 171.78 2.30x
DistilBERT base MRPC static 89.53% 90.27% -0.82% 795.98 341.60 2.33x
DistilBERT base MRPC dynamic 90.20% 90.27% -0.07% 744.78 343.36 2.17x
ALBERT base MRPC static 92.63% 92.63% 0.00% 374.41 163.39 2.29x
Funnel MRPC static 91.60% 92.25% -0.71% 300.02 182.21 1.65x
Xlm Roberta MRPC static 88.36% 88.62% -0.29% 399.27 173.62 2.30x
Xlm Roberta MRPC dynamic 88.24% 88.24% 0.00% 385.00 174.37 2.21x
BERT base MRPC static 89.63% 90.42% -0.87% 407.79 173.24 2.35x
BERT base COLA static 54.51% 53.39% 2.10% 412.12 172.97 2.38x
BERT base STSB static 87.55% 88.05% -0.57% 413.19 173.17 2.39x
BERT base SST-2 static 91.51% 92.32% -0.87% 409.94 172.77 2.37x
BERT large COLA static 62.84% 63.35% -0.80% 141.90 51.55 2.75x
BERT base RTE static 72.56% 72.56% 0.00% 401.42 174.02 2.31x
BERT large MRPC static 90.22% 90.38% -0.17% 139.59 51.66 2.70x
BERT large QNLI static 90.87% 91.54% -0.74% 406.48 172.94 2.35x
BERT large RTE static 73.29% 74.01% -0.98% 141.92 51.41 2.76x
BERT large RTE dynamic 71.48% 74.01% -3.41% 128.46 51.61 2.49x
BERT large SQuAD static 92.27 93.16 -0.95% 37.59 16.48 2.28x
Reformer Crime and Punishment static 1.88 1.87 0.23% 446.29 398.25 1.12x
lvwerra/pegasus-samsum static 42.50 42.67 -0.39% 102.63 37.94 2.71x
T5 Small dynamic 2.65 3.16 -16.25% 770.18 450.79 1.71x
Model | Example | INT8 Accuracy | FP32 Accuracy | Accuracy Ratio [(INT8-FP32)/FP32] | INT8 Throughput (samples/sec, 1s56c1ins1bs) | FP32 Throughput (samples/sec, 1s56c1ins1bs) | Performance Ratio [INT8/FP32]
EleutherAI/gpt-j-6B static 3.36 2.34 43.85% 0.88 0.28 3.14x
openai/whisper-large dynamic 97.07% 96.96% 0.12% 0.59 0.47 1.25x
abeja/gpt-neox-japanese-2.7b static 4.30 3.52 22.06% 1.04 0.55 1.90x

PyTorch Models with Torch 2.0.1+cpu in QAT Mode

Model | Example | INT8 Accuracy | FP32 Accuracy | Accuracy Ratio [(INT8-FP32)/FP32] | INT8 Throughput (samples/sec, 1s4c14ins1bs) | FP32 Throughput (samples/sec, 1s4c14ins1bs) | Performance Ratio [INT8/FP32]
ResNet18 static 69.74% 69.76% -0.03% 1646.74 657.43 2.50x
ResNet50 static 76.05% 76.15% -0.12% 1098.80 322.34 3.41x
ResNeXt101_32x8d static 79.28% 79.31% -0.04% 568.02 109.50 5.19x
MobileNet V2 static 69.73% 71.84% -2.93% 1383.77 761.35 1.82x
BERT base MRPC static 89.50% 90.40% -1.00% 401.83 173.17 2.32x

PyTorch Models with Intel® Extension for PyTorch* 2.0.1+cpu

Model | Example | INT8 Accuracy | FP32 Accuracy | Accuracy Ratio [(INT8-FP32)/FP32] | INT8 Throughput (samples/sec, 1s4c14ins1bs) | FP32 Throughput (samples/sec, 1s4c14ins1bs) | Performance Ratio [INT8/FP32]
ResNet18 static 69.56% 69.76% -0.29% 5701.04 1593.88 3.58x
ResNet50 static 75.98% 76.15% -0.22% 2090.03 685.29 3.05x
ResNeXt101_32x16d_wsl static 84.04% 84.17% -0.15% 556.86 79.42 7.01x
SSD ResNet34 static 19.93% 20.00% -0.38% 91.53 15.62 5.86x
bert-large-uncased-whole-word-masking-finetuned-squad static 92.93 93.16 -0.25% 162.94 22.37 7.29x
distilbert-base-uncased-distilled-squad static 86.09 86.84 -0.86% 558.66 151.25 3.69x
Model | Example | INT8 Accuracy | FP32 Accuracy | Accuracy Ratio [(INT8-FP32)/FP32] | INT8 Throughput (samples/sec, 1s56c1ins1bs) | FP32 Throughput (samples/sec, 1s56c1ins1bs) | Performance Ratio [INT8/FP32]
EleutherAI/gpt-j-6B static 78.70% 79.20% -0.63% 4.88 1.57 3.11x

PyTorch Models with Torch 2.0.1+cpu in WOQ Mode

Model name | Configuration | Lambada_openai Accuracy | Hellaswag Accuracy | Winogrande Accuracy | Piqa Accuracy | Average Accuracy [mean accuracy of previous four tasks] | Accuracy Ratio [INT4/FP32] | Wikitext Word_perplexity
EleutherAI/gpt-j-6b | FP32 | 0.6831 | 0.4954 | 0.6409 | 0.7541 | 0.6434 | / | 10.8816
EleutherAI/gpt-j-6b | GPTQ W4G128Asym | 0.679 | 0.4895 | 0.6433 | 0.7476 | 0.6399 | 0.9945 | 11.0999
EleutherAI/gpt-j-6b | GPTQ W4G32Asym | 0.6829 | 0.4923 | 0.6401 | 0.7486 | 0.6410 | 0.9963 | 11.0141
EleutherAI/gpt-j-6b | GPTQ W4G128Sym | 0.685 | 0.4907 | 0.6361 | 0.7443 | 0.6390 | 0.9932 | 11.1498
EleutherAI/gpt-j-6b | GPTQ W4G32Sym | 0.6911 | 0.4899 | 0.6448 | 0.7497 | 0.6439 | 1.0008 | 11.0927
facebook/opt-6.7b | FP32 | 0.6769 | 0.5049 | 0.6543 | 0.7628 | 0.6497 | / | 12.2862
facebook/opt-6.7b | GPTQ W4G32Asym | 0.6804 | 0.4984 | 0.6535 | 0.7568 | 0.6473 | 0.9962 | 12.4193
facebook/opt-6.7b | GPTQ W4G32Sym | 0.6885 | 0.4973 | 0.6433 | 0.753 | 0.6455 | 0.9935 | 12.4607
decapoda-research/llama-7b-hf | FP32 | 0.7361 | 0.5642 | 0.6709 | 0.7835 | 0.6887 | / | 9.4202
decapoda-research/llama-7b-hf | GPTQ W4G32Asym | 0.7244 | 0.5603 | 0.6614 | 0.7835 | 0.6824 | 0.9909 | 9.5881
decapoda-research/llama-13b-hf | FP32 | 0.7627 | 0.5911 | 0.7009 | 0.7878 | 0.7106 | / | 8.212
decapoda-research/llama-13b-hf | GPTQ W4G128Asym | 0.7518 | 0.5843 | 0.6961 | 0.7911 | 0.7058 | 0.9932 | 8.4319
decapoda-research/llama-13b-hf | GPTQ W4G32Asym | 0.7572 | 0.5898 | 0.7056 | 0.7894 | 0.7105 | 0.9998 | 8.3429
decapoda-research/llama-13b-hf | GPTQ W4G128Sym | 0.7596 | 0.5841 | 0.6977 | 0.7905 | 0.7080 | 0.9963 | 8.4916
decapoda-research/llama-30b-hf | FP32 | 0.7759 | 0.6266 | 0.7277 | 0.8096 | 0.7350 | / | 6.2384
decapoda-research/llama-30b-hf | GPTQ W4G128Asym | 0.778 | 0.624 | 0.7269 | 0.8047 | 0.7334 | 0.9979 | 6.4237
decapoda-research/llama-30b-hf | GPTQ W4G32Asym | 0.7706 | 0.6239 | 0.7285 | 0.8058 | 0.7322 | 0.9963 | 6.4697
decapoda-research/llama-30b-hf | GPTQ W4G128Sym | 0.7836 | 0.6195 | 0.7269 | 0.8047 | 0.7337 | 0.9983 | 6.5604
meta-llama/Llama-2-7b-chat-hf | FP32 | 0.7058 | 0.5732 | 0.648 | 0.7715 | 0.6746 | / | 11.7107
meta-llama/Llama-2-7b-chat-hf | GPTQ W4G128Asym | 0.6982 | 0.5637 | 0.6527 | 0.7704 | 0.6713 | 0.9950 | 11.9702
meta-llama/Llama-2-7b-chat-hf | GPTQ W4G32Asym | 0.6953 | 0.5682 | 0.6575 | 0.7758 | 0.6742 | 0.9994 | 11.9317
meta-llama/Llama-2-7b-hf | FP32 | 0.7392 | 0.567 | 0.6709 | 0.7835 | 0.6902 | / | 8.7911
meta-llama/Llama-2-7b-hf | GPTQ W4G32Asym | 0.7353 | 0.5642 | 0.6622 | 0.7829 | 0.6862 | 0.9942 | 8.9635
meta-llama/Llama-2-7b-hf | GPTQ W4G128Sym | 0.7246 | 0.5617 | 0.6756 | 0.7797 | 0.6854 | 0.9931 | 9.2799
meta-llama/Llama-2-13b-chat-hf | FP32 | 0.7312 | 0.6059 | 0.7103 | 0.7835 | 0.7077 | / | 10.2213
meta-llama/Llama-2-13b-chat-hf | GPTQ W4G128Asym | 0.7273 | 0.6018 | 0.7088 | 0.7742 | 0.7030 | 0.9934 | 2538.083
meta-llama/Llama-2-13b-chat-hf | GPTQ W4G32Asym | 0.7283 | 0.6053 | 0.7024 | 0.7764 | 0.7031 | 0.9935 | 1889.374
meta-llama/Llama-2-13b-chat-hf | GPTQ W4G128Sym | 0.727 | 0.5997 | 0.7024 | 0.778 | 0.7018 | 0.9916 | 2504.497
meta-llama/Llama-2-13b-hf | FP32 | 0.7677 | 0.5972 | 0.6961 | 0.7878 | 0.7122 | / | 7.8984
meta-llama/Llama-2-13b-hf | GPTQ W4G128Asym | 0.7627 | 0.5933 | 0.689 | 0.7851 | 0.7075 | 0.9934 | 1556.448
meta-llama/Llama-2-13b-hf | GPTQ W4G32Asym | 0.7675 | 0.5934 | 0.6977 | 0.7856 | 0.7111 | 0.9984 | 1514.927
meta-llama/Llama-2-13b-hf | GPTQ W4G128Sym | 0.7566 | 0.5899 | 0.7032 | 0.7856 | 0.7088 | 0.9953 | 1374.728
bigscience/bloom-7b1 | FP32 | 0.5764 | 0.4628 | 0.6456 | 0.7269 | 0.6029 | / | 30.6438
bigscience/bloom-7b1 | GPTQ W4G32Sym | 0.5799 | 0.4542 | 0.6361 | 0.7312 | 0.6004 | 0.9957 | 32.0626
bigscience/bloomz-7b1 | FP32 | 0.5593 | 0.4789 | 0.6527 | 0.7628 | 0.6134 | / | 51.7432
bigscience/bloomz-7b1 | GPTQ W4G32Asym | 0.5525 | 0.4731 | 0.6504 | 0.7617 | 0.6094 | 0.9935 | 52.7828
databricks/dolly-v1-6b | FP32 | 0.6866 | 0.5098 | 0.6433 | 0.7622 | 0.6505 | / | 11.3242
databricks/dolly-v1-6b | GPTQ W4G128Asym | 0.6878 | 0.5058 | 0.6393 | 0.7633 | 0.6491 | 0.9978 | 11.5514
databricks/dolly-v1-6b | GPTQ W4G32Asym | 0.6864 | 0.5084 | 0.6519 | 0.7568 | 0.6509 | 1.0006 | 11.4728
databricks/dolly-v1-6b | GPTQ W4G128Sym | 0.6876 | 0.5045 | 0.6433 | 0.7541 | 0.6474 | 0.9952 | 11.6474
databricks/dolly-v2-7b | FP32 | 0.6379 | 0.5282 | 0.614 | 0.7448 | 0.6312 | / | 16.161
databricks/dolly-v2-7b | GPTQ W4G32Asym | 0.6377 | 0.5228 | 0.5991 | 0.7448 | 0.6261 | 0.9919 | 16.4096
EleutherAI/gpt-neo-2.7b | FP32 | 0.6224 | 0.4271 | 0.577 | 0.722 | 0.5871 | / | 13.9359
EleutherAI/gpt-neo-2.7b | GPTQ W4G128Asym | 0.6123 | 0.4227 | 0.5738 | 0.7203 | 0.5823 | 0.9917 | 14.3377
EleutherAI/gpt-neo-2.7b | GPTQ W4G32Asym | 0.615 | 0.4259 | 0.5714 | 0.7247 | 0.5843 | 0.9951 | 14.2083
EleutherAI/gpt-neo-2.7b | GPTQ W4G32Sym | 0.6154 | 0.4208 | 0.5777 | 0.7198 | 0.5834 | 0.9937 | 14.3121
EleutherAI/gpt-neox-20b | FP32 | 0.7233 | 0.5359 | 0.6614 | 0.7753 | 0.6740 | / | 9.195
EleutherAI/gpt-neox-20b | GPTQ W4G128Asym | 0.7186 | 0.5328 | 0.6535 | 0.7699 | 0.6687 | 0.9922 | 9.3463
EleutherAI/gpt-neox-20b | GPTQ W4G32Asym | 0.7268 | 0.533 | 0.659 | 0.7715 | 0.6726 | 0.9979 | 9.2897
mosaicml/mpt-7b | FP32 | 0.7056 | 0.5718 | 0.6859 | 0.7927 | 0.6890 | / | 9.9324
mosaicml/mpt-7b | GPTQ W4G128Asym | 0.7006 | 0.5655 | 0.6803 | 0.7965 | 0.6857 | 0.9952 | 10.1515
mosaicml/mpt-7b-chat | FP32 | 0.655 | 0.5752 | 0.6748 | 0.7845 | 0.6724 | / | 13.5951
mosaicml/mpt-7b-chat | GPTQ W4G128Asym | 0.6472 | 0.5716 | 0.6685 | 0.784 | 0.6678 | 0.9932 | 13.8539
mosaicml/mpt-7b-instruct | FP32 | 0.6918 | 0.5819 | 0.678 | 0.7927 | 0.6861 | / | 10.8863
mosaicml/mpt-7b-instruct | GPTQ W4G128Asym | 0.6864 | 0.5765 | 0.6827 | 0.7873 | 0.6832 | 0.9958 | 11.1451
mosaicml/mpt-7b-storywriter | FP32 | 0.693 | 0.5477 | 0.663 | 0.784 | 0.6719 | / | 9.9125
mosaicml/mpt-7b-storywriter | GPTQ W4G128Asym | 0.6854 | 0.5443 | 0.6661 | 0.7813 | 0.6693 | 0.9961 | 10.1137
tiiuae/falcon-rw-7b | FP32 | 0.6604 | 0.5419 | 0.6598 | 0.7753 | 0.6594 | / | 11.7616
tiiuae/falcon-rw-7b | GPTQ W4G128Asym | 0.6484 | 0.5369 | 0.6575 | 0.7807 | 0.6559 | 0.9947 | 11.9411
tiiuae/falcon-rw-7b | GPTQ W4G32Asym | 0.6571 | 0.5398 | 0.6582 | 0.7764 | 0.6579 | 0.9978 | 11.8809
tiiuae/falcon-rw-7b | GPTQ W4G128Sym | 0.652 | 0.535 | 0.6575 | 0.7682 | 0.6532 | 0.9906 | 12.0048
tiiuae/falcon-7b-instruct | FP32 | 0.6437 | 0.5177 | 0.6669 | 0.7824 | 0.6527 | / | 14.5053
tiiuae/falcon-7b-instruct | GPTQ W4G128Asym | 0.6301 | 0.5142 | 0.6654 | 0.7835 | 0.6483 | 0.9933 | 14.8146
tiiuae/falcon-7b-instruct | GPTQ W4G32Asym | 0.6377 | 0.517 | 0.6598 | 0.7807 | 0.6488 | 0.9941 | 14.6953
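The Average column is described as the mean accuracy of the four preceding tasks, and the Accuracy Ratio column as INT4/FP32. A minimal sketch that recomputes both for EleutherAI/gpt-j-6b with the GPTQ W4G128Asym configuration follows; it assumes the ratio is taken between the two averages, which matches the published figures for this row but is not stated explicitly in the table.

```python
from statistics import fmean

# Per-task accuracies in column order:
# Lambada_openai, Hellaswag, Winogrande, Piqa (EleutherAI/gpt-j-6b).
fp32_scores = [0.6831, 0.4954, 0.6409, 0.7541]
int4_scores = [0.679, 0.4895, 0.6433, 0.7476]  # GPTQ W4G128Asym

fp32_average = fmean(fp32_scores)
int4_average = fmean(int4_scores)

# Assumption: the ratio column divides the INT4 average by the FP32 average.
accuracy_ratio = int4_average / fp32_average

print(f"FP32 average:   {fp32_average:.4f}")
print(f"INT4 average:   {int4_average:.4f}")
print(f"accuracy ratio: {accuracy_ratio:.4f}")
```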

ONNX Models with ONNX Runtime 1.15.1

Model | Example | INT8 Accuracy | FP32 Accuracy | Accuracy Ratio [(INT8-FP32)/FP32] | INT8 Throughput (samples/sec, 1s4c14ins1bs) | FP32 Throughput (samples/sec, 1s4c14ins1bs) | Performance Ratio [INT8/FP32]
ResNet50 V1.5 qlinearops 72.16% 72.29% -0.19% 1566.70 724.89 2.16x
ResNet50 V1.5 qdq 72.14% 72.29% -0.22% 1567.15 716.57 2.19x
ResNet50 V1.5 MLPerf qlinearops 76.11% 76.46% -0.46% 1414.92 718.25 1.97x
ResNet50 V1.5 MLPerf qdq 76.13% 76.46% -0.44% 1459.45 721.54 2.02x
ResNet50 V1.5 (ONNX Model Zoo) qlinearops 74.82% 74.99% -0.22% 1593.71 753.89 2.11x
ResNet50 V1.5 (ONNX Model Zoo) qdq 74.82% 74.99% -0.23% 1582.24 752.38 2.10x
MobileNet V2 qlinearops 65.49% 66.89% -2.09% 7139.93 4289.29 1.66x
MobileNet V2 qdq 65.49% 66.89% -2.10% 7335.80 4080.31 1.80x
MobileNet V2 (ONNX Model Zoo) qlinearops 68.38% 69.48% -1.59% 7236.84 4299.29 1.68x
MobileNet V2 (ONNX Model Zoo) qdq 68.38% 69.48% -1.59% 6842.58 4496.44 1.52x
VGG16 qlinearops 66.56% 66.69% -0.19% 591.43 178.91 3.31x
VGG16 qdq 66.59% 66.69% -0.15% 614.91 183.79 3.35x
VGG16 (ONNX Model Zoo) qlinearops 72.33% 72.40% -0.09% 590.04 182.90 3.23x
VGG16 (ONNX Model Zoo) qdq 72.33% 72.40% -0.09% 614.75 179.93 3.42x
MobileNet V3 MLPerf qlinearops 75.56% 75.74% -0.24% 5703.81 2578.80 2.21x
MobileNet V3 MLPerf qdq 75.56% 75.74% -0.24% 5610.37 2603.41 2.16x
ShuffleNet V2 (ONNX Model Zoo) qlinearops 66.09% 66.36% -0.41% 6689.57 3690.63 1.81x
ShuffleNet V2 (ONNX Model Zoo) qdq 66.09% 66.36% -0.41% 5692.38 3758.23 1.51x
GoogleNet (ONNX Model Zoo) qlinearops 67.71% 67.79% -0.12% 1792.52 1111.26 1.61x
GoogleNet (ONNX Model Zoo) qdq 67.73% 67.79% -0.09% 1821.10 1104.52 1.65x
SqueezeNet (ONNX Model Zoo) qlinearops 56.54% 56.87% -0.57% 9472.72 5582.40 1.70x
SqueezeNet (ONNX Model Zoo) qdq 56.54% 56.87% -0.57% 9861.50 5566.72 1.77x
CaffeNet (ONNX Model Zoo) qlinearops 56.21% 56.30% -0.16% 3348.37 1141.01 2.93x
CaffeNet (ONNX Model Zoo) qdq 56.25% 56.30% -0.09% 3509.70 1142.19 3.07x
AlexNet (ONNX Model Zoo) qlinearops 54.73% 54.79% -0.10% 2426.58 987.34 2.46x
AlexNet (ONNX Model Zoo) qdq 54.71% 54.79% -0.14% 2208.63 1016.53 2.17x
ZFNet (ONNX Model Zoo) qlinearops 55.84% 55.96% -0.21% 930.06 532.61 1.75x
ZFNet (ONNX Model Zoo) qdq 55.86% 55.96% -0.18% 919.83 417.00 2.21x
Inception V1 (ONNX Model Zoo) qlinearops 67.21% 67.24% -0.05% 1880.94 1159.97 1.62x
Inception V1 (ONNX Model Zoo) qdq 67.21% 67.24% -0.05% 1798.96 1151.37 1.56x
EfficientNet (ONNX Model Zoo) qlinearops 76.98% 77.11% -0.17% 2890.97 1380.23 2.09x
EfficientNet (ONNX Model Zoo) qdq 76.99% 77.11% -0.16% 2548.20 1362.69 1.87x
DenseNet (ONNX Model Zoo) qlinearops 60.53% 60.96% -0.70% 657.12 507.94 1.29x
SSD (ONNX Model Zoo) qlinearops 18.47% 18.98% -2.69% 57.63 14.64 3.94x
SSD (ONNX Model Zoo) qdq 18.62% 18.98% -1.89% 56.96 14.58 3.91x
SSD MobileNet V1 qlinearops 22.44% 23.10% -2.86% 1286.79 904.83 1.42x
SSD MobileNet V1 qdq 22.44% 23.10% -2.86% 1121.02 856.82 1.31x
SSD MobileNet V1 (ONNX Model Zoo) qlinearops 22.96% 23.02% -0.27% 1098.80 829.55 1.32x
SSD MobileNet V1 (ONNX Model Zoo) qdq 22.96% 23.02% -0.27% 1044.34 790.39 1.32x
SSD MobileNet V2 qlinearops 23.87% 24.67% -3.25% 849.89 627.62 1.35x
YOLOv3 (ONNX Model Zoo) qlinearops 27.01% 28.73% -5.99% 66.22 83.98 0.79x
YOLOv4 (ONNX Model Zoo) qlinearops 32.30% 33.71% -4.19% 70.87 66.16 1.07x
DUC (ONNX Model Zoo) qlinearops 81.63% 81.92% -0.36% 9.15 4.90 1.87x
Tiny YOLOv3 (ONNX Model Zoo) qlinearops 11.74% 12.42% -5.48% 1119.16 161.90 6.91x
Ultra Face (ONNX Model Zoo) qlinearops 83.17% 83.65% -0.57% 8537.50 1934.53 4.41x
Emotion FERPlus (ONNX Model Zoo) qlinearops 7.97% 8.00% -0.35% 3568.69 3121.38 1.14x
ArcFace (ONNX Model Zoo) qlinearops 99.80% 99.80% 0.00% 494.07 244.21 2.02x
BERT base MRPC qlinearops 85.54% 86.03% -0.57% 398.76 226.09 1.76x
BERT base MRPC qdq 85.54% 86.03% -0.57% 392.94 223.06 1.76x
BERT base MRPC integerops 85.29% 86.03% -0.85% 473.72 223.12 2.12x
DistilBERT base MRPC qdq 84.07% 84.56% -0.58% 548.57 400.62 1.37x
DistilBERT base MRPC integerops 85.54% 84.56% 1.16% 964.62 400.86 2.41x
Mobile bert MRPC qdq 85.54% 86.28% -0.85% 540.59 394.98 1.37x
Mobile bert MRPC integerops 85.54% 86.28% -0.85% 602.34 397.35 1.52x
Roberta base MRPC integerops 90.93% 89.95% 1.09% 487.62 222.08 2.20x
BERT SQuAD (ONNX Model Zoo) integerops 80.29 80.67 -0.47% 189.27 97.40 1.94x
MobileBERT SQuAD MLPerf (ONNX Model Zoo) integerops 89.87 90.03 -0.17% 146.72 125.33 1.17x
BiDAF (ONNX Model Zoo) integerops 65.93% 66.08% -0.23% 2757.59 2277.14 1.21x
GPT2 lm head WikiText (ONNX Model Zoo) integerops 31.98 29.00 10.31% 15.47 9.78 1.58x
BERT base cased MRPC (HuggingFace) qlinearops 90.21% 90.42% -0.23% 360.90 212.41 1.70x
BERT base uncased MRPC (HuggingFace) integerops 89.58% 90.42% -0.93% 484.68 212.34 2.28x
Roberta base MRPC (HuggingFace) qlinearops 91.00% 91.38% -0.41% 353.24 213.83 1.65x
Roberta base MRPC (HuggingFace) integerops 90.85% 91.38% -0.58% 490.42 212.57 2.31x
XLM Roberta base MRPC (HuggingFace) qlinearops 89.37% 90.10% -0.81% 304.10 214.51 1.42x
XLM Roberta base MRPC (HuggingFace) integerops 89.66% 90.10% -0.50% 347.25 214.13 1.62x
Camembert base MRPC (HuggingFace) qlinearops 89.28% 89.28% 0.00% 272.62 216.98 1.26x
Camembert base MRPC (HuggingFace) integerops 89.19% 89.28% -0.10% 489.58 216.06 2.27x
MiniLM L12 H384 uncased MRPC (HuggingFace) qlinearops 90.13% 90.97% -0.93% 1054.31 585.78 1.80x
MiniLM L12 H384 uncased MRPC (HuggingFace) integerops 91.07% 90.97% 0.10% 1072.47 590.03 1.82x
DistilBERT base uncased SST-2 (HuggingFace) qlinearops 90.71% 91.06% -0.38% 890.23 398.72 2.23x
DistilBERT base uncased SST-2 (HuggingFace) integerops 90.25% 91.06% -0.88% 746.66 397.78 1.88x
Albert base v2 SST-2 (HuggingFace) qlinearops 92.09% 92.32% -0.25% 268.37 211.96 1.27x
Albert base v2 SST-2 (HuggingFace) integerops 91.74% 92.32% -0.62% 265.65 212.21 1.25x
MiniLM L6 H384 uncased SST-2 (HuggingFace) qlinearops 89.45% 90.14% -0.76% 1958.82 1130.40 1.73x
MiniLM L6 H384 uncased SST-2 (HuggingFace) integerops 89.91% 90.14% -0.26% 2022.09 1130.14 1.79x
MiniLM L6 H384 uncased SST-2 (HuggingFace) qlinearops 87.70% 88.29% -0.67% 397.45 212.84 1.87x
MiniLM L6 H384 uncased SST-2 (HuggingFace) integerops 88.19% 88.29% -0.12% 489.19 213.14 2.30x
Electra small discriminator MRPC (HuggingFace) qlinearops 89.92% 89.83% 0.09% 1797.98 1077.51 1.67x
Electra small discriminator MRPC (HuggingFace) integerops 89.27% 89.83% -0.63% 1930.55 1139.74 1.69x
BERT mini MRPC (HuggingFace) qlinearops 86.21% 86.52% -0.35% 5510.81 3334.89 1.65x
BERT mini MRPC (HuggingFace) integerops 86.16% 86.52% -0.41% 5627.19 3365.08 1.67x
Xlnet base cased MRPC (HuggingFace) qlinearops 90.05% 89.86% 0.21% 108.83 92.24 1.18x
Xlnet base cased MRPC (HuggingFace) integerops 89.58% 89.86% -0.31% 110.83 90.80 1.22x
BART large MRPC (HuggingFace) qlinearops 91.77% 91.20% 0.63% 59.18 51.49 1.15x
BART large MRPC (HuggingFace) integerops 92.36% 91.20% 1.28% 96.38 51.47 1.87x
DeBERTa v3 base MRPC (HuggingFace) qlinearops 91.85% 92.23% -0.40% 163.17 146.13 1.12x
DeBERTa v3 base MRPC (HuggingFace) integerops 92.39% 92.23% 0.17% 168.41 145.58 1.16x
Spanbert SQuAD (HuggingFace) qlinearops 91.14 91.98 -0.91% 69.53 42.72 1.63x
Spanbert SQuAD (HuggingFace) integerops 91.40 91.98 -0.63% 79.82 42.58 1.87x
Bert base multilingual cased SQuAD (HuggingFace) qlinearops 88.42 89.13 -0.79% 70.47 42.73 1.65x
Bert base multilingual cased SQuAD (HuggingFace) integerops 88.70 89.13 -0.48% 79.35 42.46 1.87x
DistilBert base uncased SQuAD (HuggingFace) qlinearops 86.33 86.86 -0.62% 113.00 67.85 1.67x
DistilBert base uncased SQuAD (HuggingFace) integerops 86.05 86.86 -0.94% 159.51 67.90 2.35x
BERT large uncased whole word masking SQuAD (HuggingFace) qlinearops 92.34 93.16 -0.88% 24.64 12.75 1.93x
BERT large uncased whole word masking SQuAD (HuggingFace) integerops 92.99 93.16 -0.18% 26.79 12.76 2.10x
Roberta large SQuAD v2 (HuggingFace) qlinearops 89.03 89.02 0.02% 16.91 12.98 1.30x
Roberta large SQuAD v2 (HuggingFace) integerops 89.04 89.02 0.02% 26.80 12.95 2.07x
GPT2 WikiText (HuggingFace) qlinearops 30.25 29.00 4.33% 12.82 9.80 1.31x
GPT2 WikiText (HuggingFace) integerops 29.68 29.00 2.36% 13.68 9.76 1.40x
DistilGPT2 WikiText (HuggingFace) qlinearops 44.93 43.43 3.46% 20.66 16.78 1.23x
DistilGPT2 WikiText (HuggingFace) integerops 44.62 43.43 2.74% 21.97 16.77 1.31x
LayoutLM FUNSD (HuggingFace) qlinearops 78.15% 78.35% -0.25% 59.50 42.98 1.38x
LayoutLM FUNSD (HuggingFace) integerops 77.58% 78.35% -0.98% 64.93 43.20 1.50x
LayoutLMv3 FUNSD (HuggingFace) qlinearops 90.00% 90.49% -0.54% 30.97 27.97 1.11x
LayoutLMv3 FUNSD (HuggingFace) integerops 90.07% 90.49% -0.46% 35.15 27.72 1.27x
LayoutLMv2 (HuggingFace) qlinearops 81.36% 81.17% 0.23% 48.61 38.93 1.25x
LayoutLMv2 (HuggingFace) integerops 80.86% 81.17% -0.39% 45.52 36.10 1.26x
CodeBert (HuggingFace) qlinearops 64.97% 65.41% -0.67% 64.99 44.20 1.47x
CodeBert (HuggingFace) integerops 64.93% 65.41% -0.73% 77.99 43.63 1.79x
Model | Example | INT8 Accuracy | FP32 Accuracy | Accuracy Ratio [(INT8-FP32)/FP32] | INT8 Throughput (samples/sec, 1s56c1ins1bs) | FP32 Throughput (samples/sec, 1s56c1ins1bs) | Performance Ratio [INT8/FP32]
Faster R-CNN (ONNX Model Zoo) qlinearops 34.06% 34.37% -0.88% 4.15 3.47 1.20x
Faster R-CNN (ONNX Model Zoo) qdq 33.98% 34.37% -1.12% 4.19 3.49 1.20x
Mask R-CNN (ONNX Model Zoo) qlinearops 33.13% 33.72% -1.74% 3.46 3.02 1.15x
Mask R-CNN (ONNX Model Zoo) qdq 33.29% 33.72% -1.28% 3.46 3.02 1.15x
FCN (ONNX Model Zoo) qlinearops 64.54% 64.98% -0.67% 28.04 12.59 2.23x
FCN (ONNX Model Zoo) qdq 64.54% 64.98% -0.67% 28.22 12.67 2.23x
GPT-J-6B (HuggingFace) qlinearops 78.46% 79.17% -0.91% 1.74 0.66 2.62x
GPT-J-6B (HuggingFace) integerops 78.93% 79.17% -0.31% 1.68 0.67 2.52x

ONNX Models with ONNX Runtime 1.15.0 in WOQ Mode

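Model name | Configuration | Lambada_openai Accuracy | Lambada_openai Perplexity | Accuracy Ratio [INT4/FP32]
meta-llama/Llama-2-7b-chat-hf | FP32 | 0.7058 | 3.2788 | /
meta-llama/Llama-2-7b-chat-hf | GPTQ W4G32Asym | 0.7002 | 3.4124 | 0.9921
meta-llama/Llama-2-7b-hf | FP32 | 0.7392 | 3.3950 | /
meta-llama/Llama-2-7b-hf | GPTQ W4G32Asym | 0.7312 | 3.5711 | 0.9892
meta-llama/Llama-2-13b-chat-hf | FP32 | 0.7312 | 2.9163 | /
meta-llama/Llama-2-13b-chat-hf | GPTQ W4G128Asym | 0.7240 | 2.9945 | 0.9902
meta-llama/Llama-2-13b-hf | FP32 | 0.7677 | 3.0438 | /
meta-llama/Llama-2-13b-hf | GPTQ W4G128Asym | 0.7634 | 3.1186 | 0.9944
meta-llama/Llama-2-13b-hf | GPTQ W4G32Asym | 0.7615 | 3.1276 | 0.9919
meta-llama/Llama-2-70b-chat-hf | FP32 | 0.7543 | 2.6181 | /
meta-llama/Llama-2-70b-chat-hf | RTN W4G32Asym | 0.7518 | 2.6496 | 0.9967
meta-llama/Llama-2-70b-hf | FP32 | 0.7964 | 2.6612 | /
meta-llama/Llama-2-70b-hf | RTN W4G32Sym | 0.7941 | 2.7243 | 0.9971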
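The Configuration column packs several settings into one string, such as `GPTQ W4G32Asym` or `RTN W4G32Sym`. The sketch below splits such a string into its parts. It assumes the common weight-only quantization naming in which `W4` means 4-bit weights, `G32` means a group size of 32, and `Asym`/`Sym` selects asymmetric or symmetric quantization; the page does not spell this out, so treat the field names and the helper `parse_woq_config` as hypothetical.

```python
import re
from typing import NamedTuple


class WoqConfig(NamedTuple):
    algorithm: str    # e.g. "GPTQ" or "RTN"
    weight_bits: int  # assumed meaning of the "W<n>" part
    group_size: int   # assumed meaning of the "G<n>" part
    symmetric: bool   # True for "Sym", False for "Asym"


_CONFIG = re.compile(
    r"(?P<algo>\w+)\s+W(?P<bits>\d+)G(?P<group>\d+)(?P<scheme>Asym|Sym)"
)


def parse_woq_config(text: str) -> WoqConfig:
    match = _CONFIG.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized configuration: {text!r}")
    return WoqConfig(
        algorithm=match["algo"],
        weight_bits=int(match["bits"]),
        group_size=int(match["group"]),
        symmetric=match["scheme"] == "Sym",
    )


for label in ("GPTQ W4G32Asym", "GPTQ W4G128Asym", "RTN W4G32Sym"):
    print(label, "->", parse_woq_config(label))
```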

Validated Pruning Examples

Model | Task | Dataset | Dense Accuracy | Sparse Accuracy | Relative Drop | Sparsity Ratio | Sparsity Pattern | Comments | Balanced or Unbalanced Ratio
Bert-Mini | question answering | SQuAD-v1.1 | f1=76.87 | f1=76.2 | -0.80% | 80% | structured 4x1 | snip momentum | unbalanced
Bert-Mini | question answering | SQuAD-v1.1 | f1=76.87 | f1=76.2 | -0.80% | 80% | structured 4x1 | snip momentum | unbalanced
Bert-Mini | question answering | SQuAD-v1.1 | f1=76.87 | f1=77.62 | +0.98% | 50% | structured 2:4 | snip momentum | balanced
Distilbert-base-uncased | question answering | SQuAD-v1.1 | f1=86.90 | f1=86.15 | -0.86% | 80% | structured 4x1 | snip momentum | unbalanced
Distilbert-base-uncased | question answering | SQuAD-v1.1 | f1=86.90 | f1=87.50 | +0.69% | 50% | structured 2:4 | snip momentum | balanced
Bert-base-uncased | question answering | SQuAD-v1.1 | f1=88.59 | f1=87.78 | -0.92% | 80% | structured 4x1 | snip momentum | unbalanced
Bert-base-uncased | question answering | SQuAD-v1.1 | f1=88.59 | f1=89.40 | +0.91% | 50% | structured 2:4 | snip momentum | balanced
Bert-large | question answering | SQuAD-v1.1 | f1=91.23 | f1=90.91 | -0.35% | 80% | structured 4x1 | snip momentum | unbalanced
Bert-large | question answering | SQuAD-v1.1 | f1=91.23 | f1=91.67 | +0.48% | 50% | structured 2:4 | snip momentum | balanced
Bert-Mini | text classification | MRPC | f1=87.52 | f1=87.22 | -0.34% | 90% | structured 4x1 | snip momentum | unbalanced
Bert-Mini | text classification | MRPC | f1=87.52 | f1=87.33 | -0.22% | 90% | structured 4x1 | snip momentum | balanced
Bert-Mini | text classification | MRPC | f1=87.52 | f1=86.89 | -0.72% | 50% | structured 2:4 | snip momentum | balanced
Bert-Mini | text classification | MRPC | f1=87.52 | f1=86.8 | -0.83% | 60% | structured per channel | snip momentum | unbalanced
Distilbert-base-uncased | text classification | MRPC | f1=90.26 | f1=89.85 | -0.46% | 90% | structured 4x1 | snip momentum | unbalanced
Distilbert-base-uncased | text classification | MRPC | f1=90.26 | f1=90.88 | +0.69% | 50% | structured 2:4 | snip momentum | balanced
Bert-Mini | text classification | SST-2 | accuracy=87.61 | accuracy=86.92 | -0.79% | 90% | structured 4x1 | snip momentum | unbalanced
Bert-Mini | text classification | SST-2 | accuracy=87.61 | accuracy=87.73 | +0.14% | 50% | structured 2:4 | snip momentum | balanced
Bert-Mini | text classification | SST-2 | accuracy=87.61 | accuracy=86.92 | -0.79% | 50% | structured per channel | snip momentum | unbalanced
ResNet50 | image recognition | ImageNet | top1 acc = 78.95 | top1 acc = 80.10 | -1.43% | 75% | structured 2x1 | snip momentum | unbalanced
YOLO-v5s6 | object detection | COCO | AP0.50:0.95/AP0.50=0.404/0.6 | AP0.50:0.95/AP0.50=0.393/0.584 | -2.72% | 80% | unstructured | snip momentum | unbalanced
Bert-Large | question answering | SQuAD-v1.1 | f1=91.34 | f1=90.7 | -0.07% | 80% | structured 2x1 | group lasso | unbalanced
Bert-Base | text classification | MNLI | [m, mm] = [84.57, 84.79] | [m, mm] = [82.45, 83.27] | [-2.51%, -1.80%] | 70% | unstructured | Prune once for all | balanced
Bert-Base | text classification | MNLI | [m, mm] = [84.57, 84.79] | [m, mm] = [83.20, 84.11] | [-1.62%, -0.80%] | 50% | structured 1:2 | Prune once for all | balanced
Bert-Base | text classification | SST-2 | accuracy = 92.32 | accuracy = 91.51 | -0.88% | 70% | unstructured | Prune once for all | balanced
Bert-Base | text classification | SST-2 | accuracy = 92.32 | accuracy = 92.20 | -0.13% | 50% | structured 1:2 | Prune once for all | balanced
Bert-Base | text classification | SST-2 | accuracy = 92.32 | accuracy = 91.97 | -0.38% | 20% | unstructured | gradient sensitivity | balanced
Bert-Base | text classification | QQP | [accuracy, f1] = [91.10, 88.05] | [accuracy, f1] = [90.48, 87.06] | [-0.68%, -1.12%] | 70% | unstructured | Prune once for all | balanced
Bert-Base | text classification | QQP | [accuracy, f1] = [91.10, 88.05] | [accuracy, f1] = [90.92, 87.78] | [-0.20%, -0.31%] | 50% | structured 1:2 | Prune once for all | balanced
Bert-Base | text classification | QNLI | accuracy = 91.54 | accuracy = 90.39 | -1.26% | 70% | unstructured | Prune once for all | balanced
Bert-Base | text classification | QNLI | accuracy = 91.54 | accuracy = 90.87 | -0.73% | 50% | structured 1:2 | Prune once for all | balanced
Bert-Base | question answering |  | [em, f1] = [79.34, 87.10] | [em, f1] = [77.27, 85.75] | [-2.61%, -1.54%] | 70% | unstructured | Prune once for all | balanced
Bert-Base | question answering |  | [em, f1] = [79.34, 87.10] | [em, f1] = [78.03, 86.50] | [-1.65%, -0.69%] | 50% | structured 1:2 | Prune once for all | balanced
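The Relative Drop column compares the sparse model with its dense baseline. The table does not state the formula; the sketch below assumes it is (sparse - dense) / dense, which reproduces the two rows used here. Other rows may differ slightly from a recomputation, presumably because the published figures were derived from unrounded scores. The function name `relative_change` is hypothetical.

```python
def relative_change(dense: float, sparse: float) -> float:
    """Assumed definition of the Relative Drop column: (sparse - dense) / dense."""
    return (sparse - dense) / dense


# Bert-Mini, SQuAD-v1.1, 50% sparsity, structured 2:4: f1 76.87 -> 77.62.
bert_mini = relative_change(76.87, 77.62)

# Distilbert-base-uncased, SQuAD-v1.1, 80% sparsity, structured 4x1: f1 86.90 -> 86.15.
distilbert = relative_change(86.90, 86.15)

print(f"Bert-Mini 2:4:  {bert_mini:+.2%}")
print(f"Distilbert 4x1: {distilbert:+.2%}")
```

A positive value means the pruned model scored higher than the dense baseline; a negative value is an accuracy loss.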

Validated Knowledge Distillation Examples

Example Name | Dataset | Student (Metrics) | Teacher (Metrics) | Student With Distillation (Metrics Improvement) | Student With Distributed Distillation (Metrics Improvement)
MobileNet example | CIFAR-10 | MobileNetV2-0.35 (0.7965 ACC) | WideResNet40-2 (0.9522 ACC) | 0.8178 ACC (0.0213 ACC) | 0.8235 ACC (0.027 ACC)
CNN example | CIFAR-100 | CNN-2 (0.5494 ACC) | CNN-10 (0.7153 ACC) | 0.5540 ACC (0.0046 ACC) | 0.5523 ACC (0.0029 ACC)
VGG example | CIFAR-100 | VGG-8-BN (0.7022 ACC) | VGG-13-BN (0.7415 ACC) | 0.7025 ACC (0.0003 ACC) | NA
ResNet example | ImageNet | ResNet18 (0.6739 ACC) | ResNet50 (0.7399 ACC) | 0.6845 ACC (0.0106 ACC) | NA
BlendCnn example | MRPC | BlendCnn (0.7034 ACC) | BERT-Base (0.8382 ACC) | 0.7034 ACC (0 ACC) | NA
BiLSTM example | SST-2 | BiLSTM (0.8314 ACC) | RoBERTa-Base (0.9403 ACC) | 0.9048 ACC (0.0734 ACC) | NA
DistilBERT example | SQuAD | DistilBERT (0.7323/0.8256 EM/F1) | BERT-Base (0.8084/0.8814 EM/F1) | 0.7442/0.8371 EM/F1 (0.0119/0.0115 EM/F1) | NA
TinyBERT example | MNLI | TinyBERT (0.8018/0.8044 m/mm) | BERT-Base (0.8363/0.8411 m/mm) | 0.8025/0.8074 m/mm (0.0007/0.0030 m/mm) | NA
BERT-3 example | QQP | BERT-3 (0.8626/0.8213 EM/F1) | BERT-Base (0.9091/0.8782 EM/F1) | 0.8684/0.8259 EM/F1 (0.0058/0.0046 EM/F1) | NA
DistilRoBERTa example | COLA | DistilRoBERTa (0.6057 ACC) | RoBERTa-Large (0.6455 ACC) | 0.6187 ACC (0.0130 ACC) | NA
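The value in parentheses after each distilled result is the absolute gain over the undistilled student. A minimal sketch of that arithmetic, assuming the improvement is simply the distilled metric minus the student metric (this matches the rows used here), is shown below; `improvement` is a hypothetical helper that also handles paired metrics such as EM/F1.

```python
from typing import Sequence, Tuple


def improvement(student: Sequence[float], distilled: Sequence[float]) -> Tuple[float, ...]:
    """Absolute gain per metric: distilled score minus baseline student score."""
    if len(student) != len(distilled):
        raise ValueError("student and distilled must report the same metrics")
    # Round to 4 decimals to match the precision used in the table.
    return tuple(round(d - s, 4) for s, d in zip(student, distilled))


# MobileNet example on CIFAR-10: accuracy 0.7965 -> 0.8178.
print(improvement([0.7965], [0.8178]))

# DistilBERT example on SQuAD: EM/F1 0.7323/0.8256 -> 0.7442/0.8371.
print(improvement([0.7323, 0.8256], [0.7442, 0.8371]))
```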

Validated ONNX QDQ INT8 Models on Multiple Hardware through ONNX Runtime

Model (ONNX QDQ) | AWS c6i.2xlarge (Intel) CPU Execution Provider | AWS c6a.2xlarge (AMD) CPU Execution Provider | AWS c6g.2xlarge (ARM) CPU Execution Provider | NVIDIA A100 CUDA Execution Provider
ResNet50 74.76% 68.95% 74.76% 74.75%
BERT-base 85.54% 84.56% 85.54% 84.31%
ResNet50 V1.5 72.20% 67.70% 72.20% 72.29%
MobileNet V2 65.82% 58.56% 65.83% 65.63%
SSD MobileNet V1 22.45% 16.53% 22.45% 22.35%
DistilBERT base MRPC 84.56% 83.82% 84.56% 84.56%
SqueezeNet 56.54% 53.52% 56.54% 56.55%
SSD 18.63% 18.54% 18.63% 18.61%
AlexNet 54.71% 47.06% 54.71% 54.79%
CaffeNet 56.25% 52.35% 56.27% 56.24%
GoogleNet 67.73% 63.56% 67.72% 67.76%
ZFNet 55.86% 45.09% 55.86% 55.89%
Inception V1 67.21% 63.03% 67.20% 67.21%
SSD MobileNet V1 (ONNX Model Zoo) 22.86% 16.94% 22.80% 22.87%
Mobile bert MRPC 85.54% 84.56% 85.54% 85.54%
Roberta base MRPC 89.46% 90.44% 89.71% 89.71%
ResNet50 V1.5 MLPerf 76.14% 72.80% 76.14% 76.17%
VGG16 66.69% 64.25% 66.69% 66.64%
VGG16 (ONNX Model Zoo) 72.31% 69.35% 72.32% 72.34%
MobileNet V3 MLPerf 75.57% 70.78% 75.56% 75.52%
EfficientNet 77.61% 76.52% 77.56% 77.60%
MobileNet V2 (ONNX Model Zoo) 68.51% 62.48% 68.58% 68.48%
ShuffleNet V2 66.12% 58.41% 66.11% 66.11%
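One way to read this table is to look at how far a model's accuracy spreads across the four targets. The sketch below does this for three rows copied from the table; the 1.0 percentage point threshold is an arbitrary value chosen for illustration, not a criterion from this page.

```python
# Accuracy (%) per target, in column order:
# c6i.2xlarge (Intel), c6a.2xlarge (AMD), c6g.2xlarge (ARM), A100 (CUDA).
results = {
    "ResNet50": [74.76, 68.95, 74.76, 74.75],
    "SSD": [18.63, 18.54, 18.63, 18.61],
    "ZFNet": [55.86, 45.09, 55.86, 55.89],
}

# Hypothetical threshold, in percentage points, for flagging a model.
THRESHOLD = 1.0


def spread(accuracies: list) -> float:
    """Gap between the best and worst accuracy, rounded to 2 decimals."""
    return round(max(accuracies) - min(accuracies), 2)


for model, accuracies in results.items():
    gap = spread(accuracies)
    flag = "check" if gap > THRESHOLD else "ok"
    print(f"{model:10s} spread={gap:5.2f} pp  {flag}")
```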