Validated Models

Intel® Neural Compressor provides validated examples covering multiple compression techniques. Links to typical examples can be found in the example tables, and the performance/accuracy results are available here.

  1. Validated Quantization Examples

    1.1. TensorFlow Models with TensorFlow 2.15.0

    1.2. PyTorch Models with Torch 2.2.1+cpu in PTQ Mode

    1.3. PyTorch Models with Torch 2.2.1+cpu in QAT Mode

    1.4. PyTorch Models with Torch 2.0.1+cpu in WOQ Mode

    1.5. ONNX Models with ONNX Runtime 1.17.1

    1.6. ONNX Models with ONNX Runtime 1.15.0 in WOQ Mode

  2. Validated Pruning Examples

  3. Validated Knowledge Distillation Examples

  4. Validated ONNX QDQ INT8 Models on Multiple Hardware through ONNX Runtime

Validated Quantization Examples

System summary: Tested by Intel on 3/18/2024. 1-node, 1x Intel(R) Xeon(R) Platinum 8480+ @3.8GHz, 56 cores/socket, HT On, Turbo On, Total Memory 256GB (16x16GB DDR5 4800 MT/s [4800 MT/s]), BIOS 3A14.TEL2P1, microcode 0x2b0001b0,
CentOS Stream 8, gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), DL Models, Frameworks: TensorFlow/ONNXRT/PyTorch, Datatype: FP32/INT8/BF16.
Most models are benchmarked using 1 socket, 4 cores/instance, 14 instances, and batch size 1.

Performance varies by use, configuration and other factors.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
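
The tag 1s4c14ins1bs that appears in the table headers abbreviates the benchmark setup described in the system summary (1 socket, 4 cores/instance, 14 instances, batch size 1). As a minimal sketch, the helper below unpacks such a tag and checks how many cores the instances occupy. The function name parse_benchmark_tag is hypothetical and is not part of Intel Neural Compressor.

```python
import re


def parse_benchmark_tag(tag: str) -> dict:
    """Split a tag such as '1s4c14ins1bs' into its four numeric fields."""
    match = re.fullmatch(r"(\d+)s(\d+)c(\d+)ins(\d+)bs", tag)
    if match is None:
        raise ValueError(f"unrecognized benchmark tag: {tag!r}")
    sockets, cores_per_instance, instances, batch_size = map(int, match.groups())
    return {
        "sockets": sockets,
        "cores_per_instance": cores_per_instance,
        "instances": instances,
        "batch_size": batch_size,
    }


config = parse_benchmark_tag("1s4c14ins1bs")
# 4 cores/instance * 14 instances should fill the 56 cores of one socket.
cores_in_use = config["cores_per_instance"] * config["instances"]
print(config, cores_in_use)
```

With 4 cores per instance and 14 instances, the run occupies all 56 cores of the single socket listed in the system summary.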

TensorFlow Models with TensorFlow 2.15.0

Columns, in order: Model, Example, INT8 Accuracy, FP32 Accuracy, Accuracy Ratio [(INT8-FP32)/FP32], INT8 Throughput (samples/sec), FP32 Throughput (samples/sec), Performance Ratio [INT8/FP32]. Throughput is measured at 1s4c14ins1bs.
ResNet50 v1.0 pb 74.11% 74.27% -0.22% 1720.00 582.18 2.95x
ResNet50 v1.5 pb 76.25% 76.46% -0.28% 1517.38 570.65 2.66x
ResNet101 pb 77.52% 76.45% 1.41% 1058.93 382.96 2.77x
Inception V1 pb 70.45% 69.74% 1.03% 2080.56 951.85 2.19x
Inception V2 pb 74.33% 73.97% 0.49% 1587.53 863.37 1.84x
Inception V3 pb 76.72% 76.75% -0.03% 1052.91 434.27 2.42x
Inception V4 pb 80.13% 80.27% -0.18% 707.41 234.38 3.02x
Inception ResNet V2 pb 80.25% 80.40% -0.18% 320.37 179.46 1.79x
MobileNet V1 pb 71.79% 70.96% 1.18% 4312.31 1512.59 2.85x
MobileNet V2 pb 72.48% 71.76% 1.01% 2287.77 1406.75 1.63x
VGG16 pb 72.69% 70.89% 2.55% 1367.34 207.41 6.59x
VGG19 pb 72.67% 71.01% 2.33% 1244.82 176.79 7.04x
ResNetV2 50 pb 70.37% 69.64% 1.05% 780.51 582.96 1.34x
ResNetV2 101 pb 72.64% 71.87% 1.08% 494.43 329.51 1.50x
ResNetV2 152 pb 73.12% 72.37% 1.04% 349.42 235.48 1.48x
DenseNet 161 pb 76.29% 76.29% 0.00% 282.31 223.19 1.26x
SSD ResNet50 V1 pb 37.91% 38.00% -0.24% 139.49 30.99 4.50x
SSD MobileNet V1 pb 23.00% 23.13% -0.57% 1284.41 756.56 1.70x
SSD ResNet50 v1 ckpt 37.88% 38.00% -0.31% 139.56 27.79 5.02x
SSD MobileNet v1 ckpt 22.96% 23.13% -0.71% 1280.88 530.23 2.42x
Faster R-CNN ResNet101 pb 30.32% 30.39% -0.22% 161.19 23.80 6.77x
Faster R-CNN ResNet50 pb 26.61% 26.59% 0.09% 178.89 29.20 6.13x
YOLOv3 pb 83.28% 82.35% 1.12% 249.35 94.44 2.64x
BERT large SQuAD pb 92.44 92.99 -0.58% 46.54 20.37 2.28x
BERT large SQuAD (ONNX Model Zoo) pb 92.36 92.98 -0.67% 42.65 20.79 2.05x
BERT base MRPC ckpt 85.78% 86.52% -0.85% 390.36 212.96 1.83x
VIT pb 81.39% 81.92% -0.64% 230.91 142.24 1.62x
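
The two ratio columns follow directly from the bracketed formulas in the header. The sketch below recomputes them for the ResNet50 v1.0 row as a sanity check; the function names are illustrative only.

```python
def accuracy_ratio(int8_acc: float, fp32_acc: float) -> float:
    """Relative accuracy change, (INT8 - FP32) / FP32, as a percentage."""
    return (int8_acc - fp32_acc) / fp32_acc * 100.0


def performance_ratio(int8_throughput: float, fp32_throughput: float) -> float:
    """Throughput speedup of INT8 over FP32, INT8 / FP32."""
    return int8_throughput / fp32_throughput


# Values copied from the ResNet50 v1.0 row of the TensorFlow table.
acc = accuracy_ratio(74.11, 74.27)
perf = performance_ratio(1720.00, 582.18)
print(f"{acc:.2f}% {perf:.2f}x")  # prints "-0.22% 2.95x", matching the row
```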

PyTorch Models with Torch 2.2.1+cpu in PTQ Mode

Columns, in order: Model, Example, INT8 Accuracy, FP32 Accuracy, Accuracy Ratio [(INT8-FP32)/FP32], INT8 Throughput (samples/sec), FP32 Throughput (samples/sec), Performance Ratio [INT8/FP32]. Throughput is measured at 1s4c14ins1bs.
ResNet18 static 69.59% 69.76% -0.24% 1989.72 600.45 3.31x
ResNet50 static 75.98% 76.15% -0.21% 1165.92 303.91 3.84x
Inception V3 static 69.46% 69.52% -0.09% 953.35 302.52 3.15x
ResNeSt50 static 80.76% 81.04% -0.35% 365.44 39.66 9.21x
ResNeXt101_32x8d static 78.92% 79.31% -0.49% 548.78 104.14 5.27x
Efficientnet_b0 static 76.94% 77.67% -0.94% 636.62 566.42 1.12x
Efficientnet_b3 static 77.78% 78.54% -0.98% 471.61 358.59 1.32x
Peleenet static 71.83% 72.10% -0.37% 790.03 504.44 1.57x
YOLO V3 static 55.10% 54.93% 0.31% 162.98 57.37 2.84x
SSD ResNet34 static 19.48 19.63 -0.77% 137.89 11.61 11.88x
Roberta base MRPC static 92.97% 93.59% -0.66% 390.95 175.44 2.23x
CamemBERT base MRPC static 88.47% 89.28% -0.91% 393.70 174.51 2.26x
DistilBERT base MRPC static 90.30% 90.27% 0.04% 783.37 344.91 2.27x
DistilBERT base MRPC dynamic 90.02% 90.27% -0.28% 684.20 344.68 1.99x
ALBERT base MRPC static 92.63% 92.63% 0.00% 312.48 155.60 2.01x
Funnel MRPC static 91.94% 92.25% -0.34% 281.83 179.04 1.57x
Xlm Roberta MRPC static 89.46% 88.62% 0.94% 395.91 173.59 2.28x
Xlm Roberta MRPC dynamic 88.54% 88.24% 0.35% 373.90 173.91 2.15x
BERT base MRPC static 89.56% 90.42% -0.95% 405.08 176.38 2.30x
BERT base COLA static 52.86% 53.39% -0.99% 395.37 177.37 2.23x
BERT base STSB static 87.39% 88.05% -0.74% 396.71 173.80 2.28x
BERT base SST-2 static 91.97% 92.32% -0.37% 393.20 173.65 2.26x
BERT large COLA static 62.80% 63.35% -0.88% 136.55 51.82 2.64x
BERT base RTE static 73.29% 72.56% 1.00% 377.79 173.84 2.17x
BERT large MRPC static 89.36% 90.38% -1.12% 136.72 51.87 2.64x
BERT large QNLI static 90.79% 91.54% -0.82% 391.67 173.82 2.25x
BERT large RTE static 73.29% 74.01% -0.98% 135.20 51.90 2.61x
BERT large RTE dynamic 73.29% 74.01% -0.98% 117.14 51.74 2.26x
BERT large SQuAD static 92.29 93.16 -0.93% 32.61 16.88 1.93x
lvwerra/pegasus-samsum static 42.32 42.67 -0.82% 93.80 37.59 2.50x

PyTorch Models with Torch 2.2.1+cpu in QAT Mode

Columns, in order: Model, Example, INT8 Accuracy, FP32 Accuracy, Accuracy Ratio [(INT8-FP32)/FP32], INT8 Throughput (samples/sec), FP32 Throughput (samples/sec), Performance Ratio [INT8/FP32]. Throughput is measured at 1s4c14ins1bs.
ResNet18 static 69.74% 69.76% -0.03% 1981.66 598.39 3.31x
ResNet50 static 76.03% 76.15% -0.15% 1095.95 298.92 3.67x
ResNeXt101_32x8d static 79.31% 79.31% 0.00% 549.02 103.72 5.29x
BERT base MRPC static 89.40% 90.40% -1.11% 375.61 176.15 2.13x

PyTorch Models with Torch 2.0.1+cpu in WOQ Mode

Model name | Configuration | Lambada_openai Accuracy | Hellaswag Accuracy | Winogrande Accuracy | Piqa Accuracy | Average Accuracy [mean accuracy of previous four tasks] | Accuracy Ratio [INT4/FP32] | Wikitext Word_perplexity
EleutherAI/gpt-j-6b | FP32 | 0.6831 | 0.4954 | 0.6409 | 0.7541 | 0.6434 | / | 10.8816
EleutherAI/gpt-j-6b | GPTQ W4G128Asym | 0.679 | 0.4895 | 0.6433 | 0.7476 | 0.6399 | 0.9945 | 11.0999
EleutherAI/gpt-j-6b | GPTQ W4G32Asym | 0.6829 | 0.4923 | 0.6401 | 0.7486 | 0.6410 | 0.9963 | 11.0141
EleutherAI/gpt-j-6b | GPTQ W4G128Sym | 0.685 | 0.4907 | 0.6361 | 0.7443 | 0.6390 | 0.9932 | 11.1498
EleutherAI/gpt-j-6b | GPTQ W4G32Sym | 0.6911 | 0.4899 | 0.6448 | 0.7497 | 0.6439 | 1.0008 | 11.0927
facebook/opt-6.7b | FP32 | 0.6769 | 0.5049 | 0.6543 | 0.7628 | 0.6497 | / | 12.2862
facebook/opt-6.7b | GPTQ W4G32Asym | 0.6804 | 0.4984 | 0.6535 | 0.7568 | 0.6473 | 0.9962 | 12.4193
facebook/opt-6.7b | GPTQ W4G32Sym | 0.6885 | 0.4973 | 0.6433 | 0.753 | 0.6455 | 0.9935 | 12.4607
decapoda-research/llama-7b-hf | FP32 | 0.7361 | 0.5642 | 0.6709 | 0.7835 | 0.6887 | / | 9.4202
decapoda-research/llama-7b-hf | GPTQ W4G32Asym | 0.7244 | 0.5603 | 0.6614 | 0.7835 | 0.6824 | 0.9909 | 9.5881
decapoda-research/llama-13b-hf | FP32 | 0.7627 | 0.5911 | 0.7009 | 0.7878 | 0.7106 | / | 8.212
decapoda-research/llama-13b-hf | GPTQ W4G128Asym | 0.7518 | 0.5843 | 0.6961 | 0.7911 | 0.7058 | 0.9932 | 8.4319
decapoda-research/llama-13b-hf | GPTQ W4G32Asym | 0.7572 | 0.5898 | 0.7056 | 0.7894 | 0.7105 | 0.9998 | 8.3429
decapoda-research/llama-13b-hf | GPTQ W4G128Sym | 0.7596 | 0.5841 | 0.6977 | 0.7905 | 0.7080 | 0.9963 | 8.4916
decapoda-research/llama-30b-hf | FP32 | 0.7759 | 0.6266 | 0.7277 | 0.8096 | 0.7350 | / | 6.2384
decapoda-research/llama-30b-hf | GPTQ W4G128Asym | 0.778 | 0.624 | 0.7269 | 0.8047 | 0.7334 | 0.9979 | 6.4237
decapoda-research/llama-30b-hf | GPTQ W4G32Asym | 0.7706 | 0.6239 | 0.7285 | 0.8058 | 0.7322 | 0.9963 | 6.4697
decapoda-research/llama-30b-hf | GPTQ W4G128Sym | 0.7836 | 0.6195 | 0.7269 | 0.8047 | 0.7337 | 0.9983 | 6.5604
meta-llama/Llama-2-7b-chat-hf | FP32 | 0.7058 | 0.5732 | 0.648 | 0.7715 | 0.6746 | / | 11.7107
meta-llama/Llama-2-7b-chat-hf | GPTQ W4G128Asym | 0.6982 | 0.5637 | 0.6527 | 0.7704 | 0.6713 | 0.9950 | 11.9702
meta-llama/Llama-2-7b-chat-hf | GPTQ W4G32Asym | 0.6953 | 0.5682 | 0.6575 | 0.7758 | 0.6742 | 0.9994 | 11.9317
meta-llama/Llama-2-7b-hf | FP32 | 0.7392 | 0.567 | 0.6709 | 0.7835 | 0.6902 | / | 8.7911
meta-llama/Llama-2-7b-hf | GPTQ W4G32Asym | 0.7353 | 0.5642 | 0.6622 | 0.7829 | 0.6862 | 0.9942 | 8.9635
meta-llama/Llama-2-7b-hf | GPTQ W4G128Sym | 0.7246 | 0.5617 | 0.6756 | 0.7797 | 0.6854 | 0.9931 | 9.2799
meta-llama/Llama-2-13b-chat-hf | FP32 | 0.7312 | 0.6059 | 0.7103 | 0.7835 | 0.7077 | / | 10.2213
meta-llama/Llama-2-13b-chat-hf | GPTQ W4G128Asym | 0.7273 | 0.6018 | 0.7088 | 0.7742 | 0.7030 | 0.9934 | 2538.083
meta-llama/Llama-2-13b-chat-hf | GPTQ W4G32Asym | 0.7283 | 0.6053 | 0.7024 | 0.7764 | 0.7031 | 0.9935 | 1889.374
meta-llama/Llama-2-13b-chat-hf | GPTQ W4G128Sym | 0.727 | 0.5997 | 0.7024 | 0.778 | 0.7018 | 0.9916 | 2504.497
meta-llama/Llama-2-13b-hf | FP32 | 0.7677 | 0.5972 | 0.6961 | 0.7878 | 0.7122 | / | 7.8984
meta-llama/Llama-2-13b-hf | GPTQ W4G128Asym | 0.7627 | 0.5933 | 0.689 | 0.7851 | 0.7075 | 0.9934 | 1556.448
meta-llama/Llama-2-13b-hf | GPTQ W4G32Asym | 0.7675 | 0.5934 | 0.6977 | 0.7856 | 0.7111 | 0.9984 | 1514.927
meta-llama/Llama-2-13b-hf | GPTQ W4G128Sym | 0.7566 | 0.5899 | 0.7032 | 0.7856 | 0.7088 | 0.9953 | 1374.728
bigscience/bloom-7b1 | FP32 | 0.5764 | 0.4628 | 0.6456 | 0.7269 | 0.6029 | / | 30.6438
bigscience/bloom-7b1 | GPTQ W4G32Sym | 0.5799 | 0.4542 | 0.6361 | 0.7312 | 0.6004 | 0.9957 | 32.0626
bigscience/bloomz-7b1 | FP32 | 0.5593 | 0.4789 | 0.6527 | 0.7628 | 0.6134 | / | 51.7432
bigscience/bloomz-7b1 | GPTQ W4G32Asym | 0.5525 | 0.4731 | 0.6504 | 0.7617 | 0.6094 | 0.9935 | 52.7828
databricks/dolly-v1-6b | FP32 | 0.6866 | 0.5098 | 0.6433 | 0.7622 | 0.6505 | / | 11.3242
databricks/dolly-v1-6b | GPTQ W4G128Asym | 0.6878 | 0.5058 | 0.6393 | 0.7633 | 0.6491 | 0.9978 | 11.5514
databricks/dolly-v1-6b | GPTQ W4G32Asym | 0.6864 | 0.5084 | 0.6519 | 0.7568 | 0.6509 | 1.0006 | 11.4728
databricks/dolly-v1-6b | GPTQ W4G128Sym | 0.6876 | 0.5045 | 0.6433 | 0.7541 | 0.6474 | 0.9952 | 11.6474
databricks/dolly-v2-7b | FP32 | 0.6379 | 0.5282 | 0.614 | 0.7448 | 0.6312 | / | 16.161
databricks/dolly-v2-7b | GPTQ W4G32Asym | 0.6377 | 0.5228 | 0.5991 | 0.7448 | 0.6261 | 0.9919 | 16.4096
EleutherAI/gpt-neo-2.7b | FP32 | 0.6224 | 0.4271 | 0.577 | 0.722 | 0.5871 | / | 13.9359
EleutherAI/gpt-neo-2.7b | GPTQ W4G128Asym | 0.6123 | 0.4227 | 0.5738 | 0.7203 | 0.5823 | 0.9917 | 14.3377
EleutherAI/gpt-neo-2.7b | GPTQ W4G32Asym | 0.615 | 0.4259 | 0.5714 | 0.7247 | 0.5843 | 0.9951 | 14.2083
EleutherAI/gpt-neo-2.7b | GPTQ W4G32Sym | 0.6154 | 0.4208 | 0.5777 | 0.7198 | 0.5834 | 0.9937 | 14.3121
EleutherAI/gpt-neox-20b | FP32 | 0.7233 | 0.5359 | 0.6614 | 0.7753 | 0.6740 | / | 9.195
EleutherAI/gpt-neox-20b | GPTQ W4G128Asym | 0.7186 | 0.5328 | 0.6535 | 0.7699 | 0.6687 | 0.9922 | 9.3463
EleutherAI/gpt-neox-20b | GPTQ W4G32Asym | 0.7268 | 0.533 | 0.659 | 0.7715 | 0.6726 | 0.9979 | 9.2897
mosaicml/mpt-7b | FP32 | 0.7056 | 0.5718 | 0.6859 | 0.7927 | 0.6890 | / | 9.9324
mosaicml/mpt-7b | GPTQ W4G128Asym | 0.7006 | 0.5655 | 0.6803 | 0.7965 | 0.6857 | 0.9952 | 10.1515
mosaicml/mpt-7b-chat | FP32 | 0.655 | 0.5752 | 0.6748 | 0.7845 | 0.6724 | / | 13.5951
mosaicml/mpt-7b-chat | GPTQ W4G128Asym | 0.6472 | 0.5716 | 0.6685 | 0.784 | 0.6678 | 0.9932 | 13.8539
mosaicml/mpt-7b-instruct | FP32 | 0.6918 | 0.5819 | 0.678 | 0.7927 | 0.6861 | / | 10.8863
mosaicml/mpt-7b-instruct | GPTQ W4G128Asym | 0.6864 | 0.5765 | 0.6827 | 0.7873 | 0.6832 | 0.9958 | 11.1451
mosaicml/mpt-7b-storywriter | FP32 | 0.693 | 0.5477 | 0.663 | 0.784 | 0.6719 | / | 9.9125
mosaicml/mpt-7b-storywriter | GPTQ W4G128Asym | 0.6854 | 0.5443 | 0.6661 | 0.7813 | 0.6693 | 0.9961 | 10.1137
tiiuae/falcon-rw-7b | FP32 | 0.6604 | 0.5419 | 0.6598 | 0.7753 | 0.6594 | / | 11.7616
tiiuae/falcon-rw-7b | GPTQ W4G128Asym | 0.6484 | 0.5369 | 0.6575 | 0.7807 | 0.6559 | 0.9947 | 11.9411
tiiuae/falcon-rw-7b | GPTQ W4G32Asym | 0.6571 | 0.5398 | 0.6582 | 0.7764 | 0.6579 | 0.9978 | 11.8809
tiiuae/falcon-rw-7b | GPTQ W4G128Sym | 0.652 | 0.535 | 0.6575 | 0.7682 | 0.6532 | 0.9906 | 12.0048
tiiuae/falcon-7b-instruct | FP32 | 0.6437 | 0.5177 | 0.6669 | 0.7824 | 0.6527 | / | 14.5053
tiiuae/falcon-7b-instruct | GPTQ W4G128Asym | 0.6301 | 0.5142 | 0.6654 | 0.7835 | 0.6483 | 0.9933 | 14.8146
tiiuae/falcon-7b-instruct | GPTQ W4G32Asym | 0.6377 | 0.517 | 0.6598 | 0.7807 | 0.6488 | 0.9941 | 14.6953
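
The Average and Accuracy Ratio columns can be recomputed from the four per-task accuracies. The sketch below does this for the EleutherAI/gpt-j-6b FP32 and GPTQ W4G128Asym results, assuming the ratio is taken between the unrounded averages (this assumption reproduces the published 0.9945).

```python
from statistics import mean

# Lambada_openai, Hellaswag, Winogrande, Piqa accuracies for EleutherAI/gpt-j-6b.
fp32_scores = [0.6831, 0.4954, 0.6409, 0.7541]
int4_scores = [0.679, 0.4895, 0.6433, 0.7476]  # GPTQ W4G128Asym

fp32_avg = mean(fp32_scores)   # table reports 0.6434
int4_avg = mean(int4_scores)   # table reports 0.6399
ratio = int4_avg / fp32_avg    # Accuracy Ratio [INT4/FP32]

print(f"FP32 average {fp32_avg:.6f}, INT4 average {int4_avg:.6f}, ratio {ratio:.4f}")
```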

ONNX Models with ONNX Runtime 1.17.1

Columns, in order: Model, Example, INT8 Accuracy, FP32 Accuracy, Accuracy Ratio [(INT8-FP32)/FP32], INT8 Throughput (samples/sec), FP32 Throughput (samples/sec), Performance Ratio [INT8/FP32]. Throughput is measured at 1s4c14ins1bs.
ResNet50 V1.5 qlinearops 72.16% 72.29% -0.18% 1666.73 734.16 2.27x
ResNet50 V1.5 qdq 72.19% 72.29% -0.15% 1658.10 734.33 2.26x
ResNet50 V1.5 MLPerf qlinearops 76.15% 76.46% -0.41% 1495.15 733.59 2.04x
ResNet50 V1.5 MLPerf qdq 76.12% 76.46% -0.44% 1661.90 732.04 2.27x
ResNet50 V1.5 (ONNX Model Zoo) qlinearops 74.77% 74.99% -0.29% 1713.86 767.91 2.23x
ResNet50 V1.5 (ONNX Model Zoo) qdq 74.48% 74.99% -0.67% 1747.21 770.14 2.27x
MobileNet V2 qlinearops 65.55% 66.89% -2.01% 7519.95 4430.84 1.70x
MobileNet V2 qdq 65.60% 66.89% -1.93% 7572.97 4413.58 1.72x
MobileNet V2 (ONNX Model Zoo) qlinearops 68.51% 69.48% -1.41% 7190.26 4019.16 1.79x
VGG16 qlinearops 66.55% 66.69% -0.20% 613.47 170.95 3.59x
VGG16 qdq 66.62% 66.69% -0.11% 611.78 186.21 3.29x
VGG16 (ONNX Model Zoo) qlinearops 72.37% 72.40% -0.04% 619.00 184.35 3.36x
VGG16 (ONNX Model Zoo) qdq 72.37% 72.40% -0.03% 623.02 172.27 3.62x
MobileNet V3 MLPerf qlinearops 75.51% 75.74% -0.30% 5711.04 2584.17 2.21x
MobileNet V3 MLPerf qdq 75.51% 75.74% -0.30% 6136.36 2630.21 2.33x
ShuffleNet V2 (ONNX Model Zoo) qlinearops 66.13% 66.36% -0.36% 6820.89 3686.46 1.85x
GoogleNet (ONNX Model Zoo) qlinearops 67.69% 67.79% -0.14% 1971.18 1120.08 1.76x
GoogleNet (ONNX Model Zoo) qdq 67.64% 67.79% -0.22% 1838.28 1142.35 1.61x
SqueezeNet (ONNX Model Zoo) qlinearops 56.49% 56.87% -0.67% 10163.13 5771.89 1.76x
SqueezeNet (ONNX Model Zoo) qdq 56.33% 56.87% -0.94% 10339.14 6002.84 1.72x
CaffeNet (ONNX Model Zoo) qlinearops 56.26% 56.30% -0.07% 2805.96 1077.80 2.60x
CaffeNet (ONNX Model Zoo) qdq 56.18% 56.30% -0.21% 4351.65 822.71 5.29x
AlexNet (ONNX Model Zoo) qlinearops 54.73% 54.79% -0.10% 2169.83 893.06 2.43x
AlexNet (ONNX Model Zoo) qdq 54.74% 54.79% -0.08% 2232.07 841.46 2.65x
ZFNet (ONNX Model Zoo) qlinearops 55.83% 55.96% -0.24% 921.09 525.21 1.75x
ZFNet (ONNX Model Zoo) qdq 55.82% 55.96% -0.24% 925.69 534.05 1.73x
Inception V1 (ONNX Model Zoo) qlinearops 67.23% 67.24% -0.02% 1862.37 1161.55 1.60x
Inception V1 (ONNX Model Zoo) qdq 67.19% 67.24% -0.07% 1956.47 1262.64 1.55x
EfficientNet (ONNX Model Zoo) qlinearops 77.02% 77.11% -0.12% 2793.23 1383.39 2.02x
BEIT qlinearops 85.07 85.28 -0.25% 206.50 128.13 1.61x
SSD (ONNX Model Zoo) qdq 18.62% 18.98% -1.90% 56.97 14.57 3.91x
DUC (ONNX Model Zoo) qlinearops 81.62% 81.92% -0.37% 8.76 5.03 1.74x
Ultra Face (ONNX Model Zoo) qlinearops 83.33% 83.65% -0.38% 8780.52 1920.30 4.57x
Emotion FERPlus (ONNX Model Zoo) qlinearops 7.94% 8.00% -0.70% 6360.85 3067.12 2.07x
ArcFace (ONNX Model Zoo) qlinearops 99.82% 99.80% 0.02% 449.50 235.01 1.91x
BERT base MRPC qlinearops 85.78% 86.03% -0.28% 511.36 225.15 2.27x
BERT base MRPC qdq 85.78% 86.03% -0.28% 484.44 222.43 2.18x
BERT base MRPC integerops 85.78% 86.03% -0.28% 728.48 222.35 3.28x
DistilBERT base MRPC qdq 85.05% 84.56% 0.58% 635.93 405.58 1.57x
DistilBERT base MRPC integerops 85.29% 84.56% 0.87% 1324.26 405.48 3.27x
Roberta base MRPC qdq 88.24% 89.95% -1.91% 484.00 223.37 2.17x
BERT SQuAD (ONNX Model Zoo) integerops 80.29 80.67 -0.47% 244.93 99.29 2.47x
BERT base cased MRPC (HuggingFace) qlinearops 90.21% 90.42% -0.23% 440.17 214.15 2.06x
BERT base uncased MRPC (HuggingFace) integerops 89.58% 90.42% -0.93% 715.22 201.24 3.55x
Roberta base MRPC (HuggingFace) qlinearops 91.00% 91.38% -0.41% 434.48 214.20 2.03x
Roberta base MRPC (HuggingFace) integerops 90.85% 91.38% -0.58% 714.20 213.54 3.34x
XLM Roberta base MRPC (HuggingFace) qlinearops 89.37% 90.10% -0.81% 339.02 214.41 1.58x
XLM Roberta base MRPC (HuggingFace) integerops 89.66% 90.10% -0.50% 406.04 215.12 1.89x
Camembert base MRPC (HuggingFace) integerops 89.19% 89.28% -0.10% 712.67 217.68 3.27x
MiniLM L12 H384 uncased MRPC (HuggingFace) qlinearops 90.13% 90.97% -0.93% 1209.98 588.93 2.05x
MiniLM L12 H384 uncased MRPC (HuggingFace) integerops 91.07% 90.97% 0.10% 1268.43 588.05 2.16x
DistilBERT base uncased SST-2 (HuggingFace) qlinearops 90.71% 91.06% -0.38% 1253.85 399.52 3.14x
DistilBERT base uncased SST-2 (HuggingFace) integerops 90.25% 91.06% -0.88% 925.68 399.54 2.32x
MiniLM L6 H384 uncased SST-2 (HuggingFace) qlinearops 89.45% 90.14% -0.76% 2209.72 1139.62 1.94x
MiniLM L6 H384 uncased SST-2 (HuggingFace) integerops 89.91% 90.14% -0.26% 2365.97 1137.32 2.08x
BERT base cased MRPC (HuggingFace) qlinearops 87.70% 88.29% -0.67% 497.73 214.32 2.32x
BERT base cased MRPC (HuggingFace) integerops 88.19% 88.29% -0.12% 718.26 214.32 3.35x
Electra small discriminator MRPC (HuggingFace) qlinearops 89.92% 89.83% 0.09% 1951.07 1142.89 1.71x
Electra small discriminator MRPC (HuggingFace) integerops 89.27% 89.83% -0.63% 2198.93 1129.20 1.95x
BERT mini MRPC (HuggingFace) qlinearops 86.21% 86.52% -0.35% 5814.17 3388.02 1.72x
BERT mini MRPC (HuggingFace) integerops 86.16% 86.52% -0.41% 6396.89 3445.06 1.86x
BART large MRPC (HuggingFace) integerops 92.36% 91.20% 1.28% 126.31 52.28 2.42x
Spanbert SQuAD (HuggingFace) qlinearops 91.14 91.98 -0.91% 75.86 43.48 1.74x
Spanbert SQuAD (HuggingFace) integerops 91.40 91.98 -0.63% 92.24 43.51 2.12x
Bert base multilingual cased SQuAD (HuggingFace) qlinearops 88.42 89.13 -0.79% 79.06 43.45 1.82x
Bert base multilingual cased SQuAD (HuggingFace) integerops 88.70 89.13 -0.48% 93.03 43.23 2.15x
DistilBert base uncased SQuAD (HuggingFace) qlinearops 86.33 86.86 -0.62% 118.68 68.43 1.73x
DistilBert base uncased SQuAD (HuggingFace) integerops 86.05 86.86 -0.94% 186.33 68.41 2.72x
BERT large uncased whole word masking SQuAD (HuggingFace) qlinearops 92.34 93.16 -0.88% 28.67 13.12 2.19x
BERT large uncased whole word masking SQuAD (HuggingFace) integerops 92.99 93.16 -0.18% 32.32 13.14 2.46x
Roberta large SQuAD v2 (HuggingFace) integerops 89.04 89.02 0.02% 32.37 13.40 2.42x
LayoutLMv3 FUNSD (HuggingFace) qlinearops 89.66% 90.49% -0.91% 47.60 27.28 1.74x
LayoutLMv3 FUNSD (HuggingFace) integerops 89.95% 90.49% -0.59% 56.26 27.43 2.05x
LayoutLMv2 (HuggingFace) qlinearops 80.95% 81.17% -0.27% 64.14 38.91 1.65x
LayoutLMv2 (HuggingFace) integerops 80.60% 81.17% -0.71% 67.01 38.84 1.73x

ONNX Models with ONNX Runtime 1.15.0 in WOQ Mode

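Model name | Configuration | Lambada_openai Accuracy | Lambada_openai Perplexity | Accuracy Ratio [INT4/FP32]
meta-llama/Llama-2-7b-chat-hf | FP32 | 0.7058 | 3.2788 | /
meta-llama/Llama-2-7b-chat-hf | GPTQ W4G32Asym | 0.7002 | 3.4124 | 0.9921
meta-llama/Llama-2-7b-hf | FP32 | 0.7392 | 3.3950 | /
meta-llama/Llama-2-7b-hf | GPTQ W4G32Asym | 0.7312 | 3.5711 | 0.9892
meta-llama/Llama-2-13b-chat-hf | FP32 | 0.7312 | 2.9163 | /
meta-llama/Llama-2-13b-chat-hf | GPTQ W4G128Asym | 0.7240 | 2.9945 | 0.9902
meta-llama/Llama-2-13b-hf | FP32 | 0.7677 | 3.0438 | /
meta-llama/Llama-2-13b-hf | GPTQ W4G128Asym | 0.7634 | 3.1186 | 0.9944
meta-llama/Llama-2-13b-hf | GPTQ W4G32Asym | 0.7615 | 3.1276 | 0.9919
meta-llama/Llama-2-70b-chat-hf | FP32 | 0.7543 | 2.6181 | /
meta-llama/Llama-2-70b-chat-hf | RTN W4G32Asym | 0.7518 | 2.6496 | 0.9967
meta-llama/Llama-2-70b-hf | FP32 | 0.7964 | 2.6612 | /
meta-llama/Llama-2-70b-hf | RTN W4G32Sym | 0.7941 | 2.7243 | 0.9971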
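
The configuration labels in the WOQ tables (for example W4G32Asym) are not spelled out in this document. Assuming they follow the common reading of weight bit width, group size, and symmetric or asymmetric scheme, a small parser could look like the sketch below. WoqConfig and parse_woq_label are hypothetical names, not Intel Neural Compressor APIs.

```python
import re
from typing import NamedTuple


class WoqConfig(NamedTuple):
    bits: int          # weight bit width, e.g. 4
    group_size: int    # number of weights sharing one scale, e.g. 32 or 128
    symmetric: bool    # True for "Sym", False for "Asym"


def parse_woq_label(label: str) -> WoqConfig:
    """Parse labels of the assumed form W<bits>G<group size><Sym|Asym>."""
    match = re.fullmatch(r"W(\d+)G(\d+)(Sym|Asym)", label)
    if match is None:
        raise ValueError(f"unrecognized WOQ label: {label!r}")
    bits, group_size, scheme = match.groups()
    return WoqConfig(int(bits), int(group_size), scheme == "Sym")


print(parse_woq_label("W4G32Asym"))
print(parse_woq_label("W4G128Sym"))
```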

Validated Pruning Examples

Model | Task | Dataset | Dense Accuracy | Sparse Accuracy | Relative Drop | Sparsity Ratio | Sparsity Pattern | Comments | Balanced or unbalanced ratio
Bert-Mini | question answering | SQuAD-v1.1 | f1=76.87 | f1=76.2 | -0.80% | 80% | structured 4x1 | snip momentum | unbalanced
Bert-Mini | question answering | SQuAD-v1.1 | f1=76.87 | f1=76.2 | -0.80% | 80% | structured 4x1 | snip momentum | unbalanced
Bert-Mini | question answering | SQuAD-v1.1 | f1=76.87 | f1=77.62 | +0.98% | 50% | structured 2:4 | snip momentum | balanced
Distilbert-base-uncased | question answering | SQuAD-v1.1 | f1=86.90 | f1=86.15 | -0.86% | 80% | structured 4x1 | snip momentum | unbalanced
Distilbert-base-uncased | question answering | SQuAD-v1.1 | f1=86.90 | f1=87.50 | +0.69% | 50% | structured 2:4 | snip momentum | balanced
Bert-base-uncased | question answering | SQuAD-v1.1 | f1=88.59 | f1=87.78 | -0.92% | 80% | structured 4x1 | snip momentum | unbalanced
Bert-base-uncased | question answering | SQuAD-v1.1 | f1=88.59 | f1=89.40 | +0.91% | 50% | structured 2:4 | snip momentum | balanced
Bert-large | question answering | SQuAD-v1.1 | f1=91.23 | f1=90.91 | -0.35% | 80% | structured 4x1 | snip momentum | unbalanced
Bert-large | question answering | SQuAD-v1.1 | f1=91.23 | f1=91.67 | +0.48% | 50% | structured 2:4 | snip momentum | balanced
Bert-Mini | text classification | MRPC | f1=87.52 | f1=87.22 | -0.34% | 90% | structured 4x1 | snip momentum | unbalanced
Bert-Mini | text classification | MRPC | f1=87.52 | f1=87.33 | -0.22% | 90% | structured 4x1 | snip momentum | balanced
Bert-Mini | text classification | MRPC | f1=87.52 | f1=86.89 | -0.72% | 50% | structured 2:4 | snip momentum | balanced
Bert-Mini | text classification | MRPC | f1=87.52 | f1=86.8 | -0.83% | 60% | structured per channel | snip momentum | unbalanced
Distilbert-base-uncased | text classification | MRPC | f1=90.26 | f1=89.85 | -0.46% | 90% | structured 4x1 | snip momentum | unbalanced
Distilbert-base-uncased | text classification | MRPC | f1=90.26 | f1=90.88 | +0.69% | 50% | structured 2:4 | snip momentum | balanced
Bert-Mini | text classification | SST-2 | accuracy=87.61 | accuracy=86.92 | -0.79% | 90% | structured 4x1 | snip momentum | unbalanced
Bert-Mini | text classification | SST-2 | accuracy=87.61 | accuracy=87.73 | +0.14% | 50% | structured 2:4 | snip momentum | balanced
Bert-Mini | text classification | SST-2 | accuracy=87.61 | accuracy=86.92 | -0.79% | 50% | structured per channel | snip momentum | unbalanced
ResNet50 | image recognition | ImageNet | top1 acc = 78.95 | top1 acc = 80.10 | -1.43% | 75% | structured 2x1 | snip momentum | unbalanced
YOLO-v5s6 | object detection | COCO | AP0.50:0.95/AP0.50=0.404/0.6 | AP0.50:0.95/AP0.50=0.393/0.584 | -2.72% | 80% | unstructured | snip momentum | unbalanced
Bert-Large | question answering | SQuAD-v1.1 | f1=91.34 | f1=90.7 | -0.07% | 80% | structured 2x1 | group lasso | unbalanced
Bert-Base | text classification | MNLI | [m, mm] = [84.57, 84.79] | [m, mm] = [82.45, 83.27] | [-2.51%, -1.80%] | 70% | unstructured | Prune once for all | balanced
Bert-Base | text classification | MNLI | [m, mm] = [84.57, 84.79] | [m, mm] = [83.20, 84.11] | [-1.62%, -0.80%] | 50% | structured 1:2 | Prune once for all | balanced
Bert-Base | text classification | SST-2 | accuracy = 92.32 | accuracy = 91.51 | -0.88% | 70% | unstructured | Prune once for all | balanced
Bert-Base | text classification | SST-2 | accuracy = 92.32 | accuracy = 92.20 | -0.13% | 50% | structured 1:2 | Prune once for all | balanced
Bert-Base | text classification | SST-2 | accuracy = 92.32 | accuracy = 91.97 | -0.38% | 20% | unstructured | gradient sensitivity | balanced
Bert-Base | text classification | QQP | [accuracy, f1] = [91.10, 88.05] | [accuracy, f1] = [90.48, 87.06] | [-0.68%, -1.12%] | 70% | unstructured | Prune once for all | balanced
Bert-Base | text classification | QQP | [accuracy, f1] = [91.10, 88.05] | [accuracy, f1] = [90.92, 87.78] | [-0.20%, -0.31%] | 50% | structured 1:2 | Prune once for all | balanced
Bert-Base | text classification | QNLI | accuracy = 91.54 | accuracy = 90.39 | -1.26% | 70% | unstructured | Prune once for all | balanced
Bert-Base | text classification | QNLI | accuracy = 91.54 | accuracy = 90.87 | -0.73% | 50% | structured 1:2 | Prune once for all | balanced
Bert-Base | question answering | | [em, f1] = [79.34, 87.10] | [em, f1] = [77.27, 85.75] | [-2.61%, -1.54%] | 70% | unstructured | Prune once for all | balanced
Bert-Base | question answering | | [em, f1] = [79.34, 87.10] | [em, f1] = [78.03, 86.50] | [-1.65%, -0.69%] | 50% | structured 1:2 | Prune once for all | balanced
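
The table does not define Relative Drop explicitly. Assuming it is the change from the dense to the sparse score relative to the dense score, (sparse - dense) / dense, the sketch below reproduces the two Distilbert-base-uncased SQuAD-v1.1 rows. Not every row can be reproduced exactly from the rounded scores shown, so treat this as an illustration rather than the authors' exact procedure.

```python
def relative_drop(dense: float, sparse: float) -> float:
    """Assumed definition: (sparse - dense) / dense, as a percentage."""
    return (sparse - dense) / dense * 100.0


# Distilbert-base-uncased, question answering, SQuAD-v1.1 (dense f1=86.90).
drop_4x1 = relative_drop(86.90, 86.15)   # 80% sparsity, structured 4x1
drop_2to4 = relative_drop(86.90, 87.50)  # 50% sparsity, structured 2:4

print(f"{drop_4x1:+.2f}% {drop_2to4:+.2f}%")  # prints "-0.86% +0.69%"
```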

Validated Knowledge Distillation Examples

Example Name | Dataset | Student (Metrics) | Teacher (Metrics) | Student With Distillation (Metrics Improvement) | Student With Distributed Distillation (Metrics Improvement)
MobileNet example | CIFAR-10 | MobileNetV2-0.35 (0.7965 ACC) | WideResNet40-2 (0.9522 ACC) | 0.8178 ACC (0.0213 ACC) | 0.8235 ACC (0.027 ACC)
CNN example | CIFAR-100 | CNN-2 (0.5494 ACC) | CNN-10 (0.7153 ACC) | 0.5540 ACC (0.0046 ACC) | 0.5523 ACC (0.0029 ACC)
VGG example | CIFAR-100 | VGG-8-BN (0.7022 ACC) | VGG-13-BN (0.7415 ACC) | 0.7025 ACC (0.0003 ACC) | NA
ResNet example | ImageNet | ResNet18 (0.6739 ACC) | ResNet50 (0.7399 ACC) | 0.6845 ACC (0.0106 ACC) | NA
BlendCnn example | MRPC | BlendCnn (0.7034 ACC) | BERT-Base (0.8382 ACC) | 0.7034 ACC (0 ACC) | NA
BiLSTM example | SST-2 | BiLSTM (0.8314 ACC) | RoBERTa-Base (0.9403 ACC) | 0.9048 ACC (0.0734 ACC) | NA
DistilBERT example | SQuAD | DistilBERT (0.7323/0.8256 EM/F1) | BERT-Base (0.8084/0.8814 EM/F1) | 0.7442/0.8371 EM/F1 (0.0119/0.0115 EM/F1) | NA
TinyBERT example | MNLI | TinyBERT (0.8018/0.8044 m/mm) | BERT-Base (0.8363/0.8411 m/mm) | 0.8025/0.8074 m/mm (0.0007/0.0030 m/mm) | NA
BERT-3 example | QQP | BERT-3 (0.8626/0.8213 EM/F1) | BERT-Base (0.9091/0.8782 EM/F1) | 0.8684/0.8259 EM/F1 (0.0058/0.0046 EM/F1) | NA
DistilRoBERTa example | COLA | DistilRoBERTa (0.6057 ACC) | RoBERTa-Large (0.6455 ACC) | 0.6187 ACC (0.0130 ACC) | NA
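
The Metrics Improvement values in parentheses appear to be the absolute difference between the distilled student and the baseline student. Under that assumption, the sketch below recomputes both improvements for the MobileNet example; the function name is illustrative.

```python
def metric_improvement(student: float, distilled: float) -> float:
    """Absolute gain of the distilled student over the baseline student."""
    return round(distilled - student, 4)


# MobileNet example on CIFAR-10: baseline MobileNetV2-0.35 reaches 0.7965 ACC.
single_node = metric_improvement(0.7965, 0.8178)   # table reports 0.0213
distributed = metric_improvement(0.7965, 0.8235)   # table reports 0.027

print(single_node, distributed)
```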

Validated ONNX QDQ INT8 Models on Multiple Hardware through ONNX Runtime

Columns, in order: Model (ONNX QDQ), AWS c6i.2xlarge (Intel) CPU Execution Provider, AWS c6a.2xlarge (AMD) CPU Execution Provider, AWS c6g.2xlarge (ARM) CPU Execution Provider, NVIDIA A100 CUDA Execution Provider.
ResNet50 74.76% 68.95% 74.76% 74.75%
BERT-base 85.54% 84.56% 85.54% 84.31%
ResNet50 V1.5 72.20% 67.70% 72.20% 72.29%
MobileNet V2 65.82% 58.56% 65.83% 65.63%
SSD MobileNet V1 22.45% 16.53% 22.45% 22.35%
DistilBERT base MRPC 84.56% 83.82% 84.56% 84.56%
SqueezeNet 56.54% 53.52% 56.54% 56.55%
SSD 18.63% 18.54% 18.63% 18.61%
AlexNet 54.71% 47.06% 54.71% 54.79%
CaffeNet 56.25% 52.35% 56.27% 56.24%
GoogleNet 67.73% 63.56% 67.72% 67.76%
ZFNet 55.86% 45.09% 55.86% 55.89%
Inception V1 67.21% 63.03% 67.20% 67.21%
SSD MobileNet V1 (ONNX Model Zoo) 22.86% 16.94% 22.80% 22.87%
Mobile bert MRPC 85.54% 84.56% 85.54% 85.54%
Roberta base MRPC 89.46% 90.44% 89.71% 89.71%
ResNet50 V1.5 MLPerf 76.14% 72.80% 76.14% 76.17%
VGG16 66.69% 64.25% 66.69% 66.64%
VGG16 (ONNX Model Zoo) 72.31% 69.35% 72.32% 72.34%
MobileNet V3 MLPerf 75.57% 70.78% 75.56% 75.52%
EfficientNet 77.61% 76.52% 77.56% 77.60%
MobileNet V2 (ONNX Model Zoo) 68.51% 62.48% 68.58% 68.48%
ShuffleNet V2 66.12% 58.41% 66.11% 66.11%
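
One way to read this table is to look at how far the percentages for a single model diverge across execution providers. The sketch below computes that spread for two rows, assuming the values are accuracy scores and using the column order given in the header. The names PROVIDERS, ROWS, and accuracy_spread are illustrative only.

```python
# Column labels and two rows copied from the table above.
PROVIDERS = [
    "AWS c6i.2xlarge (Intel) CPU",
    "AWS c6a.2xlarge (AMD) CPU",
    "AWS c6g.2xlarge (ARM) CPU",
    "NVIDIA A100 CUDA",
]
ROWS = {
    "ResNet50": [74.76, 68.95, 74.76, 74.75],
    "BERT-base": [85.54, 84.56, 85.54, 84.31],
}


def accuracy_spread(values):
    """Return (max - min in percentage points, label of the lowest column)."""
    lowest = min(range(len(values)), key=values.__getitem__)
    return round(max(values) - min(values), 2), PROVIDERS[lowest]


for model, values in ROWS.items():
    spread, lowest = accuracy_spread(values)
    print(f"{model}: spread {spread:.2f} points, lowest on {lowest}")
```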