Examples
Intel Extension for Transformers is a powerful toolkit that offers multiple model optimization techniques for Natural Language Processing (NLP) models, including quantization, pruning, distillation, auto distillation, and orchestration. It also provides the Transformers-accelerated Neural Engine, an optimized backend for deploying NLP models, which the reference deployments below demonstrate.
Quantization
Stock PyTorch Examples
Model | Task | Dataset | QAT | No Trainer quantization |
---|---|---|---|---|
textattack/bert-base-uncased-MRPC | text-classification | MRPC | ✔ | |
echarlaix/bert-base-uncased-sst2-acc91.1-d37-hybrid | text-classification | SST-2 | ✔ | |
Intel Extension for PyTorch (IPEX) Examples
Model | Task | Dataset | static |
---|---|---|---|
distilbert-base-uncased-distilled-squad | question-answering | SQuAD | ✔ |
bert-large-uncased-whole-word-masking-finetuned-squad | question-answering | SQuAD | ✔ |
Intel TensorFlow Examples
Model | Task | Dataset | static |
---|---|---|---|
bert-base-cased-finetuned-mrpc | text-classification | MRPC | ✔ |
xlnet-base-cased | text-classification | MRPC | ✔ |
distilgpt2 | language-modeling (CLM) | wikitext | ✔ |
distilbert-base-cased | language-modeling (MLM) | wikitext | ✔ |
Rocketknight1/bert-base-uncased-finetuned-swag | multiple-choice | swag | ✔ |
dslim/bert-base-NER | token-classification | conll2003 | ✔ |
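Each model above corresponds to a runnable script under the matching example directory. As a rough orientation only, a post-training static quantization run with the toolkit's `NLPTrainer` API might look like the sketch below; the import paths, `QuantizationConfig` fields, and the MRPC dataset preparation are assumptions based on the toolkit's documented optimization API and can differ between versions, so the example scripts remain the source of truth.

```python
# Hedged sketch (not one of the example scripts): post-training static quantization
# of a GLUE MRPC classifier with the NLPTrainer API. Import paths
# (intel_extension_for_transformers.transformers vs. older .optimization) and config
# field names are assumptions that may vary between toolkit versions.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from intel_extension_for_transformers.transformers import (
    QuantizationConfig, metrics, objectives,
)
from intel_extension_for_transformers.transformers.trainer import NLPTrainer

name = "textattack/bert-base-uncased-MRPC"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# Tokenize GLUE MRPC; the validation split doubles as the calibration set here.
data = load_dataset("glue", "mrpc").map(
    lambda ex: tokenizer(ex["sentence1"], ex["sentence2"],
                         truncation=True, padding="max_length", max_length=128),
    batched=True,
)

trainer = NLPTrainer(model=model, tokenizer=tokenizer,
                     train_dataset=data["train"], eval_dataset=data["validation"])

q_config = QuantizationConfig(
    approach="PostTrainingStatic",   # "QuantizationAwareTraining" for the QAT rows above
    metrics=[metrics.Metric(name="eval_accuracy", is_relative=True, criterion=0.01)],
    objectives=[objectives.performance],
)
quantized_model = trainer.quantize(quant_config=q_config)
```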
Length Adaptive Transformers
Model Name | Datatype | Optimization Method | Model Size (MB) | Accuracy (F1) | Latency (ms) | GFLOPS** | Speedup (compared with BERT Base) |
---|---|---|---|---|---|---|---|
BERT Base | fp32 | None | 415.47 | 88.58 | 56.56 | 35.3 | 1x |
LA-MiniLM | fp32 | Drop and restore, based on MiniLMv2 | 115.04 | 89.28 | 16.99 | 4.76 | 3.33x |
LA-MiniLM(269, 253, 252, 202, 104, 34)* | fp32 | Evolution search (best config) | 115.04 | 87.76 | 11.44 | 2.49 | 4.94x |
QuaLA-MiniLM | int8 | Quantization, based on LA-MiniLM | 84.85 | 88.85 | 7.84 | 4.76 | 7.21x |
QuaLA-MiniLM(315,251,242,159,142,33)* | int8 | Evolution search (best config) | 84.86 | 87.68 | 6.41 | 2.55 | 8.82x |
Note: * The length configuration applied to the Length Adaptive model.
Note: ** The number of multiply and add operations performed during model inference (GFLOPS obtained with the torchprofile tool).
Data was measured on an Intel Xeon Platinum 8280 Scalable processor. For configuration details, please refer to the examples.
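Since the GFLOPS column is obtained from torchprofile, a minimal sketch of that measurement is shown below; the checkpoint and sequence length are illustrative placeholders rather than the exact benchmark configuration used for the table.

```python
# Minimal sketch: count inference multiply-accumulate operations with torchprofile,
# the tool the GFLOPS column above is derived from. The checkpoint and sequence
# length are illustrative; the exact settings live in the Length Adaptive example.
import torch
from torchprofile import profile_macs
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased").eval()
input_ids = torch.zeros(1, 384, dtype=torch.long)  # batch size 1, sequence length 384

with torch.no_grad():
    macs = profile_macs(model, args=(input_ids,))  # multiply-accumulate count

print(f"{macs / 1e9:.2f} G multiply-accumulate operations")
```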
Pruning
Model | Task | Dataset | Pruning Approach | Pruning Type | Framework |
---|---|---|---|---|---|
distilbert-base-uncased-distilled-squad | question-answering | SQuAD | BasicMagnitude | Unstructured | Stock PyTorch |
bert-large-uncased | question-answering | SQuAD | Group LASSO | Structured | Stock PyTorch |
distilbert-base-uncased-finetuned-sst-2-english | text-classification | SST-2 | BasicMagnitude | Unstructured | Stock PyTorch/ Intel TensorFlow |
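For orientation, a magnitude-pruning run with the `NLPTrainer` API might look like the sketch below; the `PrunerConfig`/`PruningConfig` names and fields and the 90% target sparsity are assumptions based on the toolkit's documented pruning API, not settings copied from the example scripts.

```python
# Hedged sketch: magnitude (BasicMagnitude) pruning of an SST-2 classifier with the
# NLPTrainer API. Config class names, fields, and the 0.9 target sparsity are
# assumptions that may differ from the actual example scripts.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from intel_extension_for_transformers.transformers import (
    PrunerConfig, PruningConfig, metrics,
)
from intel_extension_for_transformers.transformers.trainer import NLPTrainer

name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

data = load_dataset("glue", "sst2").map(
    lambda ex: tokenizer(ex["sentence"], truncation=True,
                         padding="max_length", max_length=128),
    batched=True,
)

trainer = NLPTrainer(model=model, tokenizer=tokenizer,
                     train_dataset=data["train"], eval_dataset=data["validation"])

pruner = PrunerConfig(prune_type="BasicMagnitude", target_sparsity_ratio=0.9)
p_config = PruningConfig(pruner_config=[pruner],
                         metrics=metrics.Metric(name="eval_accuracy"))
pruned_model = trainer.prune(pruning_config=p_config)
```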
Distillation
Knowledge Distillation
Student Model | Teacher Model | Task | Dataset |
---|---|---|---|
distilbert-base-uncased | bert-base-uncased-SST-2 | text-classification | SST-2 |
distilbert-base-uncased | bert-base-uncased-QNLI | text-classification | QNLI |
distilbert-base-uncased | bert-base-uncased-QQP | text-classification | QQP |
distilbert-base-uncased | bert-base-uncased-MNLI-v1 | text-classification | MNLI |
distilbert-base-uncased | bert-base-uncased-squad-v1 | question-answering | SQuAD |
TinyBERT_General_4L_312D | bert-base-uncased-MNLI-v1 | text-classification | MNLI |
distilroberta-base | roberta-large-cola-krishna2020 | text-classification | CoLA |
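A knowledge distillation run pairing one of the student/teacher rows above might look like the sketch below; `DistillationConfig` and the `distill` call follow the toolkit's documented API, but the exact fields and the `textattack/bert-base-uncased-SST-2` teacher checkpoint name are assumptions to verify against the example scripts.

```python
# Hedged sketch: distill a BERT-base SST-2 teacher into distilbert-base-uncased with
# the NLPTrainer API. The teacher checkpoint (textattack/bert-base-uncased-SST-2)
# and DistillationConfig fields are assumptions; consult the example script for the
# settings actually used.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from intel_extension_for_transformers.transformers import DistillationConfig, metrics
from intel_extension_for_transformers.transformers.trainer import NLPTrainer

student = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
teacher = AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-SST-2")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

data = load_dataset("glue", "sst2").map(
    lambda ex: tokenizer(ex["sentence"], truncation=True,
                         padding="max_length", max_length=128),
    batched=True,
)

trainer = NLPTrainer(model=student, tokenizer=tokenizer,
                     train_dataset=data["train"], eval_dataset=data["validation"])

d_config = DistillationConfig(metrics=metrics.Metric(name="eval_accuracy"))
distilled_model = trainer.distill(distillation_config=d_config, teacher_model=teacher)
```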
Orchestrate
Model | Task | Dataset | Distillation Teacher | Pruning Approach | Pruning Type |
---|---|---|---|---|---|
Intel/distilbert-base-uncased-sparse-90-unstructured-pruneofa | question-answering | SQuAD | distilbert-base-uncased-distilled-squad | PatternLock | Unstructured |
Intel/distilbert-base-uncased-sparse-90-unstructured-pruneofa | question-answering | SQuAD | distilbert-base-uncased-distilled-squad | BasicMagnitude | Unstructured |
Intel/distilbert-base-uncased-sparse-90-unstructured-pruneofa | text-classification | SST-2 | distilbert-base-uncased-finetuned-sst-2-english | PatternLock | Unstructured |
Intel/distilbert-base-uncased-sparse-90-unstructured-pruneofa | text-classification | SST-2 | distilbert-base-uncased-finetuned-sst-2-english | BasicMagnitude | Unstructured |
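Orchestrate combines several optimizations in a single pass, here pattern-lock pruning during distillation fine-tuning as in the table above. The sketch below is a rough illustration only; the `orchestrate_optimizations` method name, the config list, and the SST-2 setup are assumptions based on the toolkit's documented API and should be verified against the example scripts.

```python
# Hedged sketch: fine-tune the sparse Prune-Once-for-All checkpoint on SST-2 while
# locking its sparsity pattern (PatternLock) and distilling from a dense teacher,
# mirroring one row of the table above. orchestrate_optimizations and the config
# classes follow the toolkit's documented API but may differ between versions.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from intel_extension_for_transformers.transformers import (
    DistillationConfig, PrunerConfig, PruningConfig, metrics,
)
from intel_extension_for_transformers.transformers.trainer import NLPTrainer

student = AutoModelForSequenceClassification.from_pretrained(
    "Intel/distilbert-base-uncased-sparse-90-unstructured-pruneofa", num_labels=2)
teacher = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

data = load_dataset("glue", "sst2").map(
    lambda ex: tokenizer(ex["sentence"], truncation=True,
                         padding="max_length", max_length=128),
    batched=True,
)

trainer = NLPTrainer(model=student, tokenizer=tokenizer,
                     train_dataset=data["train"], eval_dataset=data["validation"])

metric = metrics.Metric(name="eval_accuracy")
p_config = PruningConfig(pruner_config=[PrunerConfig(prune_type="PatternLock")],
                         metrics=metric)
d_config = DistillationConfig(metrics=metric)

optimized_model = trainer.orchestrate_optimizations(config_list=[p_config, d_config],
                                                    teacher_model=teacher)
```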
Reference Deployment on Neural Engine
Dense Reference Deployment on Neural Engine
Model | Task | Dataset | INT8 | BF16 |
---|---|---|---|---|
bert-large-uncased-whole-word-masking-finetuned-squad | question-answering | SQuAD | ✔ | ✔ |
bhadresh-savani/distilbert-base-uncased-emotion | text-classification | emotion | ✔ | ✔ |
textattack/bert-base-uncased-MRPC | text-classification | MRPC | ✔ | ✔ |
textattack/distilbert-base-uncased-MRPC | text-classification | MRPC | ✔ | ✔ |
Intel/roberta-base-mrpc | text-classification | MRPC | ✔ | ✔ |
M-FAC/bert-mini-finetuned-mrpc | text-classification | MRPC | ✔ | ✔ |
gchhablani/bert-base-cased-finetuned-mrpc | text-classification | MRPC | ✔ | ✔ |
distilbert-base-uncased-finetuned-sst-2-english | text-classification | SST-2 | ✔ | ✔ |
philschmid/MiniLM-L6-H384-uncased-sst2 | text-classification | SST-2 | ✔ | ✔ |
moshew/bert-mini-sst2-distilled | text-classification | SST-2 | ✔ | ✔ |
Sparse Reference Deployment on Neural Engine
Model | Task | Dataset | INT8 | BF16 |
---|---|---|---|---|
Intel/distilbert-base-uncased-squadv1.1-sparse-80-1x4-block-pruneofa | question-answering | SQuAD | ✔ | WIP :star: |
Intel/bert-mini-sst2-distilled-sparse-90-1X4-block | text-classification | SST-2 | ✔ | WIP :star: |
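The Neural Engine consumes a model exported by the corresponding example (typically an INT8 ONNX file or Neural Engine IR) and runs it through a compiled graph. The sketch below is a rough illustration based on the Neural Engine's `compile` API; the module path, the `./model.onnx` placeholder, and the input ordering and dtypes are assumptions that depend on how each example exports its model.

```python
# Hedged sketch: run an exported MRPC model on the Transformers-accelerated Neural
# Engine. The compile() module path, the "./model.onnx" placeholder, and the input
# order/dtypes are assumptions; each example's README shows the exact export and
# inference steps for its model.
import numpy as np
from transformers import AutoTokenizer
from intel_extension_for_transformers.backends.neural_engine.compile import compile

tokenizer = AutoTokenizer.from_pretrained("textattack/bert-base-uncased-MRPC")
enc = tokenizer("He said he is fine.", "He says he is fine.",
                padding="max_length", max_length=128, return_tensors="np")

graph = compile("./model.onnx")  # placeholder path to the exported INT8 model / IR
outputs = graph.inference([enc["input_ids"].astype(np.int32),
                           enc["token_type_ids"].astype(np.int32),
                           enc["attention_mask"].astype(np.int32)])
```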
Early-Exit
Model | Task | Dataset | Early-Exit Type |
---|---|---|---|
bert-base-uncased | text-classification | MNLI | SWEET notebook |
philschmid/tiny-bert-sst2-distilled, textattack/roberta-base-SST-2 | text-classification | SST-2 | TangoBERT notebook |