Intel® Extension for Transformers: Accelerating Transformer-based Models on Intel Platforms
Intel® Extension for Transformers is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and is particularly effective on 4th Gen Intel® Xeon® Scalable processors (codenamed Sapphire Rapids). The toolkit provides the following key features and examples:
Seamless user experience of model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
Advanced software optimizations and a unique compression-aware runtime (introduced in the NeurIPS 2022 papers Fast DistilBERT on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and the NeurIPS 2021 paper Prune Once for All: Sparse Pre-Trained Language Models)
Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NeoX, BLOOM-176B, T5, and Flan-T5, and end-to-end workflows such as SetFit-based text classification and document-level sentiment analysis (DLSA)
NeuralChat, a custom chatbot trained on Intel CPUs through parameter-efficient fine-tuning (PEFT) on domain knowledge
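To make the compression features above concrete: int8 quantization, one of the core techniques the toolkit (via Intel® Neural Compressor) automates, maps floating-point weights to 8-bit integers via a scale and zero point. A minimal stdlib sketch of affine (asymmetric) quantization follows; it illustrates the arithmetic only and is not the toolkit's actual API.

```python
def quantize(values, num_bits=8):
    """Affine quantization: map floats to unsigned ints in [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid zero scale for constant inputs
    zero_point = round(qmin - lo / scale)     # integer that represents 0.0
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from quantized ints."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, s, z = quantize(weights)        # q == [0, 64, 128, 192, 255]
restored = dequantize(q, s, z)     # close to the original weights
```

The round trip loses at most one quantization step of precision, which is why int8 inference preserves accuracy well while shrinking weights 4x versus fp32.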
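NeuralChat's fine-tuning relies on PEFT, which trains a small number of extra parameters instead of the full model. A stdlib sketch of the idea behind one common PEFT method, low-rank adaptation (LoRA), is below; the names and shapes are illustrative assumptions, not the toolkit's API.

```python
# LoRA idea: keep the d x d base weights W frozen and train a rank-r
# update B @ A (d x r times r x d), cutting trainable parameters
# from d*d down to 2*d*r.

def matmul(a, b):
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def add(x, y):
    return [[xi + yi for xi, yi in zip(rx, ry)] for rx, ry in zip(x, y)]

d, r = 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base weights
B = [[0.1] for _ in range(d)]       # d x r, trainable
A = [[0.2, 0.0, 0.0, 0.2]]          # r x d, trainable
W_adapted = add(W, matmul(B, A))    # effective weights: W + B @ A

full_params = d * d                 # 16 if every weight were trainable
lora_params = d * r + r * d         # only 8 trainable parameters with LoRA
```

The parameter saving grows with model size: for a 4096 x 4096 layer and rank 8, LoRA trains roughly 65K parameters instead of 16M, which is what makes fine-tuning a chatbot on CPUs practical.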
Selected Publications/Events
Blog published on Medium: Create Your Own Custom Chatbot (April 2023)
Blog published on Intel Communities: Intel® Xeon® Processors Are Still the Only CPU With MLPerf Results, Raising the Bar By 5x (April 2023)
Blog published on Medium: MLefficiency — Optimizing transformer models for efficiency (Dec 2022)
NeurIPS’2022: Fast DistilBERT on CPUs (Nov 2022)
NeurIPS’2022: QuaLA-MiniLM: a Quantized Length Adaptive MiniLM (Nov 2022)
Blog published by Cohere: Top NLP Papers—November 2022 (Nov 2022)
Blog published by Alibaba: Deep learning inference optimization for Address Purification (Aug 2022)
NeurIPS’2021: Prune Once for All: Sparse Pre-Trained Language Models (Nov 2021)