Getting Started

  1. Quick Samples

  2. Feature Matrix

Quick Samples

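# Install Intel Neural Compressor (run in a shell)
pip install neural-compressor-pt

# Quantize a Hugging Face model (run in Python)
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

# Load the floating-point model to be quantized
user_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
# Use the default round-to-nearest (RTN) weight-only quantization settings
quant_config = RTNConfig()
# Prepare the model according to the config, then convert it to a quantized model
prepared_model = prepare(model=user_model, quant_config=quant_config)
quantized_model = convert(model=prepared_model)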
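
RTNConfig in the sample above selects round-to-nearest (RTN) quantization. To give a feel for what that means numerically, here is a minimal pure-Python sketch. It is not the library's implementation: the helper names rtn_quantize and dequantize are hypothetical, and the symmetric scaling over a single group of weights is an assumption made for illustration.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights.

    Illustrative only; hypothetical helper, not part of Intel Neural Compressor.
    """
    qmax = 2 ** (bits - 1) - 1      # largest signed integer level, e.g. 7 for 4 bits
    qmin = -(2 ** (bits - 1))       # smallest signed integer level, e.g. -8 for 4 bits
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude onto qmax; fall back to 1.0 for an all-zero group
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each scaled weight to the nearest integer level and clamp to the valid range
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.7, -0.3, 0.12, 0.0]
q, scale = rtn_quantize(weights, bits=4)
restored = dequantize(q, scale)
print(q)         # integer levels
print(restored)  # approximate reconstruction of the original weights
```

Each restored weight differs from the original by at most half a quantization step (scale / 2), which is the rounding error that more elaborate weight-only algorithms try to reduce.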

Feature Matrix

Intel Neural Compressor 3.X extends the PyTorch and TensorFlow APIs to support compression techniques. The table below gives a quick overview of the APIs available in Intel Neural Compressor 3.X. Intel Neural Compressor 3.X focuses mainly on quantization-related features, especially algorithms that benefit LLM accuracy and inference. It also provides some modules that are common across frameworks. For example, Auto Tune supports accuracy-driven quantization and mixed precision, and Benchmark measures the multi-instance performance of the quantized model.

Overview
  Architecture, Workflow, APIs, LLMs Recipes, Examples

PyTorch Extension APIs
  Overview, Static Quantization, Dynamic Quantization, Smooth Quantization, Weight-Only Quantization, MX Quantization, Mixed Precision

TensorFlow Extension APIs
  Overview, Static Quantization, Smooth Quantization

Other Modules
  Auto Tune, Benchmark

Note: Starting with the 3.0 release, we recommend using the 3.X API. Compression techniques applied during training, such as QAT, pruning, and distillation, are currently available only in the 2.X API.
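
The accuracy-driven tuning that Auto Tune is described as supporting can be pictured as a search over candidate configurations that stops at the first one meeting an accuracy goal. The sketch below shows that general idea in plain Python. It does not use or reproduce the Auto Tune API: the function accuracy_driven_tune, the candidate dictionaries, and the scores are all hypothetical placeholders rather than measured results.

```python
def accuracy_driven_tune(candidates, evaluate, baseline, tolerance=0.01):
    """Return the first candidate whose score is within a relative tolerance of baseline.

    Hypothetical helper for illustration; not part of Intel Neural Compressor.
    """
    threshold = baseline * (1 - tolerance)
    for config in candidates:
        score = evaluate(config)
        if score >= threshold:
            return config, score
    # No candidate satisfied the accuracy goal
    return None, None


# Hypothetical candidates, ordered from most to least aggressive
candidates = [{"bits": 2}, {"bits": 4}, {"bits": 8}]

# Made-up scores standing in for a real evaluation run
placeholder_scores = {2: 0.61, 4: 0.795, 8: 0.80}


def evaluate(config):
    """Stand-in for evaluating a model quantized with `config`."""
    return placeholder_scores[config["bits"]]


best, score = accuracy_driven_tune(candidates, evaluate, baseline=0.80, tolerance=0.01)
print(best, score)
```

Ordering the candidates from most to least aggressive means the search returns the most compressed configuration that still meets the accuracy goal, at the cost of one evaluation per rejected candidate.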