## Getting Started

### Quick Samples
```bash
# Install Intel Neural Compressor
pip install neural-compressor-pt
```

```python
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

user_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
quant_config = RTNConfig()
prepared_model = prepare(model=user_model, quant_config=quant_config)
quantized_model = convert(model=prepared_model)
```
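`RTNConfig()` above uses the default round-to-nearest settings. As a minimal sketch of the same prepare/convert workflow with explicit weight-only settings, the snippet below passes the commonly documented `RTNConfig` parameters (`bits`, `group_size`, `use_sym`); the specific values are illustrative assumptions, not recommended defaults.

```python
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

# 4-bit symmetric weight-only quantization with a group size of 32
# (illustrative values; tune per model and accuracy target).
quant_config = RTNConfig(bits=4, group_size=32, use_sym=True)
prepared_model = prepare(model=model, quant_config=quant_config)
quantized_model = convert(model=prepared_model)
```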
## Feature Matrix
Intel Neural Compressor 3.X extends the PyTorch and TensorFlow APIs to support compression techniques. The table below provides a quick overview of the APIs available in Intel Neural Compressor 3.X, which focuses mainly on quantization features, especially algorithms that benefit LLM accuracy and inference. It also provides common modules that work across frameworks: Auto-tune supports accuracy-driven quantization and mixed precision, and Benchmark measures the performance of a quantized model across multiple instances (a minimal Auto-tune sketch follows the table).
| Category | Contents |
|---|---|
| Overview | Architecture, Workflow, APIs, LLMs Recipes, Examples |
| PyTorch Extension APIs | Overview, Static Quantization, Dynamic Quantization, Smooth Quantization, Weight-Only Quantization, MX Quantization, Mixed Precision |
| TensorFlow Extension APIs | Overview, Static Quantization, Smooth Quantization |
| Other Modules | Auto Tune, Benchmark |
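As referenced above, here is a minimal sketch of accuracy-driven tuning with the Auto Tune module, assuming the `autotune` and `TuningConfig` entry points of the 3.X PyTorch API; the config set and the evaluation function body are illustrative placeholders.

```python
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

def eval_fn(model) -> float:
    # Placeholder metric: replace with real accuracy on your validation set.
    # Auto-tune calls this for each candidate config to decide which to keep.
    return 1.0

# Try several weight-only bit widths; tuning stops at the first config
# that satisfies the accuracy criterion.
tune_config = TuningConfig(config_set=RTNConfig(bits=[4, 6, 8]))
best_model = autotune(model=model, tune_config=tune_config, eval_fn=eval_fn)
```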
Note: From the 3.0 release onward, we recommend using the 3.X API. Compression techniques applied during training, such as QAT, pruning, and distillation, are currently available only in the 2.X API.