## Getting Started

### Quick Samples
```bash
# Install Intel Neural Compressor
pip install neural-compressor-pt
```

```python
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

user_model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")
quant_config = RTNConfig()
prepared_model = prepare(model=user_model, quant_config=quant_config)
quantized_model = convert(model=prepared_model)
```
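`RTNConfig()` above uses the default round-to-nearest settings. As a minimal sketch of the same prepare/convert workflow with explicit weight-only settings, the snippet below passes the commonly documented `RTNConfig` parameters (`bits`, `group_size`, `use_sym`); the specific values are illustrative assumptions, not recommended defaults.

```python
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

# 4-bit symmetric weight-only quantization with a group size of 32
# (illustrative values; tune per model and accuracy target).
quant_config = RTNConfig(bits=4, group_size=32, use_sym=True)
prepared_model = prepare(model=model, quant_config=quant_config)
quantized_model = convert(model=prepared_model)
```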
## Feature Matrix
Intel Neural Compressor 3.X extends the PyTorch and TensorFlow APIs to support compression techniques. The table below provides a quick overview of the APIs available in Intel Neural Compressor 3.X, which focuses mainly on quantization features, especially algorithms that benefit LLM accuracy and inference. It also provides common modules that work across frameworks: Auto-tune supports accuracy-driven quantization and mixed precision, and Benchmark measures the performance of a quantized model across multiple instances (a minimal Auto-tune sketch follows the table).
| Category | Contents |
|---|---|
| Overview | Architecture, Workflow, APIs, LLMs Recipes, Examples |
| PyTorch Extension APIs | Overview, Static Quantization, Dynamic Quantization, Smooth Quantization, Weight-Only Quantization, MX Quantization, Mixed Precision |
| TensorFlow Extension APIs | Overview, Static Quantization, Smooth Quantization |
| Other Modules | Auto Tune, Benchmark |
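As referenced above, here is a minimal sketch of accuracy-driven tuning with the Auto Tune module, assuming the `autotune` and `TuningConfig` entry points of the 3.X PyTorch API; the config set and the evaluation function body are illustrative placeholders.

```python
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, TuningConfig, autotune

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

def eval_fn(model) -> float:
    # Placeholder metric: replace with real accuracy on your validation set.
    # Auto-tune calls this for each candidate config to decide which to keep.
    return 1.0

# Try several weight-only bit widths; tuning stops at the first config
# that satisfies the accuracy criterion.
tune_config = TuningConfig(config_set=RTNConfig(bits=[4, 6, 8]))
best_model = autotune(model=model, tune_config=tune_config, eval_fn=eval_fn)
```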
Note: From the 3.0 release onward, we recommend using the 3.X API. Compression techniques applied during training, such as QAT, pruning, and distillation, are currently available only in the 2.X API.