# LLMs Quantization Recipes
Intel® Neural Compressor supports advanced large language model (LLM) quantization technologies, including SmoothQuant (SQ) and Weight-Only Quantization (WOQ),
and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch,
Intel® Extension for PyTorch, and Intel® Extension for Transformers.
This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM with less than 1% accuracy loss.
Notes:
- The quantization algorithms are provided by Intel® Neural Compressor, and the evaluation functions are provided by Intel® Extension for Transformers.
- The model list is continuously updated; expect to find more LLMs in the future.
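As background, SmoothQuant makes INT8 activation quantization feasible by migrating activation outliers into the weights through a per-channel scale, leaving the matmul mathematically unchanged. The following NumPy sketch illustrates only the core identity; all names are illustrative and this is not the Intel® Neural Compressor API:

```python
import numpy as np

# Illustrative sketch of the SmoothQuant idea: divide activations by a
# per-channel scale s and multiply the matching weight rows by s, so the
# product X @ W is preserved while activation outliers shrink.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)) * np.array([1, 50, 1, 1, 1, 1, 1, 1.0])  # channel 1 is an outlier
W = rng.normal(size=(8, 3))

alpha = 0.5  # migration strength; SQ recipes tune this per model
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s           # activation outliers shrink
W_smooth = W * s[:, None]  # weights absorb the scale

# The matmul result is unchanged, but X_smooth has a much smaller
# dynamic range than X, which makes INT8 quantization far easier.
assert np.allclose(X @ W, X_smooth @ W_smooth)
```

In practice the recipes below amount to per-model choices of `alpha` (and fallbacks for layers that still lose accuracy).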
## Large Language Models Recipes
| Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
|---|---|---|---|
EleutherAI/gpt-j-6b | ✔ | ✔ | ✔ |
facebook/opt-1.3b | ✔ | ✔ | ✔ |
facebook/opt-30b | ✔ | ✔ | ✔ |
meta-llama/Llama-2-7b-hf | WIP | ✔ | ✔ |
meta-llama/Llama-2-13b-hf | WIP | ✔ | ✔ |
meta-llama/Llama-2-70b-hf | ✔ | ✔ | ✔ |
tiiuae/falcon-7b | ✔ | ✔ | ✔ |
tiiuae/falcon-40b | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-7B-Chat | ✔ | ✔ | ✔ |
bigscience/bloom-1b7 | ✔ | ✔ | ✔ |
databricks/dolly-v2-12b | ✖ | ✔ | ✖ |
EleutherAI/gpt-neox-20b | ✖ | ✔ | ✔ |
mistralai/Mistral-7B-v0.1 | ✖ | ✔ | ✔ |
THUDM/chatglm2-6b | WIP | ✔ | WIP |
THUDM/chatglm3-6b | WIP | ✔ | ✔ |
Detailed recipes can be found HERE.
Notes:
- This model list comes from IPEX.
- The WIP recipes will be published soon.
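For reference, WOQ INT4 recipes are built on group-wise quantization schemes, ranging from plain round-to-nearest (RTN) to more advanced algorithms such as GPTQ and AutoRound. The sketch below shows symmetric group-wise INT4 RTN in NumPy; it is a minimal illustration of the scheme, not the Intel® Neural Compressor API, and all names are hypothetical:

```python
import numpy as np

def rtn_int4_groupwise(w, group_size=128):
    """Symmetric round-to-nearest INT4 quantization with per-group scales.

    Illustrative sketch: each group of `group_size` weights along the last
    axis shares one scale mapping its max magnitude to the INT4 limit (7),
    values are rounded to signed 4-bit levels [-8, 7], then dequantized.
    """
    out = np.empty_like(w)
    for start in range(0, w.shape[-1], group_size):
        g = w[..., start:start + group_size]
        scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0
        scale = np.where(scale == 0, 1.0, scale)        # guard all-zero groups
        q = np.clip(np.round(g / scale), -8, 7)         # 4-bit signed levels
        out[..., start:start + group_size] = q * scale  # dequantize
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 256)).astype(np.float32)
w_q = rtn_int4_groupwise(w, group_size=128)
# Worst-case round-off per group is half a quantization step (scale / 2).
assert np.abs(w - w_q).max() <= np.abs(w).max() / 14 + 1e-6
```

Smaller group sizes reduce this round-off error at the cost of storing more scales, which is why group size is a key knob in the published recipes.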
## Large Language Models Accuracy
Accuracy (ACC) is measured on the lambada_openai task; each Ratio column is the quantized accuracy divided by the FP32 accuracy.

| Model | FP32 ACC | SQ INT8 ACC | SQ INT8 Ratio | WOQ INT8 ACC | WOQ INT8 Ratio | WOQ INT4 GPTQ ACC | WOQ INT4 GPTQ Ratio | WOQ INT4 AutoRound ACC | WOQ INT4 AutoRound Ratio |
|---|---|---|---|---|---|---|---|---|---|
baichuan-inc/Baichuan-13B-Chat | 67.57% | 67.86% | 1.0043 | 67.55% | 0.9997 | 67.46% | 0.9984 | N/A | N/A |
baichuan-inc/Baichuan2-13B-Chat | 71.51% | 75.51% | 1.0559 | 71.57% | 1.0008 | 71.45% | 0.9992 | 70.87% | 0.9911 |
baichuan-inc/Baichuan2-7B-Chat | 67.67% | 67.51% | 0.9976 | 67.61% | 0.9991 | 68.08% | 1.0061 | 67.18% | 0.9928 |
bigscience/bloom-1b7 | 46.34% | 47.97% | 1.0352 | 46.21% | 0.9972 | 47.00% | 1.0142 | N/A | N/A |
databricks/dolly-v2-12b | 64.35% | N/A | N/A | 63.92% | 0.9933 | N/A | N/A | N/A | N/A |
EleutherAI/gpt-j-6b | 68.31% | 68.00% | 0.9955 | 68.27% | 0.9994 | 68.23% | 0.9988 | 67.40% | 0.9867 |
EleutherAI/gpt-neox-20b | 72.33% | N/A | N/A | 72.29% | 0.9994 | 72.15% | 0.9975 | N/A | N/A |
facebook/opt-1.3b | 57.89% | 57.35% | 0.9907 | 58.12% | 1.0040 | 58.01% | 1.0021 | N/A | N/A |
facebook/opt-30b | 71.49% | 71.51% | 1.0003 | 71.53% | 1.0006 | 71.82% | 1.0046 | 71.43% | 0.9992 |
meta-llama/Llama-2-13b-hf | 76.77% | N/A | N/A | 76.89% | 1.0016 | 76.96% | 1.0025 | N/A | N/A |
meta-llama/Llama-2-70b-hf | 79.64% | 79.53% | 0.9986 | 79.62% | 0.9997 | 80.05% | 1.0051 | N/A | N/A |
meta-llama/Llama-2-7b-hf | 73.92% | N/A | N/A | 73.90% | 0.9997 | 73.51% | 0.9945 | N/A | N/A |
mistralai/Mistral-7B-v0.1 | 75.90% | N/A | N/A | 75.80% | 0.9987 | 75.37% | 0.9930 | 75.82% | 0.9989 |
THUDM/chatglm2-6b | 53.23% | N/A | N/A | 53.00% | 0.9957 | N/A | N/A | N/A | N/A |
THUDM/chatglm3-6b | 59.09% | N/A | N/A | 59.03% | 0.9990 | N/A | N/A | 58.59% | 0.9915 |
tiiuae/falcon-40b | 77.22% | 77.26% | 1.0005 | 77.18% | 0.9995 | 77.97% | 1.0097 | N/A | N/A |
tiiuae/falcon-7b | 74.67% | 76.17% | 1.0201 | 74.73% | 1.0008 | 74.79% | 1.0016 | N/A | N/A |
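As a sanity check, the Ratio values can be reproduced directly from the ACC columns (quantized accuracy divided by FP32 accuracy), using the first row of the table as an example:

```python
# Values taken from the Baichuan-13B-Chat row of the accuracy table.
fp32_acc = 0.6757     # FP32 ACC
sq_int8_acc = 0.6786  # SQ INT8 ACC

ratio = round(sq_int8_acc / fp32_acc, 4)
print(ratio)  # 1.0043, matching the table
```

A ratio of at least 0.99 corresponds to the "less than 1% accuracy loss" target stated above.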