# LLMs Quantization Recipes

Intel® Neural Compressor supports advanced large language model (LLM) quantization technologies, including SmoothQuant (SQ) and Weight-Only Quantization (WOQ), and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch, Intel® Extension for PyTorch, and Intel® Extension for Transformers.
This document publishes the specific recipes we achieved for these popular LLMs, helping users quickly obtain an optimized LLM with accuracy loss kept within 1%.
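
For reference, below is a minimal sketch of how a SmoothQuant INT8 recipe can be applied through the Intel® Neural Compressor 2.x PyTorch API; the model choice, the toy calibration dataloader, and `alpha=0.5` are illustrative assumptions, while the published recipes tune these settings per model.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model_name = "EleutherAI/gpt-j-6b"  # illustrative pick from the table below
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Toy calibration loader; real recipes calibrate on a proper dataset slice,
# not a single hard-coded sentence.
class CalibDataloader:
    batch_size = 1
    def __iter__(self):
        yield tokenizer("Calibration sample text.", return_tensors="pt")["input_ids"]

# SmoothQuant is switched on via the `recipes` field; alpha controls how much
# of the activation-outlier difficulty is migrated into the weights.
conf = PostTrainingQuantConfig(
    backend="ipex",  # execute through Intel® Extension for PyTorch
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}},
)
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader())
q_model.save("./saved_sq_int8")
```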

## Large Language Models Recipes

| Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
|---|:-:|:-:|:-:|
| EleutherAI/gpt-j-6b | ✔ | ✔ | ✔ |
| facebook/opt-1.3b | ✔ | ✔ | ✔ |
| facebook/opt-30b | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-7b-hf | WIP | ✔ | ✔ |
| meta-llama/Llama-2-13b-hf | WIP | ✔ | ✔ |
| meta-llama/Llama-2-70b-hf | ✔ | ✔ | ✔ |
| tiiuae/falcon-7b | ✔ | ✔ | ✔ |
| tiiuae/falcon-40b | ✔ | ✔ | ✔ |
| baichuan-inc/Baichuan-13B-Chat | ✔ | ✔ | ✔ |
| baichuan-inc/Baichuan2-13B-Chat | ✔ | ✔ | ✔ |
| baichuan-inc/Baichuan2-7B-Chat | ✔ | ✔ | ✔ |
| bigscience/bloom-1b7 | ✔ | ✔ | ✔ |
| databricks/dolly-v2-12b | | ✔ | |
| EleutherAI/gpt-neox-20b | | ✔ | ✔ |
| mistralai/Mistral-7B-v0.1 | | ✔ | ✔ |
| THUDM/chatglm2-6b | WIP | ✔ | WIP |
| THUDM/chatglm3-6b | WIP | ✔ | ✔ |

Detailed recipes can be found HERE.

Notes:

- This model list comes from Intel® Extension for PyTorch (IPEX).
- The WIP recipes will be published soon.
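
The WOQ columns correspond to 4-bit and 8-bit weight-only quantization. The sketch below shows one plausible way to request WOQ INT4 with the GPTQ algorithm through the same Intel® Neural Compressor 2.x API; the operator pattern, bits, group size, and scheme here are illustrative defaults, not the tuned per-model values of the published recipes.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model_name = "facebook/opt-1.3b"  # illustrative pick from the table above
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Toy calibration loader; GPTQ uses calibration data to compensate weights.
class CalibDataloader:
    batch_size = 1
    def __iter__(self):
        yield tokenizer("Calibration sample text.", return_tensors="pt")["input_ids"]

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # match every quantizable op type (mainly Linear layers)
            "weight": {
                "bits": 4,          # the WOQ INT4 column of the table
                "group_size": 128,  # per-group scales; recipes tune this
                "scheme": "sym",
                "algorithm": "GPTQ",  # "RTN" and "AWQ" are other WOQ options
            },
        },
    },
)
q_model = quantization.fit(model, conf, calib_dataloader=CalibDataloader())
```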

## Large Language Models Accuracy

All accuracy (ACC) numbers are measured on the lambada_openai task.

| Model | FP32 ACC | SQ INT8 ACC | Ratio | WOQ INT8 ACC | Ratio | WOQ INT4 GPTQ ACC | Ratio | WOQ INT4 AutoRound ACC | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| baichuan-inc/Baichuan-13B-Chat | 67.57% | 67.86% | 1.0043 | 67.55% | 0.9997 | 67.46% | 0.9984 | N/A | N/A |
| baichuan-inc/Baichuan2-13B-Chat | 71.51% | 75.51% | 1.0559 | 71.57% | 1.0008 | 71.45% | 0.9992 | 70.87% | 0.9911 |
| baichuan-inc/Baichuan2-7B-Chat | 67.67% | 67.51% | 0.9976 | 67.61% | 0.9991 | 68.08% | 1.0061 | 67.18% | 0.9928 |
| bigscience/bloom-1b7 | 46.34% | 47.97% | 1.0352 | 46.21% | 0.9972 | 47.00% | 1.0142 | N/A | N/A |
| databricks/dolly-v2-12b | 64.35% | N/A | N/A | 63.92% | 0.9933 | N/A | N/A | N/A | N/A |
| EleutherAI/gpt-j-6b | 68.31% | 68.00% | 0.9955 | 68.27% | 0.9994 | 68.23% | 0.9988 | 67.40% | 0.9867 |
| EleutherAI/gpt-neox-20b | 72.33% | N/A | N/A | 72.29% | 0.9994 | 72.15% | 0.9975 | N/A | N/A |
| facebook/opt-1.3b | 57.89% | 57.35% | 0.9907 | 58.12% | 1.0040 | 58.01% | 1.0021 | N/A | N/A |
| facebook/opt-30b | 71.49% | 71.51% | 1.0003 | 71.53% | 1.0006 | 71.82% | 1.0046 | 71.43% | 0.9992 |
| meta-llama/Llama-2-13b-hf | 76.77% | N/A | N/A | 76.89% | 1.0016 | 76.96% | 1.0025 | N/A | N/A |
| meta-llama/Llama-2-70b-hf | 79.64% | 79.53% | 0.9986 | 79.62% | 0.9997 | 80.05% | 1.0051 | N/A | N/A |
| meta-llama/Llama-2-7b-hf | 73.92% | N/A | N/A | 73.90% | 0.9997 | 73.51% | 0.9945 | N/A | N/A |
| mistralai/Mistral-7B-v0.1 | 75.90% | N/A | N/A | 75.80% | 0.9987 | 75.37% | 0.9930 | 75.82% | 0.9989 |
| THUDM/chatglm2-6b | 53.23% | N/A | N/A | 53.00% | 0.9957 | N/A | N/A | N/A | N/A |
| THUDM/chatglm3-6b | 59.09% | N/A | N/A | 59.03% | 0.9990 | N/A | N/A | 58.59% | 0.9915 |
| tiiuae/falcon-40b | 77.22% | 77.26% | 1.0005 | 77.18% | 0.9995 | 77.97% | 1.0097 | N/A | N/A |
| tiiuae/falcon-7b | 74.67% | 76.17% | 1.0201 | 74.73% | 1.0008 | 74.79% | 1.0016 | N/A | N/A |
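
Each Ratio column is the quantized accuracy divided by the FP32 accuracy, so a value of 0.99 or higher meets the 1% relative accuracy-loss target stated above; the snippet below reproduces the first row as a sanity check.

```python
# Ratio = quantized ACC / FP32 ACC; >= 0.99 keeps relative loss within 1%.
fp32_acc, sq_int8_acc = 0.6757, 0.6786  # baichuan-inc/Baichuan-13B-Chat row
print(f"{sq_int8_acc / fp32_acc:.4f}")  # -> 1.0043, matching the table
```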