## LLMs Quantization Recipes

Intel® Neural Compressor supports advanced quantization technologies for large language models (LLMs), including SmoothQuant (SQ) and Weight-Only Quantization (WOQ), and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with [PyTorch](https://pytorch.org/), [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch), and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers). This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM with at most 1% accuracy loss.

> Notes:
>
> - The quantization algorithms are provided by [Intel® Neural Compressor](https://github.com/intel/neural-compressor), and the evaluation functions are provided by [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).
> - The model list is continuously updated; expect to find more LLMs here in the future.
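As a rough illustration of the SQ recipe family (not the tuned per-model recipes this document publishes), SmoothQuant can be enabled through Intel® Neural Compressor's 2.x post-training quantization API. The `alpha` value, model, and calibration dataloader below are placeholders:

```python
# Sketch only: enabling SmoothQuant via Intel Neural Compressor's 2.x API.
# alpha=0.5 is a generic default; the published recipes tune it per model.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {"alpha": 0.5},  # placeholder, recipe-specific
    }
)

# fp32_model and calib_dataloader are placeholders for a user-supplied
# PyTorch model and calibration dataloader:
# q_model = quantization.fit(fp32_model, conf, calib_dataloader=calib_dataloader)
# q_model.save("./saved_results")
```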
## Large Language Models Recipes

| Models                          | SQ INT8 | WOQ INT8 | WOQ INT4 |
| :-----------------------------: | :-----: | :------: | :------: |
| EleutherAI/gpt-j-6b             | ✔       | ✔        | ✔        |
| facebook/opt-1.3b               | ✔       | ✔        | ✔        |
| facebook/opt-30b                | ✔       | ✔        | ✔        |
| meta-llama/Llama-2-7b-hf        | WIP     | ✔        | ✔        |
| meta-llama/Llama-2-13b-hf       | WIP     | ✔        | ✔        |
| meta-llama/Llama-2-70b-hf       | ✔       | ✔        | ✔        |
| tiiuae/falcon-7b                | ✔       | ✔        | ✔        |
| tiiuae/falcon-40b               | ✔       | ✔        | ✔        |
| baichuan-inc/Baichuan-13B-Chat  | ✔       | ✔        | ✔        |
| baichuan-inc/Baichuan2-13B-Chat | ✔       | ✔        | ✔        |
| baichuan-inc/Baichuan2-7B-Chat  | ✔       | ✔        | ✔        |
| bigscience/bloom-1b7            | ✔       | ✔        | ✔        |
| databricks/dolly-v2-12b         | ✖       | ✔        | ✖        |
| EleutherAI/gpt-neox-20b         | ✖       | ✔        | ✔        |
| mistralai/Mistral-7B-v0.1       | ✖       | ✔        | ✔        |
| THUDM/chatglm2-6b               | WIP     | ✔        | WIP      |
| THUDM/chatglm3-6b               | WIP     | ✔        | ✔        |

**Detailed recipes can be found [HERE](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/llm_quantization_recipes.html).**

> Notes:
>
> - This model list comes from [IPEX](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/llm.html).
> - The WIP recipes will be published soon.

## Large Language Models Accuracy
All accuracies are measured on the lambada_openai task; each "Ratio" column is the quantized accuracy divided by the FP32 accuracy.

| Model | FP32 ACC | SQ INT8 ACC | SQ INT8 Ratio | WOQ INT8 ACC | WOQ INT8 Ratio | WOQ INT4 GPTQ ACC | WOQ INT4 GPTQ Ratio | WOQ INT4 AutoRound ACC | WOQ INT4 AutoRound Ratio |
| :--- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| baichuan-inc/Baichuan-13B-Chat | 67.57% | 67.86% | 1.0043 | 67.55% | 0.9997 | 67.46% | 0.9984 | N/A | N/A |
| baichuan-inc/Baichuan2-13B-Chat | 71.51% | 75.51% | 1.0559 | 71.57% | 1.0008 | 71.45% | 0.9992 | 70.87% | 0.9911 |
| baichuan-inc/Baichuan2-7B-Chat | 67.67% | 67.51% | 0.9976 | 67.61% | 0.9991 | 68.08% | 1.0061 | 67.18% | 0.9928 |
| bigscience/bloom-1b7 | 46.34% | 47.97% | 1.0352 | 46.21% | 0.9972 | 47.00% | 1.0142 | N/A | N/A |
| databricks/dolly-v2-12b | 64.35% | N/A | N/A | 63.92% | 0.9933 | N/A | N/A | N/A | N/A |
| EleutherAI/gpt-j-6b | 68.31% | 68.00% | 0.9955 | 68.27% | 0.9994 | 68.23% | 0.9988 | 67.40% | 0.9867 |
| EleutherAI/gpt-neox-20b | 72.33% | N/A | N/A | 72.29% | 0.9994 | 72.15% | 0.9975 | N/A | N/A |
| facebook/opt-1.3b | 57.89% | 57.35% | 0.9907 | 58.12% | 1.0040 | 58.01% | 1.0021 | N/A | N/A |
| facebook/opt-30b | 71.49% | 71.51% | 1.0003 | 71.53% | 1.0006 | 71.82% | 1.0046 | 71.43% | 0.9992 |
| meta-llama/Llama-2-13b-hf | 76.77% | N/A | N/A | 76.89% | 1.0016 | 76.96% | 1.0025 | N/A | N/A |
| meta-llama/Llama-2-70b-hf | 79.64% | 79.53% | 0.9986 | 79.62% | 0.9997 | 80.05% | 1.0051 | N/A | N/A |
| meta-llama/Llama-2-7b-hf | 73.92% | N/A | N/A | 73.90% | 0.9997 | 73.51% | 0.9945 | N/A | N/A |
| mistralai/Mistral-7B-v0.1 | 75.90% | N/A | N/A | 75.80% | 0.9987 | 75.37% | 0.9930 | 75.82% | 0.9989 |
| THUDM/chatglm2-6b | 53.23% | N/A | N/A | 53.00% | 0.9957 | N/A | N/A | N/A | N/A |
| THUDM/chatglm3-6b | 59.09% | N/A | N/A | 59.03% | 0.9990 | N/A | N/A | 58.59% | 0.9915 |
| tiiuae/falcon-40b | 77.22% | 77.26% | 1.0005 | 77.18% | 0.9995 | 77.97% | 1.0097 | N/A | N/A |
| tiiuae/falcon-7b | 74.67% | 76.17% | 1.0201 | 74.73% | 1.0008 | 74.79% | 1.0016 | N/A | N/A |
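Each Ratio value in the table is simply the quantized accuracy divided by the FP32 baseline; a small helper (hypothetical, not part of any Intel API) makes the convention explicit:

```python
def accuracy_ratio(quant_acc: float, fp32_acc: float) -> float:
    """Quantized/FP32 accuracy ratio, rounded to 4 decimals as in the table."""
    return round(quant_acc / fp32_acc, 4)

# EleutherAI/gpt-j-6b, SQ INT8 vs FP32, values taken from the table above
print(accuracy_ratio(68.00, 68.31))  # 0.9955
```

A ratio of 0.99 or higher means the recipe stays within the 1% accuracy-loss target stated above.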