## LLMs Quantization Recipes

Intel® Neural Compressor supports advanced quantization technologies for large language models (LLMs), including SmoothQuant (SQ) and Weight-Only Quantization (WOQ), and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with [PyTorch](https://pytorch.org/), [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch), and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers). This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM with accuracy loss limited to 1%.

> Notes:
>
> - The quantization algorithms are provided by [Intel® Neural Compressor](https://github.com/intel/neural-compressor) and the evaluation functions by [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).
> - The model list is continuously updated; expect to find more LLMs here in the future.

## Large Language Models Recipes

| Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
| :-----------------------------: | :-----: | :------: | :------: |
| EleutherAI/gpt-j-6b | ✔ | ✔ | ✔ |
| facebook/opt-1.3b | ✔ | ✔ | ✔ |
| facebook/opt-30b | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-7b-hf | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-13b-hf | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-70b-hf | ✔ | ✔ | ✔ |
| tiiuae/falcon-7b | ✔ | ✔ | ✔ |
| tiiuae/falcon-40b | ✔ | ✔ | ✔ |
| baichuan-inc/Baichuan-13B-Chat | ✔ | ✔ | ✔ |
| baichuan-inc/Baichuan2-13B-Chat | ✔ | ✔ | ✔ |
| baichuan-inc/Baichuan2-7B-Chat | ✔ | ✔ | ✔ |
| bigscience/bloom-1b7 | ✔ | ✔ | ✔ |
| databricks/dolly-v2-12b | ✖ | ✔ | ✖ |
| EleutherAI/gpt-neox-20b | ✖ | ✔ | ✔ |
| mistralai/Mistral-7B-v0.1 | ✖ | ✔ | ✔ |
| THUDM/chatglm2-6b | WIP | ✔ | ✔ |
| THUDM/chatglm3-6b | WIP | ✔ | ✔ |

**Detailed recipes can be found [HERE](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/llm_quantization_recipes.html).**

> Notes:
>
> - This model list comes from [IPEX](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/llm.html).
> - The WIP recipes will be published soon.
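In practice, each recipe boils down to choosing one of the two configuration families above per model. Below is a minimal sketch using the Transformers-style API of Intel® Extension for Transformers as it looked around the time of writing; `WeightOnlyQuantConfig`, `SmoothQuantConfig`, and their exact arguments have changed across releases, so treat the linked recipe scripts as the source of truth for per-model settings.

```python
# A minimal sketch, assuming the Transformers-style API of
# intel_extension_for_transformers (circa v1.3); class names and arguments
# may differ in your release, so consult the linked recipes for exact settings.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    SmoothQuantConfig,
    WeightOnlyQuantConfig,
)

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Weight-Only Quantization (WOQ): only the weights are quantized while
# activations stay in floating point. load_in_4bit is the one-liner path;
# WeightOnlyQuantConfig exposes the algorithm and weight dtype explicitly.
woq_model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
# or, with explicit control (argument values vary by release):
# woq_config = WeightOnlyQuantConfig(weight_dtype="int4_fullrange")
# woq_model = AutoModelForCausalLM.from_pretrained(
#     model_name, quantization_config=woq_config
# )

# SmoothQuant (SQ): weights and activations both go to INT8, which requires
# a calibration pass; alpha is the usual smoothing-strength knob.
sq_config = SmoothQuantConfig(tokenizer=tokenizer, alpha=0.5)
sq_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=sq_config
)

# Quick smoke test of the quantized model.
inputs = tokenizer("Once upon a time", return_tensors="pt")
print(tokenizer.decode(woq_model.generate(**inputs, max_new_tokens=32)[0]))
```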
## Large Language Models Accuracy

Accuracy (ACC) is measured on the lambada_openai task; each Ratio column is the quantized accuracy divided by the FP32 accuracy.

| Model | FP32 ACC | SQ INT8 ACC | Ratio | WOQ INT8 ACC | Ratio | WOQ INT4 GPTQ ACC | Ratio | WOQ INT4 AutoRound ACC | Ratio |
| :-----------------------------: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| baichuan-inc/Baichuan-13B-Chat | 67.57% | 68.23% | 1.0098 | 67.57% | 1.0000 | 67.84% | 1.0040 | NA | NA |
| baichuan-inc/Baichuan2-13B-Chat | 71.51% | 70.89% | 0.9913 | 71.53% | 1.0003 | 71.76% | 1.0035 | NA | NA |
| baichuan-inc/Baichuan2-7B-Chat | 67.67% | 67.96% | 1.0043 | 67.59% | 0.9988 | 67.24% | 0.9936 | 67.42% | 0.9963 |
| bigscience/bloom-1b7 | 46.34% | 47.99% | 1.0356 | 46.38% | 1.0009 | 46.19% | 0.9968 | NA | NA |
| databricks/dolly-v2-12b | 64.35% | NA | NA | 64.10% | 0.9961 | NA | NA | NA | NA |
| EleutherAI/gpt-j-6b | 68.31% | 68.33% | 1.0003 | 68.23% | 0.9988 | 68.79% | 1.0070 | 68.43% | 1.0018 |
| EleutherAI/gpt-neox-20b | 72.33% | NA | NA | 72.25% | 0.9989 | 71.96% | 0.9949 | NA | NA |
| facebook/opt-1.3b | 57.89% | 57.54% | 0.9940 | 58.08% | 1.0033 | 58.57% | 1.0117 | NA | NA |
| facebook/opt-30b | 71.49% | 71.51% | 1.0003 | 71.51% | 1.0003 | 71.82% | 1.0046 | 72.11% | 1.0087 |
| meta-llama/Llama-2-13b-hf | 76.77% | 76.25% | 0.9932 | 76.75% | 0.9997 | 77.43% | 1.0086 | 76.75% | 0.9997 |
| meta-llama/Llama-2-70b-hf | 79.64% | 79.55% | 0.9989 | 79.57% | 0.9991 | 80.09% | 1.0057 | 79.97% | 1.0041 |
| meta-llama/Llama-2-7b-hf | 73.92% | 73.45% | 0.9936 | 73.96% | 1.0005 | 73.45% | 0.9936 | 73.49% | 0.9942 |
| mistralai/Mistral-7B-v0.1 | 75.90% | NA | NA | 75.80% | 0.9987 | 76.13% | 1.0030 | 75.61% | 0.9962 |
| THUDM/chatglm2-6b | 53.23% | NA | NA | 53.19% | 0.9992 | 52.77% | 0.9914 | 53.35% | 1.0023 |
| THUDM/chatglm3-6b | 59.09% | NA | NA | 59.01% | 0.9986 | NA | NA | 58.61% | 0.9919 |
| tiiuae/falcon-40b | 77.22% | 77.04% | 0.9977 | 77.22% | 1.0000 | 77.94% | 1.0093 | 78.79% | 1.0203 |
| tiiuae/falcon-7b | 74.67% | 76.44% | 1.0237 | 74.77% | 1.0013 | 75.00% | 1.0044 | NA | NA |
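To spot-check these numbers, the lambada_openai task can also be run through EleutherAI's lm-evaluation-harness, as sketched below. This is an illustration rather than the exact pipeline: the table above was produced with the evaluation functions in Intel® Extension for Transformers, and harness versions differ in backend names and result keys, so small deviations are expected.

```python
# A rough sketch of measuring lambada_openai accuracy with EleutherAI's
# lm-evaluation-harness (lm_eval >= 0.4 API assumed). The exact result key
# varies across harness versions ("acc" vs. "acc,none"), so the full result
# dict is printed here instead of a single field.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                   # Hugging Face causal LM backend
    model_args="pretrained=EleutherAI/gpt-j-6b",  # swap in a quantized checkpoint path
    tasks=["lambada_openai"],
)
print(results["results"]["lambada_openai"])
```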