## LLMs Quantization Recipes

Intel® Neural Compressor supports advanced quantization technologies for large language models (LLMs), including SmoothQuant (SQ) and Weight-Only Quantization (WOQ), and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with [PyTorch](https://pytorch.org/), [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch), and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers). This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM with at most 1% accuracy loss.

> Notes:
>
> - The quantization algorithms are provided by [Intel® Neural Compressor](https://github.com/intel/neural-compressor), and the evaluation functions are provided by [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).
> - The model list is continuously updated; expect to find more LLMs here in the future.
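As a rough illustration of the SQ recipe family (not the tuned per-model recipes this document publishes), SmoothQuant can be enabled through Intel® Neural Compressor's 2.x post-training quantization API. The `alpha` value, model, and calibration dataloader below are placeholders:

```python
# Sketch only: enabling SmoothQuant via Intel Neural Compressor's 2.x API.
# alpha=0.5 is a generic default; the published recipes tune it per model.
from neural_compressor import PostTrainingQuantConfig, quantization

conf = PostTrainingQuantConfig(
    recipes={
        "smooth_quant": True,
        "smooth_quant_args": {"alpha": 0.5},  # placeholder, recipe-specific
    }
)

# fp32_model and calib_dataloader are placeholders for a user-supplied
# PyTorch model and calibration dataloader:
# q_model = quantization.fit(fp32_model, conf, calib_dataloader=calib_dataloader)
# q_model.save("./saved_results")
```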
## Large Language Models Recipes

| Models                          | SQ INT8 | WOQ INT8 | WOQ INT4 |
| :-----------------------------: | :-----: | :------: | :------: |
| EleutherAI/gpt-j-6b             | ✔       | ✔        | ✔        |
| facebook/opt-1.3b               | ✔       | ✔        | ✔        |
| facebook/opt-30b                | ✔       | ✔        | ✔        |
| meta-llama/Llama-2-7b-hf        | WIP     | ✔        | ✔        |
| meta-llama/Llama-2-13b-hf       | WIP     | ✔        | ✔        |
| meta-llama/Llama-2-70b-hf       | ✔       | ✔        | ✔        |
| tiiuae/falcon-7b                | ✔       | ✔        | ✔        |
| tiiuae/falcon-40b               | ✔       | ✔        | ✔        |
| baichuan-inc/Baichuan-13B-Chat  | ✔       | ✔        | ✔        |
| baichuan-inc/Baichuan2-13B-Chat | ✔       | ✔        | ✔        |
| baichuan-inc/Baichuan2-7B-Chat  | ✔       | ✔        | ✔        |
| bigscience/bloom-1b7            | ✔       | ✔        | ✔        |
| databricks/dolly-v2-12b         | ✖       | ✔        | ✖        |
| EleutherAI/gpt-neox-20b         | ✖       | ✔        | ✔        |
| mistralai/Mistral-7B-v0.1       | ✖       | ✔        | ✔        |
| THUDM/chatglm2-6b               | WIP     | ✔        | WIP      |
| THUDM/chatglm3-6b               | WIP     | ✔        | ✔        |

**Detailed recipes can be found [HERE](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/llm_quantization_recipes.html).**

> Notes:
>
> - This model list comes from [IPEX](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/llm.html).
> - The WIP recipes will be published soon.

## Large Language Models Accuracy
All accuracies are measured on the lambada_openai task; each "Ratio" column is the quantized accuracy divided by the FP32 accuracy.

| Model | FP32 ACC | SQ INT8 ACC | SQ INT8 Ratio | WOQ INT8 ACC | WOQ INT8 Ratio | WOQ INT4 GPTQ ACC | WOQ INT4 GPTQ Ratio | WOQ INT4 AutoRound ACC | WOQ INT4 AutoRound Ratio |
| :--- | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| baichuan-inc/Baichuan-13B-Chat | 67.57% | 67.86% | 1.0043 | 67.55% | 0.9997 | 67.46% | 0.9984 | N/A | N/A |
| baichuan-inc/Baichuan2-13B-Chat | 71.51% | 75.51% | 1.0559 | 71.57% | 1.0008 | 71.45% | 0.9992 | 70.87% | 0.9911 |
| baichuan-inc/Baichuan2-7B-Chat | 67.67% | 67.51% | 0.9976 | 67.61% | 0.9991 | 68.08% | 1.0061 | 67.18% | 0.9928 |
| bigscience/bloom-1b7 | 46.34% | 47.97% | 1.0352 | 46.21% | 0.9972 | 47.00% | 1.0142 | N/A | N/A |
| databricks/dolly-v2-12b | 64.35% | N/A | N/A | 63.92% | 0.9933 | N/A | N/A | N/A | N/A |
| EleutherAI/gpt-j-6b | 68.31% | 68.00% | 0.9955 | 68.27% | 0.9994 | 68.23% | 0.9988 | 67.40% | 0.9867 |
| EleutherAI/gpt-neox-20b | 72.33% | N/A | N/A | 72.29% | 0.9994 | 72.15% | 0.9975 | N/A | N/A |
| facebook/opt-1.3b | 57.89% | 57.35% | 0.9907 | 58.12% | 1.0040 | 58.01% | 1.0021 | N/A | N/A |
| facebook/opt-30b | 71.49% | 71.51% | 1.0003 | 71.53% | 1.0006 | 71.82% | 1.0046 | 71.43% | 0.9992 |
| meta-llama/Llama-2-13b-hf | 76.77% | N/A | N/A | 76.89% | 1.0016 | 76.96% | 1.0025 | N/A | N/A |
| meta-llama/Llama-2-70b-hf | 79.64% | 79.53% | 0.9986 | 79.62% | 0.9997 | 80.05% | 1.0051 | N/A | N/A |
| meta-llama/Llama-2-7b-hf | 73.92% | N/A | N/A | 73.90% | 0.9997 | 73.51% | 0.9945 | N/A | N/A |
| mistralai/Mistral-7B-v0.1 | 75.90% | N/A | N/A | 75.80% | 0.9987 | 75.37% | 0.9930 | 75.82% | 0.9989 |
| THUDM/chatglm2-6b | 53.23% | N/A | N/A | 53.00% | 0.9957 | N/A | N/A | N/A | N/A |
| THUDM/chatglm3-6b | 59.09% | N/A | N/A | 59.03% | 0.9990 | N/A | N/A | 58.59% | 0.9915 |
| tiiuae/falcon-40b | 77.22% | 77.26% | 1.0005 | 77.18% | 0.9995 | 77.97% | 1.0097 | N/A | N/A |
| tiiuae/falcon-7b | 74.67% | 76.17% | 1.0201 | 74.73% | 1.0008 | 74.79% | 1.0016 | N/A | N/A |
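Each Ratio value in the table is simply the quantized accuracy divided by the FP32 baseline; a small helper (hypothetical, not part of any Intel API) makes the convention explicit:

```python
def accuracy_ratio(quant_acc: float, fp32_acc: float) -> float:
    """Quantized/FP32 accuracy ratio, rounded to 4 decimals as in the table."""
    return round(quant_acc / fp32_acc, 4)

# EleutherAI/gpt-j-6b, SQ INT8 vs FP32, values taken from the table above
print(accuracy_ratio(68.00, 68.31))  # 0.9955
```

A ratio of 0.99 or higher means the recipe stays within the 1% accuracy-loss target stated above.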