## LLMs Quantization Recipes

Intel® Neural Compressor supports advanced quantization technologies for large language models (LLMs), including SmoothQuant (SQ) and Weight-Only Quantization (WOQ), and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with [PyTorch](https://pytorch.org/), [Intel® Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch), and [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers). This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM with accuracy loss limited to 1%.

> Notes:
>
> - The quantization algorithms are provided by [Intel® Neural Compressor](https://github.com/intel/neural-compressor) and the evaluation functions by [Intel® Extension for Transformers](https://github.com/intel/intel-extension-for-transformers).
> - The model list is continuously updated; expect to find more LLMs here in the future.

## Large Language Models Recipes

| Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
| :-----------------------------: | :-----: | :------: | :------: |
| EleutherAI/gpt-j-6b | ✔ | ✔ | ✔ |
| facebook/opt-1.3b | ✔ | ✔ | ✔ |
| facebook/opt-30b | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-7b-hf | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-13b-hf | ✔ | ✔ | ✔ |
| meta-llama/Llama-2-70b-hf | ✔ | ✔ | ✔ |
| tiiuae/falcon-7b | ✔ | ✔ | ✔ |
| tiiuae/falcon-40b | ✔ | ✔ | ✔ |
| baichuan-inc/Baichuan-13B-Chat | ✔ | ✔ | ✔ |
| baichuan-inc/Baichuan2-13B-Chat | ✔ | ✔ | ✔ |
| baichuan-inc/Baichuan2-7B-Chat | ✔ | ✔ | ✔ |
| bigscience/bloom-1b7 | ✔ | ✔ | ✔ |
| databricks/dolly-v2-12b | ✖ | ✔ | ✖ |
| EleutherAI/gpt-neox-20b | ✖ | ✔ | ✔ |
| mistralai/Mistral-7B-v0.1 | ✖ | ✔ | ✔ |
| THUDM/chatglm2-6b | WIP | ✔ | ✔ |
| THUDM/chatglm3-6b | WIP | ✔ | ✔ |

**Detailed recipes can be found [HERE](https://github.com/intel/intel-extension-for-transformers/blob/main/examples/huggingface/pytorch/text-generation/quantization/llm_quantization_recipes.html).**

> Notes:
>
> - This model list comes from [IPEX](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/tutorials/llm.html).
> - The WIP recipes will be published soon.
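In practice, each recipe boils down to choosing one of the two configuration families above per model. Below is a minimal sketch using the Transformers-style API of Intel® Extension for Transformers as it looked around the time of writing; `WeightOnlyQuantConfig`, `SmoothQuantConfig`, and their exact arguments have changed across releases, so treat the linked recipe scripts as the source of truth for per-model settings.

```python
# A minimal sketch, assuming the Transformers-style API of
# intel_extension_for_transformers (circa v1.3); class names and arguments
# may differ in your release, so consult the linked recipes for exact settings.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    SmoothQuantConfig,
    WeightOnlyQuantConfig,
)

model_name = "EleutherAI/gpt-j-6b"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Weight-Only Quantization (WOQ): only the weights are quantized while
# activations stay in floating point. load_in_4bit is the one-liner path;
# WeightOnlyQuantConfig exposes the algorithm and weight dtype explicitly.
woq_model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
# or, with explicit control (argument values vary by release):
# woq_config = WeightOnlyQuantConfig(weight_dtype="int4_fullrange")
# woq_model = AutoModelForCausalLM.from_pretrained(
#     model_name, quantization_config=woq_config
# )

# SmoothQuant (SQ): weights and activations both go to INT8, which requires
# a calibration pass; alpha is the usual smoothing-strength knob.
sq_config = SmoothQuantConfig(tokenizer=tokenizer, alpha=0.5)
sq_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=sq_config
)

# Quick smoke test of the quantized model.
inputs = tokenizer("Once upon a time", return_tensors="pt")
print(tokenizer.decode(woq_model.generate(**inputs, max_new_tokens=32)[0]))
```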
## Large Language Models Accuracy

Accuracy (ACC) is measured on the lambada_openai task; each Ratio column is the quantized accuracy divided by the FP32 accuracy.

| Model | FP32 ACC | SQ INT8 ACC | Ratio | WOQ INT8 ACC | Ratio | WOQ INT4 GPTQ ACC | Ratio | WOQ INT4 AutoRound ACC | Ratio |
| :-----------------------------: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| baichuan-inc/Baichuan-13B-Chat | 67.57% | 68.23% | 1.0098 | 67.57% | 1.0000 | 67.84% | 1.0040 | NA | NA |
| baichuan-inc/Baichuan2-13B-Chat | 71.51% | 70.89% | 0.9913 | 71.53% | 1.0003 | 71.76% | 1.0035 | NA | NA |
| baichuan-inc/Baichuan2-7B-Chat | 67.67% | 67.96% | 1.0043 | 67.59% | 0.9988 | 67.24% | 0.9936 | 67.42% | 0.9963 |
| bigscience/bloom-1b7 | 46.34% | 47.99% | 1.0356 | 46.38% | 1.0009 | 46.19% | 0.9968 | NA | NA |
| databricks/dolly-v2-12b | 64.35% | NA | NA | 64.10% | 0.9961 | NA | NA | NA | NA |
| EleutherAI/gpt-j-6b | 68.31% | 68.33% | 1.0003 | 68.23% | 0.9988 | 68.79% | 1.0070 | 68.43% | 1.0018 |
| EleutherAI/gpt-neox-20b | 72.33% | NA | NA | 72.25% | 0.9989 | 71.96% | 0.9949 | NA | NA |
| facebook/opt-1.3b | 57.89% | 57.54% | 0.9940 | 58.08% | 1.0033 | 58.57% | 1.0117 | NA | NA |
| facebook/opt-30b | 71.49% | 71.51% | 1.0003 | 71.51% | 1.0003 | 71.82% | 1.0046 | 72.11% | 1.0087 |
| meta-llama/Llama-2-13b-hf | 76.77% | 76.25% | 0.9932 | 76.75% | 0.9997 | 77.43% | 1.0086 | 76.75% | 0.9997 |
| meta-llama/Llama-2-70b-hf | 79.64% | 79.55% | 0.9989 | 79.57% | 0.9991 | 80.09% | 1.0057 | 79.97% | 1.0041 |
| meta-llama/Llama-2-7b-hf | 73.92% | 73.45% | 0.9936 | 73.96% | 1.0005 | 73.45% | 0.9936 | 73.49% | 0.9942 |
| mistralai/Mistral-7B-v0.1 | 75.90% | NA | NA | 75.80% | 0.9987 | 76.13% | 1.0030 | 75.61% | 0.9962 |
| THUDM/chatglm2-6b | 53.23% | NA | NA | 53.19% | 0.9992 | 52.77% | 0.9914 | 53.35% | 1.0023 |
| THUDM/chatglm3-6b | 59.09% | NA | NA | 59.01% | 0.9986 | NA | NA | 58.61% | 0.9919 |
| tiiuae/falcon-40b | 77.22% | 77.04% | 0.9977 | 77.22% | 1.0000 | 77.94% | 1.0093 | 78.79% | 1.0203 |
| tiiuae/falcon-7b | 74.67% | 76.44% | 1.0237 | 74.77% | 1.0013 | 75.00% | 1.0044 | NA | NA |
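To spot-check these numbers, the lambada_openai task can also be run through EleutherAI's lm-evaluation-harness, as sketched below. This is an illustration rather than the exact pipeline: the table above was produced with the evaluation functions in Intel® Extension for Transformers, and harness versions differ in backend names and result keys, so small deviations are expected.

```python
# A rough sketch of measuring lambada_openai accuracy with EleutherAI's
# lm-evaluation-harness (lm_eval >= 0.4 API assumed). The exact result key
# varies across harness versions ("acc" vs. "acc,none"), so the full result
# dict is printed here instead of a single field.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                   # Hugging Face causal LM backend
    model_args="pretrained=EleutherAI/gpt-j-6b",  # swap in a quantized checkpoint path
    tasks=["lambada_openai"],
)
print(results["results"]["lambada_openai"])
```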