# LLMs Quantization Recipes
Intel® Neural Compressor supports advanced large language model (LLM) quantization technologies, including SmoothQuant (SQ) and Weight-Only Quantization (WOQ),
and has verified a list of LLMs on the 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch,
Intel® Extension for PyTorch, and Intel® Extension for Transformers.
This document publishes the specific recipes we achieved for popular LLMs, helping users quickly obtain an optimized LLM with less than 1% accuracy loss.
Notes:
- The quantization algorithms are provided by Intel® Neural Compressor, and the evaluation functions are provided by Intel® Extension for Transformers.
- The model list is continuously updated; expect to find more LLMs in the future.
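As background, SmoothQuant makes INT8 activation quantization feasible by migrating activation outliers into the weights through a per-channel scale, leaving the matmul mathematically unchanged. The following NumPy sketch illustrates only the core identity; all names are illustrative and this is not the Intel® Neural Compressor API:

```python
import numpy as np

# Illustrative sketch of the SmoothQuant idea: divide activations by a
# per-channel scale s and multiply the matching weight rows by s, so the
# product X @ W is preserved while activation outliers shrink.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8)) * np.array([1, 50, 1, 1, 1, 1, 1, 1.0])  # channel 1 is an outlier
W = rng.normal(size=(8, 3))

alpha = 0.5  # migration strength; SQ recipes tune this per model
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s           # activation outliers shrink
W_smooth = W * s[:, None]  # weights absorb the scale

# The matmul result is unchanged, but X_smooth has a much smaller
# dynamic range than X, which makes INT8 quantization far easier.
assert np.allclose(X @ W, X_smooth @ W_smooth)
```

In practice the recipes below amount to per-model choices of `alpha` (and fallbacks for layers that still lose accuracy).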
## Large Language Models Recipes
| Models | SQ INT8 | WOQ INT8 | WOQ INT4 |
|---|---|---|---|
EleutherAI/gpt-j-6b | ✔ | ✔ | ✔ |
facebook/opt-1.3b | ✔ | ✔ | ✔ |
facebook/opt-30b | ✔ | ✔ | ✔ |
meta-llama/Llama-2-7b-hf | WIP | ✔ | ✔ |
meta-llama/Llama-2-13b-hf | WIP | ✔ | ✔ |
meta-llama/Llama-2-70b-hf | ✔ | ✔ | ✔ |
tiiuae/falcon-7b | ✔ | ✔ | ✔ |
tiiuae/falcon-40b | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-13B-Chat | ✔ | ✔ | ✔ |
baichuan-inc/Baichuan2-7B-Chat | ✔ | ✔ | ✔ |
bigscience/bloom-1b7 | ✔ | ✔ | ✔ |
databricks/dolly-v2-12b | ✖ | ✔ | ✖ |
EleutherAI/gpt-neox-20b | ✖ | ✔ | ✔ |
mistralai/Mistral-7B-v0.1 | ✖ | ✔ | ✔ |
THUDM/chatglm2-6b | WIP | ✔ | WIP |
THUDM/chatglm3-6b | WIP | ✔ | ✔ |
Detailed recipes can be found HERE.
Notes:
- This model list comes from IPEX.
- The WIP recipes will be published soon.
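For reference, WOQ INT4 recipes are built on group-wise quantization schemes, ranging from plain round-to-nearest (RTN) to more advanced algorithms such as GPTQ and AutoRound. The sketch below shows symmetric group-wise INT4 RTN in NumPy; it is a minimal illustration of the scheme, not the Intel® Neural Compressor API, and all names are hypothetical:

```python
import numpy as np

def rtn_int4_groupwise(w, group_size=128):
    """Symmetric round-to-nearest INT4 quantization with per-group scales.

    Illustrative sketch: each group of `group_size` weights along the last
    axis shares one scale mapping its max magnitude to the INT4 limit (7),
    values are rounded to signed 4-bit levels [-8, 7], then dequantized.
    """
    out = np.empty_like(w)
    for start in range(0, w.shape[-1], group_size):
        g = w[..., start:start + group_size]
        scale = np.abs(g).max(axis=-1, keepdims=True) / 7.0
        scale = np.where(scale == 0, 1.0, scale)        # guard all-zero groups
        q = np.clip(np.round(g / scale), -8, 7)         # 4-bit signed levels
        out[..., start:start + group_size] = q * scale  # dequantize
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 256)).astype(np.float32)
w_q = rtn_int4_groupwise(w, group_size=128)
# Worst-case round-off per group is half a quantization step (scale / 2).
assert np.abs(w - w_q).max() <= np.abs(w).max() / 14 + 1e-6
```

Smaller group sizes reduce this round-off error at the cost of storing more scales, which is why group size is a key knob in the published recipes.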
## Large Language Models Accuracy
Accuracy (ACC) is measured on the lambada_openai task; each Ratio column is the quantized accuracy divided by the FP32 accuracy.

| Model | FP32 ACC | SQ INT8 ACC | SQ INT8 Ratio | WOQ INT8 ACC | WOQ INT8 Ratio | WOQ INT4 GPTQ ACC | WOQ INT4 GPTQ Ratio | WOQ INT4 AutoRound ACC | WOQ INT4 AutoRound Ratio |
|---|---|---|---|---|---|---|---|---|---|
baichuan-inc/Baichuan-13B-Chat | 67.57% | 67.86% | 1.0043 | 67.55% | 0.9997 | 67.46% | 0.9984 | N/A | N/A |
baichuan-inc/Baichuan2-13B-Chat | 71.51% | 75.51% | 1.0559 | 71.57% | 1.0008 | 71.45% | 0.9992 | 70.87% | 0.9911 |
baichuan-inc/Baichuan2-7B-Chat | 67.67% | 67.51% | 0.9976 | 67.61% | 0.9991 | 68.08% | 1.0061 | 67.18% | 0.9928 |
bigscience/bloom-1b7 | 46.34% | 47.97% | 1.0352 | 46.21% | 0.9972 | 47.00% | 1.0142 | N/A | N/A |
databricks/dolly-v2-12b | 64.35% | N/A | N/A | 63.92% | 0.9933 | N/A | N/A | N/A | N/A |
EleutherAI/gpt-j-6b | 68.31% | 68.00% | 0.9955 | 68.27% | 0.9994 | 68.23% | 0.9988 | 67.40% | 0.9867 |
EleutherAI/gpt-neox-20b | 72.33% | N/A | N/A | 72.29% | 0.9994 | 72.15% | 0.9975 | N/A | N/A |
facebook/opt-1.3b | 57.89% | 57.35% | 0.9907 | 58.12% | 1.0040 | 58.01% | 1.0021 | N/A | N/A |
facebook/opt-30b | 71.49% | 71.51% | 1.0003 | 71.53% | 1.0006 | 71.82% | 1.0046 | 71.43% | 0.9992 |
meta-llama/Llama-2-13b-hf | 76.77% | N/A | N/A | 76.89% | 1.0016 | 76.96% | 1.0025 | N/A | N/A |
meta-llama/Llama-2-70b-hf | 79.64% | 79.53% | 0.9986 | 79.62% | 0.9997 | 80.05% | 1.0051 | N/A | N/A |
meta-llama/Llama-2-7b-hf | 73.92% | N/A | N/A | 73.90% | 0.9997 | 73.51% | 0.9945 | N/A | N/A |
mistralai/Mistral-7B-v0.1 | 75.90% | N/A | N/A | 75.80% | 0.9987 | 75.37% | 0.9930 | 75.82% | 0.9989 |
THUDM/chatglm2-6b | 53.23% | N/A | N/A | 53.00% | 0.9957 | N/A | N/A | N/A | N/A |
THUDM/chatglm3-6b | 59.09% | N/A | N/A | 59.03% | 0.9990 | N/A | N/A | 58.59% | 0.9915 |
tiiuae/falcon-40b | 77.22% | 77.26% | 1.0005 | 77.18% | 0.9995 | 77.97% | 1.0097 | N/A | N/A |
tiiuae/falcon-7b | 74.67% | 76.17% | 1.0201 | 74.73% | 1.0008 | 74.79% | 1.0016 | N/A | N/A |
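As a sanity check, the Ratio values can be reproduced directly from the ACC columns (quantized accuracy divided by FP32 accuracy), using the first row of the table as an example:

```python
# Values taken from the Baichuan-13B-Chat row of the accuracy table.
fp32_acc = 0.6757     # FP32 ACC
sq_int8_acc = 0.6786  # SQ INT8 ACC

ratio = round(sq_int8_acc / fp32_acc, 4)
print(ratio)  # 1.0043, matching the table
```

A ratio of at least 0.99 corresponds to the "less than 1% accuracy loss" target stated above.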