Layer Wise Quantization (LWQ)
=====

1. [Introduction](#introduction)
2. [Supported Framework Model Matrix](#supported-framework-model-matrix)
3. [Examples](#examples)

## Introduction

Large language models (LLMs) have shown exceptional performance across various tasks, but their substantial parameter size poses significant challenges for deployment. Layer-wise quantization (LWQ) can greatly reduce the memory footprint of quantizing LLMs, usually by 80-90%, which means users can quantize LLMs even on a single node with a GPU or CPU. Because only one layer needs to be held in memory at a time, the model can be quantized on memory-constrained devices, making quantization of huge LLMs possible. A minimal sketch of this layer-by-layer flow follows the support matrix below.

*Figure 1: The process of layer-wise quantization for a PyTorch model. Grey represents empty parameters and blue represents parameters to be quantized. Each rectangle inside the model represents one layer.*

*Figure 2: The process of layer-wise quantization for an ONNX model. The LLM graph is split into several parts, and each subgraph is quantized in turn.*

## Supported Framework Model Matrix
| Types/Framework | PyTorch | ONNX Runtime |
| --- | --- | --- |
| W8A8 Post Training Static Quantization | | |
| Weight-only Quantization: RTN | | |
| Weight-only Quantization: AWQ | | |
| Weight-only Quantization: GPTQ | | |
| Weight-only Quantization: TEQ | | |
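To make the layer-by-layer flow from Figure 1 concrete, here is a minimal, self-contained sketch of the idea: each layer's weights are loaded, quantized, and released before the next layer is touched, so peak memory stays at roughly one layer. This is an illustration only, not Neural Compressor's implementation; `load_layer_weights`, `save_quantized_layer`, and the round-to-nearest scheme are hypothetical placeholders.

```python
import torch


def rtn_quantize(weight: torch.Tensor, n_bits: int = 8):
    """Illustrative symmetric round-to-nearest quantization of one weight tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = weight.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale


def layer_wise_quantize(layer_names, load_layer_weights, save_quantized_layer):
    """Quantize a model one layer at a time so only one layer is ever in memory.

    `load_layer_weights(name)` and `save_quantized_layer(name, result)` are
    hypothetical callbacks that stream a single layer's tensors from/to disk.
    """
    for name in layer_names:
        weights = load_layer_weights(name)  # only this layer's tensors are resident
        result = {key: rtn_quantize(tensor) for key, tensor in weights.items()}
        save_quantized_layer(name, result)  # persist before moving on
        del weights, result  # free memory before loading the next layer
```

Neural Compressor applies the same principle through the `layer_wise_quant` recipe shown in the examples below.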
## Examples

#### PyTorch framework example

```python
from neural_compressor import PostTrainingQuantConfig, quantization
from neural_compressor.adaptor.torch_utils.layer_wise_quant import load_empty_model

# `load` restores the saved weight-only model; it is assumed to come from
# neural_compressor.utils.pytorch.
from neural_compressor.utils.pytorch import load

# Build the model structure without materializing its weights in memory.
fp32_model = load_empty_model(model_name_or_path, torchscript=True)
conf = PostTrainingQuantConfig(
    approach="weight_only",
    recipes={
        "layer_wise_quant": True,  # enable layer-wise quantization
        "rtn_args": {"enable_full_range": True},
    },
)

q_model = quantization.fit(
    fp32_model,
    conf,
    calib_dataloader=eval_dataloader,
    eval_func=lambda x: 0.1,  # dummy evaluation function; no real accuracy check is run
)
output_dir = "./saved_model"
q_model.save(output_dir)
q_model = load(output_dir, fp32_model, weight_only=True, layer_wise=True)
```

#### ONNX Runtime framework example

```python
from neural_compressor import quantization, PostTrainingQuantConfig

conf = PostTrainingQuantConfig(recipes={"layer_wise_quant": True})
q_model = quantization.fit(fp32_model_path, conf, calib_dataloader=dataloader)
q_model.save(int8_model_path)
```

Refer to the [ONNX Runtime llama-2 LWQ example](../../examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/weight_only).
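Both examples assume the caller provides the undefined names (`model_name_or_path`, `eval_dataloader`, `fp32_model_path`, `int8_model_path`, `dataloader`). As a rough illustration, a calibration dataloader for the PyTorch example might be built from a few representative prompts with standard `transformers` and `torch` APIs; the checkpoint name, texts, batch size, and sequence length below are arbitrary assumptions, and the exact batch format your model expects may differ.

```python
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# Placeholder checkpoint; use the same one passed to load_empty_model above.
model_name_or_path = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

# A handful of representative prompts; replace with a real calibration corpus.
calib_texts = [
    "Layer-wise quantization keeps only one layer in memory at a time.",
    "Large language models can be quantized on a single CPU or GPU node.",
]


def collate_fn(batch):
    # batch_size=1, so no padding is required
    return tokenizer(batch, truncation=True, max_length=128, return_tensors="pt")["input_ids"]


eval_dataloader = DataLoader(calib_texts, batch_size=1, collate_fn=collate_fn)
```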