Layer Wise Quantization (LWQ)
=====

1. [Introduction](#introduction)

2. [Supported Framework Model Matrix](#supported-framework-model-matrix)

3. [Examples](#examples)

## Introduction

Large language models (LLMs) have shown exceptional performance across various tasks, but their substantial parameter size poses significant challenges for deployment. Layer-wise quantization (LWQ) can greatly reduce the memory footprint of LLMs during quantization, usually by 80-90%, which means that users can quantize LLMs even on a single node using a GPU or CPU. Because the model can be quantized on memory-constrained devices, quantization of very large LLMs becomes possible.

*Figure 1: The process of layer-wise quantization for a PyTorch model. Grey denotes empty parameters and blue denotes parameters that need to be quantized. Every rectangle inside the model represents one layer.*

*Figure 2: The process of layer-wise quantization for an ONNX model. The graph of the LLM is split into several parts, and each subgraph is quantized in turn.*

## Supported Framework Model Matrix
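| Types/Framework | Algorithm | PyTorch | ONNX Runtime |
|---|---|---|---|
| W8A8 Post Training Static Quantization | | ✔ | ✔ |
| Weight-only Quantization | RTN | ✔ | ✕ |
| Weight-only Quantization | AWQ | ✕ | ✕ |
| Weight-only Quantization | GPTQ | ✔ | ✕ |
| Weight-only Quantization | TEQ | ✕ | ✕ |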
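As a rough illustration of the layer-by-layer idea described in the Introduction, the sketch below quantizes a toy "model" one layer at a time, so that only a single layer's weights are held in memory at once. It is not the library's API: the helper names `quantize_rtn_int8` and `layer_wise_quantize` are hypothetical, the JSON files are an assumed stand-in for real checkpoint shards, and the quantizer is a plain symmetric per-tensor round-to-nearest (RTN) scheme chosen only for simplicity.

```python
import json
import tempfile
from pathlib import Path


def quantize_rtn_int8(weights):
    """Symmetric per-tensor round-to-nearest quantization to the int8 range."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate floating-point weights."""
    return [v * scale for v in q]


def layer_wise_quantize(layer_files, out_dir):
    """Quantize layers one by one (hypothetical helper, illustration only)."""
    out_dir = Path(out_dir)
    quantized_files = []
    for path in layer_files:
        path = Path(path)
        weights = json.loads(path.read_text())   # load only this layer
        q, scale = quantize_rtn_int8(weights)
        out_path = out_dir / (path.stem + ".int8.json")
        out_path.write_text(json.dumps({"q": q, "scale": scale}))
        quantized_files.append(out_path)
        del weights, q                           # release before the next layer
    return quantized_files


# Toy data: two "layers" stored as separate files, as an assumed stand-in
# for a checkpoint that is saved layer by layer.
toy_layers = {"layer0": [0.5, -1.0, 0.25], "layer1": [2.0, 0.0, -0.5]}

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    files = []
    for name, layer_weights in toy_layers.items():
        layer_path = tmp / f"{name}.json"
        layer_path.write_text(json.dumps(layer_weights))
        files.append(layer_path)
    outputs = layer_wise_quantize(files, tmp)
    results = [json.loads(p.read_text()) for p in outputs]

for name, res in zip(toy_layers, results):
    print(name, res["q"], dequantize(res["q"], res["scale"]))
```

Peak memory in this sketch is bounded by the largest single layer rather than by the whole model, which is the property that lets layer-wise quantization run on memory-constrained devices. A real implementation would operate on framework tensors and checkpoint shards instead of JSON lists.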