Quantization on Client

  1. Introduction

  2. Get Started

Introduction

For the RTN and GPTQ algorithms, we provide default algorithm configurations for different processor types (client and server). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.
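
For instance, each algorithm exposes a helper that returns these defaults. The snippet below is a minimal sketch; it assumes that get_default_gptq_config is importable from neural_compressor.torch.quantization alongside get_default_rtn_config, which is used in the Get Started example.

from neural_compressor.torch.quantization import get_default_rtn_config, get_default_gptq_config

# Each helper returns the default configuration for the detected processor type.
rtn_config = get_default_rtn_config()
gptq_config = get_default_gptq_config()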

Get Started

Here, we take the RTN algorithm as an example to demonstrate its usage on a client machine.

from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare
from neural_compressor.torch import load_empty_model

# Load the model architecture with empty weights to reduce memory usage.
model_state_dict_path = "/path/to/model/state/dict"
float_model = load_empty_model(model_state_dict_path)

# Fetch the default RTN configuration for the detected processor type.
quant_config = get_default_rtn_config()

# Prepare the model, then convert it to its quantized counterpart.
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)

[!TIP] By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify processor_type as either client or server when calling get_default_rtn_config.
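
For example, to force the lightweight client defaults regardless of the detected hardware, the processor type can be passed explicitly. This is a minimal sketch that assumes processor_type accepts the string values "client" and "server", as described in the tip above.

# Explicitly request the client-oriented defaults instead of relying on auto-detection.
quant_config = get_default_rtn_config(processor_type="client")
# Or request the server defaults.
# quant_config = get_default_rtn_config(processor_type="server")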

For Windows machines, run the following command to utilize all available cores automatically:

python main.py

[!TIP] For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set the OMP_NUM_THREADS environment variable explicitly. For processors with a hybrid architecture (including both P-cores and E-cores), it is recommended to bind the task to all P-cores using taskset.
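
For example, on a hypothetical machine whose P-cores are logical CPUs 0-7, combining the two suggestions above might look like the following; the thread count and core range are placeholders to adapt to your own CPU topology:

OMP_NUM_THREADS=8 taskset -c 0-7 python main.py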

RTN quantization is a quick process, finishing in tens of seconds and using several GB of RAM when working with 7B models, e.g., meta-llama/Llama-2-7b-chat-hf. However, for higher accuracy, the GPTQ algorithm is recommended, but be prepared for a longer quantization time.
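
To illustrate the GPTQ path, the sketch below follows the same prepare/convert flow as the RTN example. It assumes get_default_gptq_config provides the defaults and that calibration is performed by running forward passes on the prepared model before convert; calibration_dataloader is a placeholder for your own calibration data.

from neural_compressor.torch.quantization import get_default_gptq_config, convert, prepare
from neural_compressor.torch import load_empty_model

float_model = load_empty_model("/path/to/model/state/dict")
quant_config = get_default_gptq_config()
prepared_model = prepare(float_model, quant_config)

# Placeholder calibration loop: feed a few batches so GPTQ can collect statistics.
for batch in calibration_dataloader:
    prepared_model(batch)

quantized_model = convert(prepared_model)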