# Quantization on Client

## Introduction
For the `RTN` and `GPTQ` algorithms, we provide default algorithm configurations for different processor types (`client` and `server`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.
## Get Started
Here, we take the `RTN` algorithm as an example to demonstrate the usage on a client machine.
```python
from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare
from neural_compressor.torch import load_empty_model

model_state_dict_path = "/path/to/model/state/dict"

# Load the model without materializing the full weights, to keep memory usage low
float_model = load_empty_model(model_state_dict_path)

# Default RTN configuration; a client- or server-oriented setting is chosen from the detected hardware
quant_config = get_default_rtn_config()
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```
> [!TIP]
> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as either `client` or `server` when calling `get_default_rtn_config`.
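For example, to force the client-oriented defaults regardless of the detected hardware (a minimal sketch based on the option described in the tip above):

```python
from neural_compressor.torch.quantization import get_default_rtn_config

# Explicitly request the lightweight client defaults instead of relying on hardware detection
quant_config = get_default_rtn_config(processor_type="client")
```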
For Windows machines, run the following command to utilize all available cores automatically:
```bash
python main.py
```
> [!TIP]
> For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set `OMP_NUM_THREADS` explicitly. For processors with a hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.
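For example, a possible setup on a machine whose eight P-cores are exposed as logical CPUs 0-7 (the core IDs and thread count below are assumptions; check your own topology with a tool such as `lscpu`):

```bash
# Assumption: 8 P-cores mapped to logical CPUs 0-7; adjust for your machine
export OMP_NUM_THREADS=8        # match the number of cores the task is bound to
taskset -c 0-7 python main.py   # bind the quantization run to the P-cores
```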
RTN quantization is a quick process, finishing in tens of seconds and using several GB of RAM when working with 7B models, e.g., `meta-llama/Llama-2-7b-chat-hf`. However, for higher accuracy, the GPTQ algorithm is recommended, but be prepared for a longer quantization time.
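The GPTQ flow mirrors the RTN one above; the sketch below assumes a `get_default_gptq_config` helper analogous to `get_default_rtn_config`, plus a user-supplied calibration function (named `run_calibration` here purely for illustration) that feeds a few representative batches through the prepared model:

```python
from neural_compressor.torch.quantization import get_default_gptq_config, convert, prepare
from neural_compressor.torch import load_empty_model

float_model = load_empty_model("/path/to/model/state/dict")

# Default GPTQ configuration for the current processor type
quant_config = get_default_gptq_config()
prepared_model = prepare(float_model, quant_config)

# GPTQ is calibration-based: run a small amount of representative data through the model
run_calibration(prepared_model)  # hypothetical user-supplied helper, e.g. a loop over a small dataloader

quantized_model = convert(prepared_model)
```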