# Quantization on Client

## Introduction
For the `RTN` and `GPTQ` algorithms, we provide default algorithm configurations for different processor types (`client` and `server`). Generally, the lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.
## Get Started

Here, we take the `RTN` algorithm as an example to demonstrate the usage on a client machine.
```python
from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare
from neural_compressor.torch import load_empty_model

# Create the model from the saved state dict path in a memory-saving (empty-weight) mode.
model_state_dict_path = "/path/to/model/state/dict"
float_model = load_empty_model(model_state_dict_path)

# Fetch the default RTN configuration; the client or server variant is
# selected automatically based on the detected hardware.
quant_config = get_default_rtn_config()

prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```
> [!TIP]
> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as either `client` or `server` when calling `get_default_rtn_config`.
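For instance, a minimal sketch of requesting the client configuration explicitly (reusing the example above; only the configuration call changes):

```python
from neural_compressor.torch.quantization import get_default_rtn_config

# Force the lightweight client configuration instead of relying on
# automatic hardware detection ("server" selects the heavier default).
quant_config = get_default_rtn_config(processor_type="client")
```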
For Windows machines, run the following command to utilize all available cores automatically:
```bash
python main.py
```
> [!TIP]
> For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set `OMP_NUM_THREADS` explicitly. For processors with a hybrid architecture (including both P-cores and E-cores), it is recommended to bind the task to all P-cores using `taskset`.
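As a sketch, on a hypothetical hybrid CPU whose P-cores map to logical cores 0-15 (check the actual layout with `lscpu`), the script could be launched as follows:

```bash
# Assumed core layout; adjust the thread count and core list to your machine.
export OMP_NUM_THREADS=16          # match the number of P-cores
taskset -c 0-15 python main.py     # bind the process to the P-cores only
```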
RTN quantization is a quick process, finishing in tens of seconds and using several GB of RAM when working with 7B models, e.g., meta-llama/Llama-2-7b-chat-hf. However, for higher accuracy, the GPTQ algorithm is recommended, though it requires a longer quantization time.
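A rough sketch of the GPTQ path is given below, assuming `get_default_gptq_config` follows the same prepare/convert flow with an additional calibration step; the calibration dataloader is a placeholder you need to supply:

```python
from neural_compressor.torch.quantization import get_default_gptq_config, convert, prepare
from neural_compressor.torch import load_empty_model

float_model = load_empty_model("/path/to/model/state/dict")

# Default GPTQ configuration; like RTN, a client or server variant is provided.
quant_config = get_default_gptq_config()

prepared_model = prepare(float_model, quant_config)

# GPTQ requires calibration: feed a few representative batches through the model.
for batch in calibration_dataloader:  # placeholder, not defined in this sketch
    prepared_model(**batch)

quantized_model = convert(prepared_model)
```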