neural_compressor.torch.quantization.save_load_entry
Intel Neural Compressor PyTorch save/load entry for all algorithms.
Functions
- save: Save quantized model.
- load: Load quantized model.
Module Contents
- neural_compressor.torch.quantization.save_load_entry.save(model, checkpoint_dir='saved_results', format='default')[source]
Save quantized model.
- Parameters:
model (torch.nn.Module, or a TorchScript model with IPEX, or an FX graph with pt2e, optional) – Quantized model.
checkpoint_dir (str, optional) – Checkpoint directory. Defaults to “saved_results”.
format (str, optional) – ‘default’ for saving an INC quantized model; ‘huggingface’ for saving a HuggingFace WOQ causal language model. Defaults to “default”.
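As a minimal sketch of the save API above (assuming `q_model` is a model that has already been quantized with INC's PyTorch quantization flow; the name `q_model` is illustrative, not from this page):

```python
# Sketch only: q_model is assumed to be a model already quantized
# with Intel Neural Compressor's PyTorch quantization flow.
from neural_compressor.torch.quantization import save

# Persist the quantized model to the default checkpoint directory.
save(q_model, checkpoint_dir="saved_results")
```

The same directory name can then be passed as `model_name_or_path` to `load` to restore the model.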
- neural_compressor.torch.quantization.save_load_entry.load(model_name_or_path, original_model=None, format='default', device='cpu', **kwargs)[source]
Load quantized model.
- Load an INC quantized model from local storage.
  - case 1: WOQ

    ```python
    from neural_compressor.torch.quantization import load
    load(model_name_or_path="saved_results", original_model=fp32_model)
    ```

  - case 2: INT8/FP8

    ```python
    from neural_compressor.torch.quantization import load
    load(model_name_or_path="saved_result", original_model=fp32_model)
    ```

  - case 3: TorchScript (IPEX)

    ```python
    from neural_compressor.torch.quantization import load
    load(model_name_or_path="saved_result")
    ```

- Load a HuggingFace quantized model, including GPTQ models and INC quantized models upstreamed to the HF model hub.
  - case 1: WOQ

    ```python
    from neural_compressor.torch.quantization import load
    load(model_name_or_path=model_name_or_path, format="huggingface")
    ```
- Parameters:
model_name_or_path (str) – Torch checkpoint directory or HuggingFace model_name_or_path. If ‘format’ is set to ‘huggingface’, it is the HuggingFace model_name_or_path; if ‘format’ is set to ‘default’, it is the checkpoint directory (‘checkpoint_dir’). This parameter must not be None. It works together with the ‘original_model’ parameter to load an INC quantized model from local storage.
original_model (torch.nn.Module, or a TorchScript model with IPEX, or an FX graph with pt2e, optional) – Original model before quantization. Required if ‘format’ is set to ‘default’ and the model is not a TorchScript model. Defaults to None.
format (str, optional) – ‘default’ for loading an INC quantized model; ‘huggingface’ for loading a HuggingFace WOQ causal language model. Defaults to “default”.
device (str, optional) – ‘cpu’ or ‘hpu’. Specifies the device the model will be loaded onto. Currently only used for weight-only quantization.
kwargs (remaining dictionary of keyword arguments, optional) – Remaining keyword arguments for loading HuggingFace models, passed to the HuggingFace model’s __init__ method, such as ‘trust_remote_code’ and ‘revision’.
- Returns:
The quantized model.