neural_compressor.torch.quantization.save_load_entry

Intel Neural Compressor PyTorch save/load entry for all algorithms.

Functions

save(model[, checkpoint_dir, format])

Save quantized model.

load(model_name_or_path[, original_model, format, device])

Load quantized model.

Module Contents

neural_compressor.torch.quantization.save_load_entry.save(model, checkpoint_dir='saved_results', format='default')[source]

Save quantized model.

Parameters:
  • model (torch.nn.Module, TorchScript model (IPEX), or FX graph (PT2E), optional) – Quantized model to save.

  • checkpoint_dir (str, optional) – Checkpoint directory. Defaults to "saved_results".

  • format (str, optional) – 'default' for saving an INC quantized model; 'huggingface' for saving a Hugging Face WOQ causal language model. Defaults to "default".
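A minimal usage sketch for save (hedged: `quantized_model` is a placeholder for a model produced by an INC quantization workflow, and the import is kept inside the function so the sketch can be defined without neural_compressor installed):

```python
# Hedged sketch: persist an INC-quantized model to disk.
# `quantized_model` is assumed to come from an INC quantization workflow;
# it is not defined here.
def save_quantized(quantized_model, checkpoint_dir="saved_results"):
    """Write the quantized model's checkpoint files under checkpoint_dir."""
    # Deferred import so this module stays importable without neural_compressor.
    from neural_compressor.torch.quantization import save
    save(quantized_model, checkpoint_dir=checkpoint_dir, format="default")
```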

neural_compressor.torch.quantization.save_load_entry.load(model_name_or_path, original_model=None, format='default', device='cpu', **kwargs)[source]

Load quantized model.

  1. Load an INC quantized model from a local checkpoint.

    case 1: WOQ

      from neural_compressor.torch.quantization import load
      load(model_name_or_path="saved_results", original_model=fp32_model)

    case 2: INT8/FP8

      from neural_compressor.torch.quantization import load
      load(model_name_or_path="saved_results", original_model=fp32_model)

    case 3: TorchScript (IPEX)

      from neural_compressor.torch.quantization import load
      load(model_name_or_path="saved_results")

  2. Load a Hugging Face quantized model, including GPTQ models and upstreamed INC quantized models on the HF model hub.

    case 1: WOQ

      from neural_compressor.torch.quantization import load
      load(model_name_or_path=model_name_or_path, format="huggingface")

Parameters:
  • model_name_or_path (str) – Torch checkpoint directory or Hugging Face model_name_or_path. If 'format' is 'huggingface', this is the Hugging Face model_name_or_path; if 'format' is 'default', this is the checkpoint directory ('checkpoint_dir'). Must not be None. Works together with the 'original_model' parameter to load a local INC quantized model.

  • original_model (torch.nn.Module, TorchScript model (IPEX), or FX graph (PT2E), optional) – Original model before quantization. Required when 'format' is 'default' and the saved model is not a TorchScript model. Defaults to None.

  • format (str, optional) – 'default' for loading an INC quantized model; 'huggingface' for loading a Hugging Face WOQ causal language model. Defaults to "default".

  • device (str, optional) – 'cpu' or 'hpu'. Specifies the device the model will be loaded to. Currently only used for weight-only quantization. Defaults to "cpu".

  • kwargs (remaining dictionary of keyword arguments, optional) – Remaining keyword arguments for loading Hugging Face models; passed to the Hugging Face model's __init__ method, such as 'trust_remote_code' and 'revision'.

Returns:

The quantized model.
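Forwarding Hugging Face loading options through kwargs can be sketched as follows (hedged: the helper name is illustrative, the keyword values are placeholders, and the import is deferred so defining the function does not require neural_compressor):

```python
# Hedged sketch: load a WOQ quantized causal LM from the Hugging Face hub,
# forwarding extra keyword arguments (e.g. trust_remote_code, revision) to
# the underlying Hugging Face model's __init__.
def load_hf_woq(model_name_or_path, device="cpu", **hf_kwargs):
    """Load an upstreamed INC / WOQ quantized model in 'huggingface' format."""
    # Deferred import so this module stays importable without neural_compressor.
    from neural_compressor.torch.quantization import load
    return load(
        model_name_or_path=model_name_or_path,
        format="huggingface",
        device=device,
        **hf_kwargs,  # e.g. trust_remote_code=True, revision="main"
    )
```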