Quick Start
The following instructions assume you have installed the Intel® Extension for PyTorch*. For installation instructions, refer to Installation.
To start using the Intel® Extension for PyTorch* in your code, you need to make the following changes:

1. Import the extension with import intel_extension_for_pytorch as ipex.
2. Invoke the optimize() function to apply optimizations.
3. Convert the eager mode model to a graph mode model:
   - For TorchScript, invoke torch.jit.trace() and torch.jit.freeze().
   - For TorchDynamo, invoke torch.compile(model, backend="ipex") (Beta feature).

Important: It is highly recommended to import intel_extension_for_pytorch right after import torch, prior to importing other packages.
The example below demonstrates how to use the Intel® Extension for PyTorch* with TorchScript:
import torch
############## import ipex ###############
import intel_extension_for_pytorch as ipex
##########################################

model = Model()
model.eval()
data = ...

############## TorchScript ###############
model = ipex.optimize(model, dtype=torch.bfloat16)
with torch.no_grad(), torch.cpu.amp.autocast():
    model = torch.jit.trace(model, data)
    model = torch.jit.freeze(model)
    model(data)
##########################################
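Once traced and frozen, the TorchScript module is self-contained and can be serialized with torch.jit.save and reloaded with torch.jit.load for deployment, without the original Python class definition. Below is a minimal sketch using a toy model (the TinyModel class and the traced_model.pt filename are illustrative, not part of the guide above):

```python
import os
import tempfile

import torch


class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)


model = TinyModel().eval()
data = torch.rand(1, 4)

with torch.no_grad():
    # Trace with example inputs, then freeze for inference.
    traced = torch.jit.trace(model, data)
    traced = torch.jit.freeze(traced)

# Save the self-contained TorchScript archive.
path = os.path.join(tempfile.mkdtemp(), "traced_model.pt")
torch.jit.save(traced, path)

# Reload it later; no TinyModel class definition is needed.
reloaded = torch.jit.load(path)
with torch.no_grad():
    out = reloaded(data)
print(out.shape)  # torch.Size([1, 2])
```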
The example below demonstrates how to use the Intel® Extension for PyTorch* with TorchDynamo:
import torch
############## import ipex ###############
import intel_extension_for_pytorch as ipex
##########################################

model = Model()
model.eval()
data = ...

############## TorchDynamo ###############
# Weight prepacking is disabled because it is incompatible with the
# TorchDynamo path.
model = ipex.optimize(model, weights_prepack=False)
model = torch.compile(model, backend="ipex")
with torch.no_grad():
    model(data)
##########################################
More examples, including training and usage of low precision data types, are available in the Examples section.
In Cheat Sheet, you can find more commands that can help you start using the Intel® Extension for PyTorch*.
LLM Quick Start
ipex.llm.optimize is used for Large Language Models (LLMs).
import torch
#################### code changes ####################
import intel_extension_for_pytorch as ipex
######################################################
import argparse
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
)

# args
parser = argparse.ArgumentParser("Generation script (fp32/bf16 path)", add_help=False)
parser.add_argument(
    "--dtype",
    type=str,
    choices=["float32", "bfloat16"],
    default="float32",
    help="choose the weight dtype and whether to enable auto mixed precision or not",
)
parser.add_argument(
    "--max-new-tokens", default=32, type=int, help="output max new tokens"
)
parser.add_argument(
    "--prompt", default="What are we having for dinner?", type=str, help="input prompt"
)
parser.add_argument("--greedy", action="store_true")
parser.add_argument("--batch-size", default=1, type=int, help="batch size")
args = parser.parse_args()
print(args)

# dtype
amp_enabled = args.dtype != "float32"
amp_dtype = getattr(torch, args.dtype)

# load model
model_id = MODEL_ID
config = AutoConfig.from_pretrained(
    model_id, torchscript=True, trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=amp_dtype,
    config=config,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True,
)
model = model.eval()
model = model.to(memory_format=torch.channels_last)

# Intel(R) Extension for PyTorch*
#################### code changes ####################  # noqa F401
model = ipex.llm.optimize(
    model,
    dtype=amp_dtype,
    inplace=True,
    deployment_mode=True,
)
######################################################  # noqa F401

# generate args
num_beams = 1 if args.greedy else 4
generate_kwargs = dict(do_sample=False, temperature=0.9, num_beams=num_beams)

# input prompt
prompt = args.prompt
input_size = tokenizer(prompt, return_tensors="pt").input_ids.size(dim=1)
print("---- Prompt size:", input_size)
prompt = [prompt] * args.batch_size

# inference
with torch.inference_mode(), torch.cpu.amp.autocast(enabled=amp_enabled):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    gen_ids = model.generate(
        input_ids,
        max_new_tokens=args.max_new_tokens,
        **generate_kwargs,
    )
    gen_text = tokenizer.batch_decode(gen_ids, skip_special_tokens=True)
    input_tokens_lengths = [x.shape[0] for x in input_ids]
    output_tokens_lengths = [x.shape[0] for x in gen_ids]
    total_new_tokens = [
        o - i for i, o in zip(input_tokens_lengths, output_tokens_lengths)
    ]
    print(gen_text, total_new_tokens, flush=True)
More LLM examples, including usage of low precision data types, are available in the LLM Examples section.