Codeless Optimization (Prototype)

This feature aims to deliver the inference performance benefits of Intel® Extension for PyTorch* without requiring changes to the code in your Python scripts, improving the out-of-box (OOB) experience of getting started with Intel® Extension for PyTorch*. Users who already know how to apply optimizations with the Intel® Extension for PyTorch* APIs are not the target audience of this feature, due to the inevitable overhead and the limitations mentioned below.

Motivation

A typical inference use case, such as running a transformer model, can be simplified to the code snippet below:

import torch
model = Model().eval()
with torch.no_grad():
    for input in dataloader():
        model(**input)

To utilize the optimizations of Intel® Extension for PyTorch* for optimal performance, several lines of code changes are required or recommended:

import torch
import intel_extension_for_pytorch as ipex # clause added
model = Model().eval()
model = ipex.optimize(model)              # clause added
with torch.no_grad():
  with torch.cpu.amp.autocast():          # clause added for running with BFloat16 (Optional)
    input = ...                           # clause added for TorchScript (Optional, but recommended) 
    model = torch.jit.trace(model, input) # clause added for TorchScript (Optional, but recommended)
    model = torch.jit.freeze(model)       # clause added for TorchScript (Optional, but recommended)
    for input in dataloader():
      model(**input)

With this feature, the manual code changes above are no longer required. Intel® Extension for PyTorch* optimizations are applied automatically during execution via monkey patching; a minimal sketch of this mechanism is shown after the list below.

  • Automatically import the intel_extension_for_pytorch package: this applies Intel® Extension for PyTorch* optimizations such as those for torch.embedding_bag and torch.cpu.amp.autocast. It also registers the Intel® Extension for PyTorch* JIT fusion pass, which benefits graph-mode inference performance.

  • Automatically apply the ipex.optimize() function. Only features enabled by the default parameter values are supported, such as:

    • Auto-generate an FX or JIT graph.

    • Auto channels-last conversion.

    • Conv-BN folding.

    • Weight prepacking.

    • Replace dropout with identity.

    • Optimize LSTM.

  • Automatically apply torch.cpu.amp.autocast with BFloat16 data type for inference.
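
For illustration only, below is a minimal sketch of how such a monkey patch could hook module execution. This is not the actual implementation of the feature; _patched_call and its bookkeeping variables are hypothetical names, and only the default parameter values of ipex.optimize are assumed.

import torch
import intel_extension_for_pytorch as ipex

# Hypothetical sketch: wrap torch.nn.Module.__call__ so that the outermost
# module in eval mode is passed through ipex.optimize() with default
# parameters once, right before its first forward execution.
_original_call = torch.nn.Module.__call__
_call_depth = 0          # > 0 means we are inside a submodule call
_optimized_ids = set()   # modules that have already been optimized

def _patched_call(self, *args, **kwargs):
    global _call_depth
    if _call_depth == 0 and not self.training and id(self) not in _optimized_ids:
        ipex.optimize(self, inplace=True)   # default parameter values only
        _optimized_ids.add(id(self))
    _call_depth += 1
    try:
        return _original_call(self, *args, **kwargs)
    finally:
        _call_depth -= 1

torch.nn.Module.__call__ = _patched_call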

Example Usage with HuggingFace

Let’s take the QA case in HuggingFace as an example.

The original command with ipex launch

Here is the command to run with ipexrun.

clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/

Command to apply ipex optimization for FP32

Added --auto-ipex

clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 --auto-ipex run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/

Command to apply ipex optimization for BF16

Added --auto-ipex --dtype bfloat16

clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 --auto-ipex --dtype bfloat16 run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/

Use Cases not supported

Module uses forward method explicitly instead of the __call__ attr

import torch
class DummyModule(torch.nn.Module):
    def __init__(self,):
        super(DummyModule, self).__init__()
        self.input1 = torch.randn(1, 3, 224, 224)
        self.conv = torch.nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3))
        self.bn = torch.nn.BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

    def forward(self, x):
        return self.bn(self.conv(x))

    def customized_forward(self, x):
        return self.bn(self.conv(x))

input = torch.randn(1, 3, 224, 224)
# Method 1 succeeds: __call__ is hooked, so optimizations can be applied
DummyModule()(input)
# Method 2 fails: ipex.optimize cannot be auto-applied to the top-level module
DummyModule().customized_forward(input)

If a model calls its forward method explicitly instead of going through the __call__ attr, we are unable to hook its execution. As a result, we cannot automatically apply the optimizations to this DummyModule().

Already using ipex.optimize

Scripts that already invoke ipex.optimize are not targeted by this feature. The behaviour of invoking ipex.optimize repeatedly is undefined; the second invocation of ipex.optimize on the same module will fail with an error message to avoid this behaviour.
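
As an illustration, the sketch below (with Model as a placeholder, consistent with the earlier snippets) shows a script that already invokes ipex.optimize explicitly; such scripts should keep the manual invocation instead of relying on this feature:

import torch
import intel_extension_for_pytorch as ipex

model = Model().eval()         # Model is a placeholder for your own module
model = ipex.optimize(model)   # explicit invocation in the script: supported
# Invoking ipex.optimize again on the same module (which is effectively what
# the codeless path would do) fails with an error message instead of silently
# re-optimizing the module.
model = ipex.optimize(model)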

Already using Jit Trace

The JIT trace case (example code below) is not planned to be supported in the first stage:

import torch
model = Model().eval()
x = ...                                           # example input used for tracing
traced_model = torch.jit.trace(model, x).eval()
traced_model = torch.jit.freeze(traced_model)
with torch.no_grad():
    for input in dataloader():
        traced_model(input)

This is for two reasons:

  • Automatic graph mode support has already been included in ipex.optimize with the graph-first API in 1.13 (see the sketch after this list).

  • Extra launch parameters and monkey patches would be needed to support the above case. We will focus on the feasibility of the first use case in TorchVision and HuggingFace workloads.
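
As a minimal sketch of the first reason, assuming the experimental graph_mode flag of ipex.optimize introduced with 1.13 and with Model and dataloader as placeholders, the manually traced script above could instead be written as:

import torch
import intel_extension_for_pytorch as ipex

model = Model().eval()
# graph_mode=True asks ipex.optimize to capture the graph internally,
# replacing the manual torch.jit.trace / torch.jit.freeze calls above.
model = ipex.optimize(model, graph_mode=True)
with torch.no_grad():
    for input in dataloader():
        model(input)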