Codeless Optimization (Prototype)
=================================

This feature aims to deliver inference performance benefits of Intel® Extension for PyTorch\* without requiring changes to your Python scripts, providing an out-of-box (OOB) experience to get started with Intel® Extension for PyTorch\* easily. Users who already know how to apply optimizations with Intel® Extension for PyTorch\* APIs are not the target audience of this feature, due to the inevitable overhead and limitations mentioned below.

## Motivation

A typical inference use case, as in [transformers](https://github.com/huggingface/transformers/blob/v4.21.1/src/transformers/trainer.py#L3187), can be simplified to the code snippet below:

```
import torch
model = Model().eval()
with torch.no_grad():
    for input in dataloader():
        model(**input)
```

To utilize Intel® Extension for PyTorch\* optimizations for optimum performance, several lines of code changes are required/recommended:

```
import torch
import intel_extension_for_pytorch as ipex         # clause added
model = Model().eval()
model = ipex.optimize(model)                       # clause added
with torch.no_grad():
    with torch.cpu.amp.autocast():                 # clause added for running with BFloat16 (Optional)
        input = ...                                # clause added for TorchScript (Optional, but recommended)
        model = torch.jit.trace(model, input)      # clause added for TorchScript (Optional, but recommended)
        model = torch.jit.freeze(model)            # clause added for TorchScript (Optional, but recommended)
        for input in dataloader():
            model(**input)
```

With this feature, the manual code changes above are no longer required. Intel® Extension for PyTorch\* optimizations are applied automatically during execution via monkey patching (a rough sketch of the idea follows the list below):

* Automatically import the `intel_extension_for_pytorch` package: this applies Intel® Extension for PyTorch\* optimizations such as `torch.embedding_bag` and `torch.cpu.amp.autocast`, and registers the Intel® Extension for PyTorch\* JIT fusion pass, which benefits graph-mode inference performance.
* Automatically apply the `ipex.optimize()` function. Only features enabled by default parameter values are supported, such as:
  * Auto generation of the FX or JIT graph.
  * Auto conversion to the channels-last memory format.
  * Conv-BN folding.
  * Weight prepacking.
  * Replacing dropout with identity.
  * LSTM optimization.
* Automatically apply `torch.cpu.amp.autocast` with the BFloat16 data type for inference.
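To make the mechanism concrete, here is a minimal, hypothetical sketch of what such a monkey patch could look like. It is not the actual implementation: the helper names (`_patched_call`, `_auto_ipex_module`, the call-depth counter) are illustrative assumptions, and it only covers lazily applying `ipex.optimize()` with default arguments to the top-level module on its first inference call.

```
import torch
import intel_extension_for_pytorch as ipex

# Hypothetical sketch only -- the names and bookkeeping below are illustrative,
# not the launcher's actual implementation.
_original_call = torch.nn.Module.__call__
_call_depth = 0  # > 0 while inside a model, so submodules are left untouched


def _patched_call(self, *args, **kwargs):
    global _call_depth
    module = self
    if _call_depth == 0 and not self.training:
        # Lazily optimize the top-level module on its first inference call and
        # cache the optimized copy on the original object (bypassing
        # nn.Module.__setattr__ so it is not registered as a submodule).
        if "_auto_ipex_module" not in self.__dict__:
            object.__setattr__(self, "_auto_ipex_module", ipex.optimize(self))
        module = self.__dict__["_auto_ipex_module"]
    _call_depth += 1
    try:
        return _original_call(module, *args, **kwargs)
    finally:
        _call_depth -= 1


torch.nn.Module.__call__ = _patched_call
```

Because the hook lives in `torch.nn.Module.__call__`, only modules invoked through `__call__` can be intercepted, which is the root cause of the unsupported case described further below.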
## Example Usage with HuggingFace

Let's take the [QA case](https://github.com/huggingface/transformers/tree/main/examples/pytorch/question-answering) in HuggingFace as an example.

### The original command with the ipex launcher

Here is the command to run with [`ipexrun`](../performance_tuning/launch_script.md):

```
clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/
```

### Command to apply ipex optimization for FP32

Add `--auto-ipex`:

```
clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 --auto-ipex run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/
```

### Command to apply ipex optimization for BF16

Add `--auto-ipex --dtype bfloat16`:

```
clear && ipexrun --memory-allocator default --ninstances 2 --ncores-per-instance 28 --auto-ipex --dtype bfloat16 run_qa.py --model_name_or_path bert-base-uncased --dataset_name squad --do_eval --per_device_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir /tmp/debug_squad/
```

## Use Cases Not Supported

### Module calls its forward method explicitly instead of using the `__call__` attr

```
import torch

class DummyModule(torch.nn.Module):
    def __init__(self):
        super(DummyModule, self).__init__()
        self.conv = torch.nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3))
        self.bn = torch.nn.BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)

    def forward(self, x):
        return self.bn(self.conv(x))

    def customized_forward(self, x):
        return self.bn(self.conv(x))

input = torch.randn(1, 3, 224, 224)

# Method 1 will succeed: the module is invoked through __call__, which can be hooked
DummyModule()(input)

# Method 2 will fail to apply ipex.optimize on the top-level model: __call__ is bypassed
DummyModule().customized_forward(input)
```

If a model invokes its forward method (or any custom method) explicitly instead of going through the `__call__` attr, we are unable to hook its execution. As a result, we cannot automatically apply the optimizations to this `DummyModule()`.

### Already using `ipex.optimize`

Scripts that already invoke `ipex.optimize` are not targeted by this feature. The behaviour of invoking `ipex.optimize` repeatedly is undefined; to avoid it, the second invocation of `ipex.optimize` on the same module fails with an error message.

### Already using Jit Trace

The JIT trace case (example code below) is not planned for support in the first stage:

```
import torch
model = Model().eval()
traced_model = torch.jit.trace(model, x).eval()  # x: an example input tensor
traced_model = torch.jit.freeze(traced_model)
with torch.no_grad():
    for input in dataloader():
        traced_model(input)
```

This is for two reasons:

* Auto graph mode support has already been included in `ipex.optimize` with the graph-first API in 1.13 (see the sketch at the end of this section).
* Extra launch parameters and monkey patches would be needed to support the case above.

We will focus on the feasibility of the first use case in TorchVision and HuggingFace workloads.
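For reference on the first reason above, the snippet below is a minimal sketch of that graph-first path. It assumes the `graph_mode` flag added to `ipex.optimize()` in 1.13 and uses a toy model purely for illustration.

```
import torch
import intel_extension_for_pytorch as ipex

# Sketch of the graph-first path: let ipex.optimize() handle graph capture
# instead of calling torch.jit.trace/freeze manually. Assumes the graph_mode
# flag of ipex.optimize() available since 1.13; the model is a toy example.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    torch.nn.BatchNorm2d(64),
    torch.nn.ReLU(),
).eval()

model = ipex.optimize(model, graph_mode=True)  # graph capture handled internally

with torch.no_grad():
    for _ in range(10):  # stand-in for a real dataloader
        model(torch.randn(1, 3, 224, 224))
```

With this path, graph capture is handled inside `ipex.optimize`, so manual `torch.jit.trace`/`torch.jit.freeze` calls are unnecessary.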