API Documentation
General
- ipex.optimize(model, dtype=None, optimizer=None, level='O1', inplace=False, conv_bn_folding=None, weights_prepack=None, replace_dropout_with_identity=None, optimize_lstm=None, split_master_weight_for_bf16=None, fuse_update_step=None, auto_kernel_selection=None)
Apply optimizations at the Python frontend to the given model (nn.Module) and, optionally, to the given optimizer. If an optimizer is given, optimizations are applied for training. Otherwise, optimizations are applied for inference. Optimizations include conv+bn folding (for inference only), weight prepacking and so on.
Weight prepacking is a technique to accelerate oneDNN operators. To achieve better vectorization and cache reuse, oneDNN uses a specific memory layout called blocked layout. Although the calculation itself is fast with the blocked layout, it has drawbacks from a memory usage perspective. When running with the blocked layout, oneDNN splits one or several dimensions of the data into fixed-size blocks each time the operator is executed. More detailed information about oneDNN memory formats is available in the oneDNN manual. To reduce this overhead, data is converted to predefined block shapes before the oneDNN operator is executed. At runtime, if the data shape matches the requirements of the oneDNN operator, oneDNN does not perform a memory layout conversion and goes directly to the calculation. This methodology, called weight prepacking, avoids runtime weight data format conversion and thus increases performance.
- Parameters
model (torch.nn.Module) – User model to apply optimizations on.
dtype (torch.dtype) – Only works for torch.bfloat16. Model parameters will be cast to torch.bfloat16 if dtype is set to torch.bfloat16. The default value is None, meaning do nothing. Note: Data type conversion is only applied to nn.Conv2d, nn.Linear and nn.ConvTranspose2d for both training and inference cases. For inference mode, additional data type conversion is applied to the weights of nn.Embedding and nn.LSTM.
optimizer (torch.optim.Optimizer) – User optimizer to apply optimizations on, such as SGD. The default value is None, meaning inference case.
level (string) – "O0" or "O1". No optimizations are applied with "O0": the optimize function just returns the original model and optimizer. With "O1", the following optimizations are applied: conv+bn folding, weights prepack, dropout removal (inference model), master weight split and fused optimizer update step (training model). The optimization options can be further overridden by setting the following options explicitly. The default value is "O1".
inplace (bool) – Whether to perform inplace optimization. The default value is False.
conv_bn_folding (bool) – Whether to perform conv_bn folding. It only works for inference models. The default value is None. Explicitly setting this knob overwrites the configuration set by the level knob.
weights_prepack (bool) – Whether to perform weight prepack for convolution and linear to avoid oneDNN weights reorder. The default value is None. Explicitly setting this knob overwrites the configuration set by the level knob.
replace_dropout_with_identity (bool) – Whether to replace nn.Dropout with nn.Identity. If replaced, aten::dropout won’t be included in the JIT graph. This may provide more fusion opportunities on the graph. It only works for inference models. The default value is None. Explicitly setting this knob overwrites the configuration set by the level knob.
optimize_lstm (bool) – Whether to replace nn.LSTM with IPEX LSTM, which takes advantage of oneDNN kernels to get better performance. The default value is None. Explicitly setting this knob overwrites the configuration set by the level knob.
split_master_weight_for_bf16 (bool) – Whether to split the master weights update for BF16 training. This saves memory compared to the master weight update solution. The split master weights update methodology doesn’t support all optimizers. The default value is None. Explicitly setting this knob overwrites the configuration set by the level knob.
fuse_update_step (bool) – Whether to use a fused parameter update for training, which has better performance. It doesn’t support all optimizers. The default value is None. Explicitly setting this knob overwrites the configuration set by the level knob.
auto_kernel_selection (bool) – Different backends may perform differently with different dtypes/shapes. Intel® Extension for PyTorch* will try to optimize the kernel selection for better performance if this knob is set to True. There might be regressions at the current stage. The default value is None. Explicitly setting this knob overwrites the configuration set by the level knob.
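The way the level knob and the individual knobs combine can be pictured with a small sketch. This is not the library's implementation: `resolve_knobs` and the two preset tables are hypothetical names, the tables contain only the optimizations that the text above attributes to "O1", and the sketch assumes that weights prepack applies to both inference and training. It also does not model the rule that inference-only knobs have no effect on a training model.

```python
from typing import Dict, Optional

# Illustrative presets for level "O1", taken from the parameter text above.
O1_INFERENCE = {"conv_bn_folding": True, "weights_prepack": True,
                "replace_dropout_with_identity": True}
O1_TRAINING = {"weights_prepack": True, "split_master_weight_for_bf16": True,
               "fuse_update_step": True}


def resolve_knobs(level: str, training: bool,
                  **overrides: Optional[bool]) -> Dict[str, bool]:
    """Hypothetical helper: combine a level with explicit per-knob overrides."""
    if level not in ("O0", "O1"):
        raise ValueError(f"unknown level: {level!r}")
    names = set(O1_INFERENCE) | set(O1_TRAINING)
    unknown = set(overrides) - names
    if unknown:
        raise ValueError(f"unknown knobs: {sorted(unknown)}")
    preset = O1_TRAINING if training else O1_INFERENCE
    resolved = {}
    for name in sorted(names):
        # "O0" enables nothing; "O1" enables the preset for the current mode.
        value = level == "O1" and preset.get(name, False)
        # A knob left at None defers to the level; True/False overrides it.
        if overrides.get(name) is not None:
            value = overrides[name]
        resolved[name] = bool(value)
    return resolved
```

For instance, `resolve_knobs("O1", training=False, conv_bn_folding=False)` keeps weights prepack and dropout removal enabled but turns conv+bn folding off, which mirrors "explicitly setting this knob overwrites the configuration set by the level knob".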
- Returns
Model and optimizer (if given) modified according to the level knob or other user settings. conv+bn folding may take place and dropout may be replaced by identity. In inference scenarios, convolution, linear and lstm will be replaced with the optimized counterparts in Intel® Extension for PyTorch* (weight prepack for convolution and linear) for good performance. In bfloat16 scenarios, parameters of convolution and linear will be cast to bfloat16 dtype.
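The benefit of weight prepacking, as described for ipex.optimize, comes from doing the layout conversion once instead of on every execution. The following toy model illustrates only that idea. The row blocking used here is a stand-in and is not oneDNN's real blocked format, and every name in it (`to_blocked`, `ReorderEveryCall`, `Prepacked`, `BLOCK`) is hypothetical.

```python
from typing import List

Matrix = List[List[float]]
BLOCK = 2  # toy block size; real oneDNN block shapes depend on the hardware


def to_blocked(weight: Matrix, counter: List[int]) -> List[Matrix]:
    """Split the rows of `weight` into fixed-size blocks (a stand-in reorder)."""
    counter[0] += 1  # record that a layout conversion happened
    return [weight[i:i + BLOCK] for i in range(0, len(weight), BLOCK)]


def matvec_blocked(blocks: List[Matrix], x: List[float]) -> List[float]:
    """Multiply block by block; equals a plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for block in blocks for row in block]


class ReorderEveryCall:
    """Keeps the plain layout and converts it on each execution."""

    def __init__(self, weight: Matrix, counter: List[int]):
        self.weight, self.counter = weight, counter

    def __call__(self, x: List[float]) -> List[float]:
        return matvec_blocked(to_blocked(self.weight, self.counter), x)


class Prepacked:
    """Converts the weight once, ahead of time, and reuses the blocked copy."""

    def __init__(self, weight: Matrix, counter: List[int]):
        self.blocks = to_blocked(weight, counter)

    def __call__(self, x: List[float]) -> List[float]:
        return matvec_blocked(self.blocks, x)


weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
naive_count, packed_count = [0], [0]
naive = ReorderEveryCall(weight, naive_count)
packed = Prepacked(weight, packed_count)
for _ in range(3):
    assert naive([1.0, 1.0]) == packed([1.0, 1.0])
print(naive_count[0], packed_count[0])  # 3 1
```

Both variants produce the same result; the prepacked one simply pays the conversion cost a single time.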
Warning
Please invoke the optimize function before invoking DDP in distributed training scenarios. The optimize function deep-copies the original model. If DDP is invoked before the optimize function, DDP is applied to the original model rather than the one returned from the optimize function. In this case, some operators in DDP, like allreduce, will not be invoked and thus may cause unpredictable accuracy loss.
Examples
>>> # bfloat16 inference case.
>>> model = ...
>>> model.eval()
>>> optimized_model = ipex.optimize(model, dtype=torch.bfloat16)
>>> # running evaluation step.
>>> # bfloat16 training case.
>>> optimizer = ...
>>> model.train()
>>> optimized_model, optimized_optimizer = ipex.optimize(model, dtype=torch.bfloat16, optimizer=optimizer)
>>> # running training step.
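The ordering required by the warning on optimize and DDP might look like the sketch below. It is a minimal illustration, assuming that `torch.distributed.init_process_group(...)` has already been called and that the model and optimizer are supported for bfloat16 training; `prepare_for_distributed_training` is a hypothetical helper name. The imports are guarded so the snippet can still be loaded on a machine without the packages installed.

```python
try:
    import torch
    import intel_extension_for_pytorch as ipex
    from torch.nn.parallel import DistributedDataParallel as DDP
except ImportError:  # keep the sketch importable without these packages
    torch = ipex = DDP = None


def prepare_for_distributed_training(model, optimizer):
    """Optimize first, then wrap the returned model in DDP.

    Assumes the default process group has already been initialized.
    """
    model.train()
    model, optimizer = ipex.optimize(model, dtype=torch.bfloat16, optimizer=optimizer)
    # DDP must wrap the model returned by optimize, not the original one,
    # because optimize works on a copy unless inplace=True is given.
    ddp_model = DDP(model)
    return ddp_model, optimizer
```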
- ipex.enable_onednn_fusion(enabled)
Enables or disables oneDNN fusion functionality. If enabled, oneDNN operators will be fused at runtime when intel_extension_for_pytorch is imported.
- Parameters
enabled (bool) – Whether to enable oneDNN fusion functionality or not. The default value is True.
Examples
>>> import intel_extension_for_pytorch as ipex
>>> # to enable the oneDNN fusion
>>> ipex.enable_onednn_fusion(True)
>>> # to disable the oneDNN fusion
>>> ipex.enable_onednn_fusion(False)
- class ipex.verbose(level)
On-demand oneDNN verbosing functionality
To make it easier to debug performance issues, oneDNN can dump verbose messages containing information such as kernel size, input data size and execution duration while executing a kernel. The verbosing functionality can be enabled via an environment variable named DNNL_VERBOSE. However, this approach dumps messages for all steps, which produces a large amount of output. Moreover, when investigating performance issues, verbose messages for a single iteration are generally enough.
This on-demand verbosing functionality makes it possible to control the scope of verbose message dumping. In the following example, verbose messages will be dumped for the second inference only.
import intel_extension_for_pytorch as ipex
model(data)
with ipex.verbose(ipex.VERBOSE_ON):
    model(data)
- Parameters
level – Verbose level
VERBOSE_OFF: Disable verbosing
VERBOSE_ON: Enable verbosing
VERBOSE_ON_CREATION: Enable verbosing, including oneDNN kernel creation
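Scoping the verbose output to one iteration of a loop could be sketched as follows. `run_with_verbose_at` is a hypothetical helper, not part of the extension. It uses only `ipex.verbose` and `ipex.VERBOSE_ON` as documented above and falls back to a no-op scope when the extension is not installed.

```python
from contextlib import nullcontext

try:
    import intel_extension_for_pytorch as ipex
except ImportError:  # fall back to a no-op scope when the extension is absent
    ipex = None


def run_with_verbose_at(model, batches, target_iter):
    """Run `model` on every batch, dumping oneDNN verbose output for one iteration."""
    outputs = []
    for i, batch in enumerate(batches):
        if ipex is not None and i == target_iter:
            scope = ipex.verbose(ipex.VERBOSE_ON)
        else:
            scope = nullcontext()
        with scope:
            outputs.append(model(batch))
    return outputs
```

Choosing an iteration after warm-up (for example `target_iter=1`) matches the example above, where only the second inference is dumped.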
Quantization
- ipex.quantization.QuantConf(configure_file=None, qscheme=torch.per_tensor_affine)
Configuration settings for the INT8 quantization flow.
- Parameters
configure_file (string) – The INT8 configuration file (.json file) to be loaded or saved.
qscheme (torch.qscheme) – Quantization scheme to be used for activations.
Available configurations in the configure_file are:
id (int): The sequence number of the quantized op in the model's running flow. Note: only a limited set of ops is recorded, such as convolution, linear or other ops.
name (string): The quantized op's name.
algorithm (string): Observation method for activation tensors during calibration. Only min-max is supported now; more methods will be supported in the future.
weight_granularity (Qscheme): Qscheme of the weight quantizer for convolution and linear. It can be per_channel or per_tensor. Users can set it manually before loading an existing configure file. The default value is uint8.
input_scales: Scales for inputs.
input_zero_points: Zero points for inputs.
output_scales: Scales for outputs.
output_zero_points: Zero points for outputs.
weight_scales: Scales for weights.
input_quantized_dtypes: Quantized dtypes for inputs. It can be uint8 or int8. Users can set it manually before loading an existing configure file. The default value is uint8.
output_quantized_dtypes: Quantized dtypes for outputs. It can be uint8 or int8. Users can set it manually before loading an existing configure file. The default value is uint8.
inputs_quantized: Whether the inputs need to be quantized. It can be true or false. Users can set it manually before loading an existing configure file.
outputs_quantized: Whether the outputs need to be quantized. It can be true or false. Users can set it manually before loading an existing configure file.
inputs_flow: Where the inputs come from. Because only a limited set of ops is recorded, adjacent ops can be identified by comparing one op's inputs flow with the outputs flow of the others.
outputs_flow: Outputs flag for the current op, which can be used to check which ops are adjacent.
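The adjacency check that inputs_flow and outputs_flow make possible can be sketched in a few lines. This is only an illustration of the comparison described above. `find_adjacent_ops` is a hypothetical helper, and the second entry (the relu op and its flow strings) is invented for the example.

```python
from typing import Dict, List, Tuple


def find_adjacent_ops(entries: List[Dict]) -> List[Tuple[str, str]]:
    """Return (producer, consumer) name pairs whose output/input flows match."""
    pairs = []
    for producer in entries:
        produced = set(producer.get("outputs_flow", []))
        for consumer in entries:
            if consumer is producer:
                continue
            # Two ops are adjacent when one op's output feeds the other's input.
            if produced & set(consumer.get("inputs_flow", [])):
                pairs.append((producer["name"], consumer["name"]))
    return pairs


# Hypothetical two-op configuration; only the fields used here are shown.
entries = [
    {"id": 0, "name": "conv2d",
     "inputs_flow": ["conv2d.0.input"], "outputs_flow": ["conv2d.0.output"]},
    {"id": 1, "name": "relu",
     "inputs_flow": ["conv2d.0.output"], "outputs_flow": ["relu.0.output"]},
]
print(find_adjacent_ops(entries))  # [('conv2d', 'relu')]
```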
Warning
qscheme can only take one of the following options:
torch.per_tensor_affine
torch.per_tensor_symmetric
Note
The loaded or saved JSON file has content like:
[
  {
    "id": 0,
    "name": "conv2d",
    "algorithm": "min_max",
    "weight_granularity": "per_channel",
    "input_scales": [0.01865844801068306],
    "input_zero_points": [114],
    "output_scales": [0.05267734453082085],
    "output_zero_points": [132],
    "weight_scales": [[0.0006843071314506233, 0.0005326663958840072, 0.00016389577649533749]],
    "input_quantized_dtypes": ["uint8"],
    "output_quantized_dtypes": ["uint8"],
    "inputs_quantized": [true],
    "outputs_quantized": [false],
    "inputs_flow": ["conv2d.0.input"],
    "outputs_flow": ["conv2d.0.output"]
  }
]
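To make the scale and zero-point fields concrete, the sketch below applies the usual per-tensor affine mapping to the input_scales and input_zero_points values from the sample file. It assumes the standard uint8 formula q = clamp(round(x / scale) + zero_point, 0, 255). The helper names are hypothetical and the code says nothing about how the extension performs the computation internally.

```python
def quantize_uint8(x: float, scale: float, zero_point: int) -> int:
    """Map a real value to uint8 with the standard affine formula."""
    q = round(x / scale) + zero_point
    return max(0, min(255, q))  # clamp to the uint8 range


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map a quantized value back to an approximate real value."""
    return (q - zero_point) * scale


scale, zero_point = 0.01865844801068306, 114  # values from the sample file
q = quantize_uint8(1.0, scale, zero_point)
print(q)  # 168
print(dequantize(q, scale, zero_point))  # close to 1.0, within scale / 2
```

Values far outside the representable range saturate: with these parameters, anything above roughly 2.63 maps to 255 and anything below roughly -2.13 maps to 0.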
- class ipex.quantization.calibrate(conf, default_recipe=True)
Enable a quantization calibration scope that collects the scales and zero points of the ops to be quantized, such as convolution and linear.
- Parameters
conf (quantization.QuantConf) – Quantization settings.
default_recipe (bool) – Whether to produce a default quantization setting, which applies a post-process to remove redundant quantizers. It can be True or False. For example, for conv+relu, quantizers are inserted before and after conv and relu, so the data flow looks like: quant->dequant->conv->quant->dequant->quant->dequant->relu->quant->dequant. If default_recipe is True, the data flow is converted to: quant->dequant->conv->relu->quant->dequant, that is, quantizers are not inserted after conv’s output or before relu’s input. The default value is True.
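The effect of default_recipe on the conv+relu data flow can be reproduced with a small list transformation. This is a toy model of the post-process described above, not the extension's algorithm. `apply_default_recipe` is a hypothetical name and it handles only the conv followed by relu pattern from the example.

```python
from typing import List


def apply_default_recipe(flow: List[str]) -> List[str]:
    """Drop quant/dequant pairs that sit between a conv and the relu after it."""
    result: List[str] = []
    i = 0
    while i < len(flow):
        result.append(flow[i])
        i += 1
        if result[-1] == "conv":
            # Look past any quant->dequant pairs directly after the conv...
            j = i
            while flow[j:j + 2] == ["quant", "dequant"]:
                j += 2
            # ...and skip them only if they lead straight into a relu.
            if j < len(flow) and flow[j] == "relu":
                i = j
    return result


flow = ["quant", "dequant", "conv", "quant", "dequant",
        "quant", "dequant", "relu", "quant", "dequant"]
print("->".join(apply_default_recipe(flow)))
# quant->dequant->conv->relu->quant->dequant
```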
- ipex.quantization.convert(model, conf, inputs)
Convert an FP32 torch.nn.Module model to a quantized JIT ScriptModule according to the given quantization recipes in the quantization configuration conf.
The function will conduct a JIT trace with the given inputs. It will fail if the given model doesn’t support JIT trace.
- Parameters
model (torch.nn.Module) – The FP32 model to be converted.
conf (quantization.QuantConf) – Quantization settings.
inputs (tuple or torch.Tensor) – A tuple of example inputs that are used for JIT trace of the given model.
- Returns
torch.jit.ScriptModule
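Putting QuantConf, calibrate and convert together, a calibration-then-conversion flow might look like the sketch below. It uses only the three entry points documented in this section. `quantize_model`, `calibration_batches` and `example_input` are hypothetical names, and the model is assumed to support JIT tracing. The imports are guarded so the snippet can be loaded without the packages installed.

```python
try:
    import torch
    import intel_extension_for_pytorch as ipex
except ImportError:  # keep the sketch importable without these packages
    torch = ipex = None


def quantize_model(model, calibration_batches, example_input):
    """Calibrate an FP32 model, then convert it to a quantized ScriptModule."""
    model.eval()
    conf = ipex.quantization.QuantConf(qscheme=torch.per_tensor_affine)
    with torch.no_grad():
        for batch in calibration_batches:
            # Each forward pass inside the scope collects scales and zero points.
            with ipex.quantization.calibrate(conf):
                model(batch)
        # convert() runs a JIT trace of the model with the example input.
        return ipex.quantization.convert(model, conf, example_input)
```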
CPU Runtime
- ipex.cpu.runtime.is_runtime_ext_enabled()
Helper function to check whether runtime extension is enabled or not.
- Parameters
None
- Returns
Whether the runtime extension is enabled or not.
- Return type
bool
- class ipex.cpu.runtime.CPUPool(core_ids: Optional[list] = None, node_id: Optional[int] = None)
An abstraction of a pool of CPU cores used for intra-op parallelism.
- Parameters
core_ids (list) – A list of CPU core ids used for intra-op parallelism.
node_id (int) – A NUMA node id; all CPU cores on that NUMA node are used. node_id doesn’t work if core_ids is set.
- Returns
Generated ipex.cpu.runtime.CPUPool object.
- Return type
ipex.cpu.runtime.CPUPool
- class ipex.cpu.runtime.pin(cpu_pool: ipex.cpu.runtime.cpupool.CPUPool)
Apply the given CPU pool to the master thread that runs the scoped code region or the decorated function/method definition.
- Parameters
cpu_pool (ipex.cpu.runtime.CPUPool) – An ipex.cpu.runtime.CPUPool object that contains all CPU cores used by the designated operations.
- Returns
Generated ipex.cpu.runtime.pin object which can be used as a with context or a function decorator.
- Return type
ipex.cpu.runtime.pin
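Both usage forms of pin, as a with context and as a function decorator, can be sketched as follows. The helper names `run_pinned` and `make_pinned_fn` are hypothetical. The code assumes the runtime extension is available on the machine, which can be checked with ipex.cpu.runtime.is_runtime_ext_enabled(), and the imports are guarded so the snippet can be loaded without the package.

```python
try:
    import intel_extension_for_pytorch as ipex
except ImportError:  # keep the sketch importable without the package
    ipex = None


def run_pinned(model, x, node_id=0):
    """Run `model(x)` with the calling thread pinned to the cores of one NUMA node."""
    cpu_pool = ipex.cpu.runtime.CPUPool(node_id=node_id)
    with ipex.cpu.runtime.pin(cpu_pool):
        return model(x)


def make_pinned_fn(model, core_ids):
    """Return a function that always runs `model` on the given cores."""
    cpu_pool = ipex.cpu.runtime.CPUPool(core_ids=core_ids)

    @ipex.cpu.runtime.pin(cpu_pool)
    def forward(x):
        return model(x)

    return forward
```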
- class ipex.cpu.runtime.MultiStreamModule(model, num_streams: int, cpu_pool: ipex.cpu.runtime.cpupool.CPUPool)
MultiStreamModule supports inference with multi-stream throughput mode.
If the number of cores inside cpu_pool is divisible by num_streams, the cores will be allocated equally to each stream. If the number of cores inside cpu_pool is not divisible by num_streams, with remainder N, one extra core will be allocated to each of the first N streams.
- Parameters
model (torch.jit.ScriptModule or torch.nn.Module) – The input model.
num_streams (int) – Number of instances.
cpu_pool (ipex.cpu.runtime.CPUPool) – An ipex.cpu.runtime.CPUPool object that contains all CPU cores used to run multi-stream inference.
- Returns
Generated ipex.cpu.runtime.MultiStreamModule object.
- Return type
ipex.cpu.runtime.MultiStreamModule
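The core allocation rule for MultiStreamModule (equal shares, with one extra core for each of the first N streams when there is a remainder N) can be written out as a short function. This is a sketch of the rule only. `split_cores` is a hypothetical name, assigning contiguous core ids to each stream is an assumption, and so is rejecting a stream count larger than the number of cores.

```python
from typing import List


def split_cores(core_ids: List[int], num_streams: int) -> List[List[int]]:
    """Partition `core_ids` across streams following the documented rule."""
    if num_streams <= 0 or num_streams > len(core_ids):
        raise ValueError("num_streams must be between 1 and the number of cores")
    base, remainder = divmod(len(core_ids), num_streams)
    streams, start = [], 0
    for stream in range(num_streams):
        # The first `remainder` streams each receive one extra core.
        size = base + (1 if stream < remainder else 0)
        streams.append(core_ids[start:start + size])
        start += size
    return streams


print(split_cores(list(range(8)), 4))   # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(split_cores(list(range(10)), 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```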
- class ipex.cpu.runtime.Task(module, cpu_pool: ipex.cpu.runtime.cpupool.CPUPool)
An abstraction of a computation that is based on a PyTorch module and scheduled asynchronously.
- Parameters
module (torch.jit.ScriptModule or torch.nn.Module) – The input module.
cpu_pool (ipex.cpu.runtime.CPUPool) – An ipex.cpu.runtime.CPUPool object that contains all CPU cores used to run the Task asynchronously.
- Returns
Generated ipex.cpu.runtime.Task object.
- Return type
ipex.cpu.runtime.Task
- ipex.cpu.runtime.get_core_list_of_node_id(node_id)
Helper function to get the CPU core ids of the input NUMA node.
- Parameters
node_id (int) – Input NUMA node id.
- Returns
List of CPU core ids on this NUMA node.
- Return type
list