Features

Operator Optimization

Intel® Extension for TensorFlow* optimizes CPU operators and implements all GPU operators with the Intel® oneAPI DPC++ Compiler. Users get these operator optimization benefits by default, without any additional settings.

In addition, several customized operators are provided in the itex.ops namespace, extending the TensorFlow public APIs for better performance. Please refer to Customized OPs for details.
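
For example, a customized op can be used as a drop-in replacement for its stock TensorFlow counterpart. A minimal sketch, assuming itex.ops.gelu is available in the installed version:

```python
import tensorflow as tf
import intel_extension_for_tensorflow as itex

x = tf.random.normal([4, 8])

# Stock TensorFlow GELU activation.
y_tf = tf.nn.gelu(x)

# Customized GELU from the itex.ops namespace (see Customized OPs).
y_itex = itex.ops.gelu(x)
```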

Graph Optimization

Intel® Extension for TensorFlow* provides graph optimization that fuses specified op patterns, such as Conv2D+ReLU and Linear+ReLU, into a single new op for better performance. The benefit of these fusions is delivered to users in a transparent fashion.

Users get the graph optimization benefits by default, without any additional settings. Please refer to Graph Optimization for details.
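
Advanced users can still tune the fusion pass through the Python API. A minimal sketch, assuming the itex.ConfigProto/itex.GraphOptions interface and the remapper option described on the Python APIs page:

```python
import intel_extension_for_tensorflow as itex

# Explicitly keep the op-fusion pass (remapper) enabled; it is on by default.
graph_opts = itex.GraphOptions(remapper=itex.ON)
itex.set_config(itex.ConfigProto(graph_options=graph_opts))
```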

Advanced Auto Mixed Precision (AMP)

The low-precision data types bfloat16 and float16 are natively supported starting with the 3rd Generation Intel® Xeon® Scalable Processors (codenamed Cooper Lake) with the AVX-512 instruction set, and on the Intel® Data Center GPU, offering further boosted performance with lower memory consumption. Support for these lower-precision data types is fully enabled through Advanced Auto Mixed Precision (AMP) in Intel® Extension for TensorFlow*.

Please refer to Advanced Auto Mixed Precision for details.
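
A minimal sketch of enabling Advanced AMP from Python, assuming the itex.AutoMixedPrecisionOptions interface described on the Python APIs page (environment variables offer an equivalent path):

```python
import intel_extension_for_tensorflow as itex

# Run eligible ops in bfloat16; itex.FLOAT16 targets hardware that prefers fp16.
amp_opts = itex.AutoMixedPrecisionOptions()
amp_opts.data_type = itex.BFLOAT16

graph_opts = itex.GraphOptions(auto_mixed_precision_options=amp_opts)
graph_opts.auto_mixed_precision = itex.ON

itex.set_config(itex.ConfigProto(graph_options=graph_opts))
```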

Ease-of-use Python API

Generally, the default configuration of Intel® Extension for TensorFlow* delivers good performance without any code changes. For advanced users, Intel® Extension for TensorFlow* also provides simple frontend Python APIs and utilities to unlock further performance optimizations with minor code changes across different application scenarios. Typically, only two to three lines need to be added to the original code.

Please check the Python APIs page for details of the API functions and the Environment Variables page for environment settings.
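
For example, a single added line can transparently replace supported stock operations with their customized counterparts. A minimal sketch, assuming itex.experimental_ops_override() as described on the Python APIs page:

```python
import tensorflow as tf
import intel_extension_for_tensorflow as itex

# One extra line: swap supported TensorFlow/Keras ops for itex.ops versions.
itex.experimental_ops_override()

# The original model code stays unchanged.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="gelu"),
    tf.keras.layers.LayerNormalization(),
])
```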

GPU Profiler

Intel® Extension for TensorFlow* supports the TensorFlow* profiler with almost the same usage as the TensorFlow Profiler(https://www.tensorflow.org/guide/profiler). The only additional step to enable it is exporting three environment variables: export ZE_ENABLE_TRACING_LAYER=1, export UseCyclesPerSecondTimer=1, and export ENABLE_TF_PROFILER=1.

Please refer to GPU Profiler for details.
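
With those variables exported, the standard TensorFlow profiler APIs work as usual. A minimal sketch (setting the variables from Python before TensorFlow initializes is an alternative to export):

```python
import os

# Must be set before TensorFlow initializes the GPU runtime.
os.environ["ZE_ENABLE_TRACING_LAYER"] = "1"
os.environ["UseCyclesPerSecondTimer"] = "1"
os.environ["ENABLE_TF_PROFILER"] = "1"

import tensorflow as tf

tf.profiler.experimental.start("./logdir")
# ... run the model to be profiled ...
tf.profiler.experimental.stop()
```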

CPU Launcher [Experimental]

Several factors influence performance, and setting the configuration options properly contributes to a performance boost. However, no single configuration is optimal for all topologies, so users need to try different combinations.

Intel® Extension for TensorFlow* provides a CPU launcher that automates these configuration settings, freeing users from this complicated work. The launcher guide introduces common usage of the launch script and provides examples covering many optimized configuration cases.

Please refer to CPU Launcher for details.

INT8 Quantization

Intel® Extension for TensorFlow* works with Intel® Neural Compressor(https://github.com/intel/neural-compressor) to provide a compatible TensorFlow INT8 quantization solution with the same user experience.

Please refer to INT8 Quantization for details.
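
A minimal post-training quantization sketch with Intel® Neural Compressor, assuming its 2.x fit API; the model path and calib_dataloader below are hypothetical placeholders:

```python
from neural_compressor.config import PostTrainingQuantConfig
from neural_compressor.quantization import fit

# Hypothetical inputs: an FP32 SavedModel directory and a calibration dataloader
# yielding batches representative of production data.
fp32_model_path = "./fp32_saved_model"
calib_dataloader = ...

config = PostTrainingQuantConfig()  # post-training static quantization defaults
q_model = fit(model=fp32_model_path, conf=config, calib_dataloader=calib_dataloader)
q_model.save("./int8_saved_model")
```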

XPUAutoShard on GPU [Experimental]

Intel® Extension for TensorFlow* provides the XPUAutoShard feature to automatically shard the input data and the TensorFlow graph, placing these shards on GPU devices to maximize hardware usage.

Please refer to XPUAutoShard for details.

OpenXLA Support on GPU [Experimental]

Intel® Extension for TensorFlow* adopts PJRT(https://github.com/openxla/community/blob/main/rfcs/20230123-pjrt-plugin.md), a uniform device API, as its device plugin mechanism to implement the Intel GPU backend for experimental OpenXLA support.

Please refer to OpenXLA_Support_on_GPU for details.
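
Once the plugin is installed and discovered, frameworks built on PJRT can target the Intel GPU. A minimal sketch with JAX, assuming the plugin registers the device (the reported platform name may vary):

```python
import jax
import jax.numpy as jnp

# With the Intel PJRT plugin loaded, Intel GPUs show up as JAX devices.
print(jax.devices())

x = jnp.arange(8.0)
print(jnp.square(x))  # computation dispatched through the plugin backend
```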