Welcome to Intel® Extension for TensorFlow* documentation

Documentation

Overview: Infrastructure, Quick example, Examples, Releases, Performance data, Frequently asked questions, Contributing guidelines
Installation guide: Install for CPU, Install for XPU, Install by source build, Install Conda for GPU distributed
Features: Environment variables, Python API, Next Pluggable Device, CPU Thread Pool, Graph optimization, Custom operator, Advanced auto mixed precision, Operator override, INT8 quantization, XPUAutoShard, GPU profiler, CPU launcher, Weight prepack
Advanced topics: CPU practice guide, GPU practice guide, C++ API support, OpenXLA, Keras 3
Developer Guide: Extension design, Directory structure, Optimizations design, Custom Op

Highlights

  • Environment variables & Python API

    Generally, the default configuration of Intel® Extension for TensorFlow* provides good performance without any code changes. For advanced users, it also provides simple frontend Python APIs and utilities that deliver further optimization across different application scenarios with only minor code changes. Typically, you only need to add two or three statements to the original code.

  • Next Pluggable Device (NPD)

    The Next Pluggable Device (NPD) is the next generation of TensorFlow's plugin mechanism. It lets new accelerator plugins register devices with TensorFlow without modifying the TensorFlow codebase, and it also serves as a conduit to OpenXLA through its PJRT plugin. This simplifies extending TensorFlow with new hardware accelerators.

  • Advanced auto mixed precision (AMP)

    The low-precision data types bfloat16 and float16 are natively supported by the 3rd Generation Xeon® Scalable Processors (codenamed Cooper Lake) with the AVX512 instruction set and by the Intel® Data Center GPU; using them boosts performance and reduces memory use. The low-precision data types supported by Advanced Auto Mixed Precision (AMP) are fully enabled in Intel® Extension for TensorFlow*.

  • Graph optimization

    Intel® Extension for TensorFlow* provides graph optimization that fuses specific operator patterns, such as Conv2D+ReLU or Linear+ReLU, into a single new operator for better performance. Users get the benefits of these fusions transparently.

  • CPU Thread Pool

    Intel® Extension for TensorFlow* uses the OMP thread pool by default because it performs and scales better in most cases. For workloads with large inter-op concurrency, you can switch to the Eigen thread pool (the default in TensorFlow) by setting the environment variable ITEX_OMP_THREADPOOL=0.

  • Operator optimization

    Intel® Extension for TensorFlow* also optimizes existing operators and implements several custom operators for a performance boost. The itex.ops namespace extends the implementation of TensorFlow public APIs for better performance.

  • GPU profiler

    Intel® Extension for TensorFlow* supports the TensorFlow Profiler. To enable the profiler, define three environment variables (export ZE_ENABLE_TRACING_LAYER=1, export UseCyclesPerSecondTimer=1, export ENABLE_TF_PROFILER=1).

  • INT8 quantization

    Intel® Extension for TensorFlow* works with Intel® Neural Compressor to provide a compatible TensorFlow INT8 quantization solution with an equivalent user experience.

  • XPUAutoShard on GPU [Experimental]

    Intel® Extension for TensorFlow* provides the XPUAutoShard feature, which automatically shards the input data and the TensorFlow graph and places the resulting shards on GPU devices to maximize hardware usage.

  • OpenXLA

    Intel® Extension for TensorFlow* adopts PJRT, a uniform device API, as its supported device plugin mechanism to implement the Intel GPU backend for OpenXLA on the TensorFlow frontend.

  • Keras 3

    Keras 3 with TensorFlow comes with a significant change: Just-In-Time (JIT) compilation is enabled by default. This feature uses the XLA (Accelerated Linear Algebra) compiler to optimize TensorFlow computations. See Keras 3 to avoid possible performance issues and errors.
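
The environment variables named in the CPU Thread Pool and GPU profiler highlights can also be set from Python instead of with shell export commands. The following is a minimal sketch, not an official recipe: the variable names and values come from the text above, while the helper apply_settings and the constants EIGEN_THREAD_POOL and GPU_PROFILER are hypothetical names introduced only for illustration. It assumes the variables need to be in place before TensorFlow is imported.

```python
import os

# Names and values are taken from the highlights above.
# The constant names themselves are hypothetical.
EIGEN_THREAD_POOL = {"ITEX_OMP_THREADPOOL": "0"}
GPU_PROFILER = {
    "ZE_ENABLE_TRACING_LAYER": "1",
    "UseCyclesPerSecondTimer": "1",
    "ENABLE_TF_PROFILER": "1",
}


def apply_settings(settings, environ=os.environ):
    """Copy each name/value pair into the given environment mapping."""
    for name, value in settings.items():
        environ[name] = value
    return environ


if __name__ == "__main__":
    apply_settings(EIGEN_THREAD_POOL)  # switch from OMP to the Eigen thread pool
    apply_settings(GPU_PROFILER)       # enable the TensorFlow Profiler on GPU
    # Assumption: import TensorFlow only after this point so the
    # settings are visible when the extension is loaded.
    for name in (*EIGEN_THREAD_POOL, *GPU_PROFILER):
        print(f"{name}={os.environ[name]}")
```

Passing a plain dict as the second argument makes it possible to inspect the settings without touching the real process environment.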