Graph fusion

Intel® Extension for TensorFlow* provides graph optimization to fuse specified operator patterns into a new single operator for better performance.

Basic fusion

The supported basic fusions are listed below. These fusions require the input and output to have the same data type.

| Pattern | Number of operators |
|---------|---------------------|
| (Equal, NotEqual, GreaterEqual, Greater, LessEqual, Less)+Cast | 2 |
| L2Loss+AddN | 2 |
| BatchMatMul+Mul | 2 |
| Mul+AddN+TrainingOp | 3 |
| Conv+Bias | 2 |
| Conv+Bias+(Relu, Relu6, Elu, LeakyRelu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 3 |
| MatMul+Bias | 2 |
| MatMul+Bias+(Relu, Relu6, Elu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 3 |
| FusedBatchNorm+Relu | 2 |
| FusedBatchNormGrad+ReluGrad | 2 |
| Conv+Bias+Add | 3 |
| Conv+Bias+Add+(Relu, Relu6, Elu, LeakyRelu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 4 |
| MatMul+Bias+Add | 3 |
| MatMul+Bias+Add+(Relu, Relu6, Elu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 4 |
| MatMul+BiasAddGrad | 2 |
| ConvGradFilter+BiasAddGrad | 2 |
| Pad+Conv | 2 |
| BatchMatMul with variable post-op | 2+ |
| Swish | 2 |
| LayerNorm | 3+ |
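
For example, a subgraph that applies a convolution, a bias add, and a ReLU in sequence matches the Conv+Bias+Relu pattern above and is fused automatically; no model changes are needed. The following is a minimal sketch (tensor shapes and values are illustrative):

```python
import tensorflow as tf

# Conv + Bias + Relu expressed as three separate operators. With
# Intel® Extension for TensorFlow* installed, the graph optimizer
# rewrites this subgraph into a single fused operator.
@tf.function
def conv_bias_relu(x, filters, bias):
    y = tf.nn.conv2d(x, filters, strides=1, padding="SAME")  # Conv
    y = tf.nn.bias_add(y, bias)                              # Bias
    return tf.nn.relu(y)                                     # Relu

x = tf.random.normal([1, 28, 28, 3])
filters = tf.random.normal([3, 3, 3, 16])
bias = tf.zeros([16])
out = conv_bias_relu(x, filters, bias)  # runs the optimized (fused) graph
```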

Mixed data type fusion

Because fusions in stock TensorFlow require the input and output to have the same data type, the Cast nodes inserted during BF16 inference and training may break existing fusion patterns and degrade performance.

Intel® Extension for TensorFlow* provides mixed data type fusion, which removes these additional data type conversions at the graph level.

The supported mixed data type fusions are listed below, taking MatMul as the example.

| Pattern | Fused operator | Input data type | Output data type | oneDNN FP32 math mode |
|---------|----------------|-----------------|------------------|-----------------------|
| MatMul + Cast | AccMatMul | BF16 | FP32 | N/A |
| FusedMatMul + Cast | FusedAccMatMul | BF16 | FP32 | N/A |
| AccMatMul + any MatMul fusion | FusedAccMatMul | BF16 | FP32 | N/A |
| Cast + MatMul + Cast | AccMatMul | FP32 | FP32 | BF16 |
| Cast + FusedMatMul + Cast | FusedAccMatMul | FP32 | FP32 | BF16 |
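
As a sketch of the first row, consider a MatMul on BF16 operands whose result is cast back to FP32. With the extension installed, the MatMul + Cast pair is replaced by a single AccMatMul that outputs FP32 directly, so the extra conversion disappears (shapes are illustrative):

```python
import tensorflow as tf

# MatMul + Cast pattern: BF16 inputs, FP32 output. The extension
# fuses the trailing Cast into AccMatMul, so the conversion no
# longer appears as a separate node in the graph.
@tf.function
def matmul_then_cast(a, b):
    c = tf.matmul(a, b)            # BF16 x BF16 -> BF16 MatMul
    return tf.cast(c, tf.float32)  # Cast the BF16 result up to FP32

a = tf.cast(tf.random.normal([4, 8]), tf.bfloat16)
b = tf.cast(tf.random.normal([8, 2]), tf.bfloat16)
out = matmul_then_cast(a, b)
```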

Implementation details

The Cast + (Fused)MatMul + Cast pattern is covered by the pattern matcher; the remaining patterns are covered by remapper fusion. The new kernels (AccMatMul and FusedAccMatMul(WithSum)) are implemented as an extension of the original MatMul, with the following new attributes:

  • Tout: Output data type ∈ {float32}.

  • Tpost: Post-op data type ∈ {bfloat16, float32}.

  • is_bf16_math_mode: A Boolean indicating whether to use oneDNN BF16 math mode when both the input and output are FP32.
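
As a conceptual reference for these semantics (not the actual kernel), AccMatMul with Tout = float32 behaves like a BF16 matrix multiplication whose accumulation and output are FP32. Since BF16 values are exactly representable in FP32, up-casting the operands gives a numerically equivalent sketch:

```python
import tensorflow as tf

# Conceptual reference only: approximates AccMatMul semantics with
# plain TensorFlow ops. The real kernel multiplies BF16 operands
# while accumulating in FP32 (Tout = float32) in a single operator.
def acc_matmul_reference(a_bf16, b_bf16):
    return tf.matmul(tf.cast(a_bf16, tf.float32),
                     tf.cast(b_bf16, tf.float32))
```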

Generic layout optimizer

Because stock TensorFlow does not support the channels_first format on CPU, it inserts Transpose nodes before and after Conv3D/MaxPool3D nodes. This problem does not exist on GPU devices, however. To avoid the unnecessary layout transformations when running on a GPU device, Intel® Extension for TensorFlow* adds a separate layout optimizer.

| Pattern | Fused operator | Conv data format (before optimization) | Conv data format (after optimization) |
|---------|----------------|----------------------------------------|---------------------------------------|
| Transpose + Conv3D + Transpose | Conv3D | NDHWC | NCDHW |
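
A minimal sketch of this pattern (shapes illustrative): the model computes a channels_first Conv3D by transposing to channels_last and back, and on GPU the layout optimizer strips the Transpose pair and runs the convolution directly in NCDHW.

```python
import tensorflow as tf

# Transpose + Conv3D + Transpose: a channels_first (NCDHW) Conv3D
# written around a channels_last (NDHWC) convolution. On GPU, the
# generic layout optimizer removes both Transpose nodes.
@tf.function
def conv3d_channels_first(x_ncdhw, filters):
    x = tf.transpose(x_ncdhw, perm=[0, 2, 3, 4, 1])   # NCDHW -> NDHWC
    y = tf.nn.conv3d(x, filters, strides=[1, 1, 1, 1, 1],
                     padding="SAME")                  # NDHWC Conv3D
    return tf.transpose(y, perm=[0, 4, 1, 2, 3])      # NDHWC -> NCDHW

x = tf.random.normal([1, 3, 8, 8, 8])                 # N, C, D, H, W
filters = tf.random.normal([2, 2, 2, 3, 16])          # D, H, W, Cin, Cout
out = conv3d_channels_first(x, filters)
```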