# Graph fusion

Intel® Extension for TensorFlow* provides graph optimization that fuses specified operator patterns into a single new operator for better performance.

## Basic fusion

The supported basic fusions are listed below. These fusions require the input and output to have the same data type.
| Pattern | Operator number |
|---|---|
| (Equal, NotEqual, GreaterEqual, Greater, LessEqual, Less) + Cast | 2 |
| L2loss + AddN | 2 |
| BatchMatMul + Mul | 2 |
| Mul + AddN + TrainingOp | 3 |
| Conv + Bias | 2 |
| Conv + Bias + (Relu, Relu6, Elu, LeakyRelu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 3 |
| MatMul + Bias | 2 |
| MatMul + Bias + (Relu, Relu6, Elu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 3 |
| FusedBatchNorm + Relu | 2 |
| FusedBatchNormGrad + ReluGrad | 2 |
| Conv + Bias + Add | 3 |
| Conv + Bias + Add + (Relu, Relu6, Elu, LeakyRelu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 4 |
| MatMul + Bias + Add | 3 |
| MatMul + Bias + Add + (Relu, Relu6, Elu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 4 |
| MatMul + BiasAddGrad | 2 |
| ConvGradFilter + BiasAddGrad | 2 |
| Pad + Conv | 2 |
| BatchMatMul with variable post-op | 2+ |
| Swish | 2 |
| LayerNorm | 3+ |
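As a rough mental model of how such fusions apply, here is a minimal sketch of a remapper-style pass over a linearized op list. The linear-list representation and the greedy matching are simplifications, and the fused-op names (`_FusedMatMul`, `_FusedConv2D`, and so on) are illustrative assumptions, not the extension's actual kernels.

```python
# Toy remapper sketch: greedily fuse known operator sequences in a
# linearized graph. Real graph fusion works on a dataflow graph; this
# linear list and the fused-op names are illustrative only.
FUSION_PATTERNS = [
    # (pattern, fused op) -- longer patterns listed first so that e.g.
    # MatMul+BiasAdd+Add+Relu wins over MatMul+BiasAdd.
    (("MatMul", "BiasAdd", "Add", "Relu"), "_FusedMatMulWithSum"),
    (("MatMul", "BiasAdd", "Relu"), "_FusedMatMul"),
    (("MatMul", "BiasAdd"), "_FusedMatMul"),
    (("Conv2D", "BiasAdd", "Relu"), "_FusedConv2D"),
    (("FusedBatchNorm", "Relu"), "_FusedBatchNormEx"),
]

def fuse(ops):
    """Replace every matched operator sequence with its fused operator."""
    out, i = [], 0
    while i < len(ops):
        for pattern, fused in FUSION_PATTERNS:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                out.append(fused)
                i += len(pattern)
                break
        else:
            out.append(ops[i])
            i += 1
    return out

print(fuse(["MatMul", "BiasAdd", "Relu", "Softmax"]))
# -> ['_FusedMatMul', 'Softmax']
```

The "Operator number" column in the table above corresponds to the length of each matched pattern: a 3-operator fusion replaces three nodes with one.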
## Mixed data type fusion

Because stock TensorFlow only supports input and output of the same data type, inserting a cast node during BF16 inference and training may break an existing fusion pattern and hurt performance. Intel® Extension for TensorFlow* provides mixed data type fusion, which removes these additional data type conversions at the graph level. The supported mixed data type fusions are listed below, taking MatMul as an example.
| Pattern | Fused operator | Input data type | Output data type | oneDNN FP32 math mode |
|---|---|---|---|---|
| MatMul + Cast | AccMatMul | BF16 | FP32 | N/A |
| FusedMatMul + Cast | FusedAccMatMul | BF16 | FP32 | N/A |
| AccMatMul + any MatMul fusion | FusedAccMatMul | BF16 | FP32 | N/A |
| Cast + MatMul + Cast | AccMatMul | FP32 | FP32 | BF16 |
| Cast + FusedMatMul + Cast | FusedAccMatMul | FP32 | FP32 | BF16 |
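To illustrate the last two rows, here is a minimal sketch of collapsing a Cast + MatMul + Cast chain into a single FP32-in/FP32-out node. The `(op, dtype)` pair representation and the rewrite logic are simplifications for illustration, not the extension's actual pattern matcher.

```python
# Toy mixed-precision rewrite sketch (illustrative only): collapse
# Cast + MatMul + Cast into one node that takes and produces FP32
# while using BF16 math internally, mirroring the table above.
def rewrite_cast_matmul_cast(nodes):
    """nodes: list of (op, output dtype) pairs in execution order."""
    out, i = [], 0
    while i < len(nodes):
        window = [op for op, _ in nodes[i:i + 3]]
        if window == ["Cast", "MatMul", "Cast"]:
            # FP32 -> BF16 cast, BF16 matmul, BF16 -> FP32 cast becomes
            # a single FP32-in/FP32-out node with BF16 math mode on.
            out.append(("AccMatMul", {"Tout": "float32",
                                      "is_bf16_math_mode": True}))
            i += 3
        else:
            out.append(nodes[i])
            i += 1
    return out

graph = [("Cast", "bfloat16"), ("MatMul", "bfloat16"), ("Cast", "float32")]
print(rewrite_cast_matmul_cast(graph))
# -> [('AccMatMul', {'Tout': 'float32', 'is_bf16_math_mode': True})]
```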
### Implementation Details

The Cast + (Fused)MatMul + Cast pattern is handled by the pattern matcher; the rest are handled by remapper fusion. The new kernels (AccMatMul and FusedAccMatMul(WithSum)) are implemented as extensions of the original MatMul with the following new attributes:

- `Tout`: output data type ∈ {`float32`}.
- `Tpost`: post-op data type ∈ {`bfloat16`, `float32`}.
- `is_bf16_math_mode`: a Boolean indicating whether to use oneDNN `BF16` math mode for `FP32` input and `FP32` output.
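A sketch of how these attributes might be chosen from the fusion's input data type, following the table above. The selection logic and the `Tpost` values per branch are assumptions for illustration; only the attribute names come from this document.

```python
# Illustrative mapping from the fused pattern's input data type to the
# AccMatMul attributes described above (sketch, not the real kernel logic).
def acc_matmul_attrs(input_dtype):
    if input_dtype == "bfloat16":
        # MatMul + Cast case: BF16 inputs, FP32 output, no math-mode flag.
        return {"Tout": "float32", "Tpost": "bfloat16",
                "is_bf16_math_mode": False}
    if input_dtype == "float32":
        # Cast + MatMul + Cast case: FP32 in/out with oneDNN BF16 math mode.
        return {"Tout": "float32", "Tpost": "float32",
                "is_bf16_math_mode": True}
    raise ValueError(f"unsupported input dtype: {input_dtype}")

print(acc_matmul_attrs("float32"))
# -> {'Tout': 'float32', 'Tpost': 'float32', 'is_bf16_math_mode': True}
```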
## Generic layout optimizer

Because stock TensorFlow does not support the channels_first format on CPU, it inserts Transpose nodes before and after Conv3D/MaxPool3D nodes. This problem does not exist on GPU devices, so to avoid unnecessary layout transformations when running on a GPU device, Intel® Extension for TensorFlow* adds a separate layout optimizer.
| Pattern | Fused operator | Conv data format (before optimization) | Conv data format (after optimization) |
|---|---|---|---|
| Transpose + Conv3D + Transpose | Conv3D | NDHWC | NCDHW |
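A minimal sketch of this rewrite: on GPU, the Transpose pair around a Conv3D is dropped and the convolution runs directly in the channels_first (NCDHW) layout. The `(op, data_format)` pair representation is an illustrative simplification, not the optimizer's actual graph representation.

```python
# Toy layout-optimizer sketch (illustrative only): on GPU, remove the
# Transpose pair stock TensorFlow inserts around Conv3D and run the
# convolution directly in the channels_first (NCDHW) layout.
def optimize_layout(nodes, device="GPU"):
    """nodes: list of (op, data_format) pairs in execution order."""
    out, i = [], 0
    while i < len(nodes):
        window = [op for op, _ in nodes[i:i + 3]]
        if device == "GPU" and window == ["Transpose", "Conv3D", "Transpose"]:
            # NCDHW -> NDHWC transpose, NDHWC conv, NDHWC -> NCDHW transpose
            # collapses to a single channels_first convolution.
            out.append(("Conv3D", "NCDHW"))
            i += 3
        else:
            out.append(nodes[i])
            i += 1
    return out

graph = [("Transpose", None), ("Conv3D", "NDHWC"), ("Transpose", None)]
print(optimize_layout(graph))
# -> [('Conv3D', 'NCDHW')]
```

On CPU (or any non-GPU device) the sketch leaves the graph unchanged, matching the motivation above: the transposes are only redundant where channels_first convolution is supported.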