Graph fusion

Intel® Extension for TensorFlow* provides a graph optimization that fuses specific operator patterns into a single new operator for better performance.

Basic fusion

The basic supported fusions are listed below. These fusions require the input and output to have the same data type.

| Pattern | Number of operators |
| --- | --- |
| (`Equal`, `NotEqual`, `GreaterEqual`, `Greater`, `LessEqual`, `Less`)+`Cast` | 2 |
| `L2Loss`+`AddN` | 2 |
| `BatchMatMul`+`Mul` | 2 |
| `Mul`+`AddN`+`TrainingOp` | 3 |
| `Conv`+`Bias` | 2 |
| `Conv`+`Bias`+(`Relu`, `Relu6`, `Elu`, `LeakyRelu`, `Gelu_erf`, `Gelu_tanh`, `Tanh`, `Sigmoid`) | 3 |
| `MatMul`+`Bias` | 2 |
| `MatMul`+`Bias`+(`Relu`, `Relu6`, `Elu`, `Gelu_erf`, `Gelu_tanh`, `Tanh`, `Sigmoid`) | 3 |
| `FusedBatchNorm`+`Relu` | 2 |
| `FusedBatchNormGrad`+`ReluGrad` | 2 |
| `Conv`+`Bias`+`Add` | 3 |
| `Conv`+`Bias`+`Add`+(`Relu`, `Relu6`, `Elu`, `LeakyRelu`, `Gelu_erf`, `Gelu_tanh`, `Tanh`, `Sigmoid`) | 4 |
| `MatMul`+`Bias`+`Add` | 3 |
| `MatMul`+`Bias`+`Add`+(`Relu`, `Relu6`, `Elu`, `Gelu_erf`, `Gelu_tanh`, `Tanh`, `Sigmoid`) | 4 |
| `MatMul`+`BiasAddGrad` | 2 |
| `ConvGradFilter`+`BiasAddGrad` | 2 |
| `Pad`+`Conv` | 2 |
| `BatchMatMul` with variable post-op | 2+ |
| `Swish` | 2 |
| `LayerNorm` | 3+ |
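
To make the idea of pattern fusion concrete, here is a minimal sketch of a greedy matcher over a linear chain of operator names. It is an illustration only: the list-based graph, `FUSION_PATTERNS`, and `fuse_chain` are hypothetical names, and the fused name `FusedMatMul` is used purely as a label. The extension's actual optimizer works on TensorFlow graphs and is not shown here.

```python
# Toy illustration only: the chain format and FUSION_PATTERNS are hypothetical,
# not the extension's internal representation.
FUSION_PATTERNS = [
    # Longer patterns come first so they win over their own prefixes.
    (("MatMul", "Bias", "Relu"), "FusedMatMul"),
    (("MatMul", "Bias"), "FusedMatMul"),
]


def fuse_chain(ops):
    """Greedily replace known operator sequences in a linear chain of ops."""
    fused = []
    i = 0
    while i < len(ops):
        for pattern, fused_name in FUSION_PATTERNS:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                # Record which original operators were folded together.
                fused.append((fused_name, pattern))
                i += len(pattern)
                break
        else:
            # No pattern starts here, so keep the operator unchanged.
            fused.append((ops[i], (ops[i],)))
            i += 1
    return fused


chain = ["MatMul", "Bias", "Relu", "Softmax", "MatMul", "Bias"]
result = fuse_chain(chain)
for name, sources in result:
    print(name, "<-", "+".join(sources))
```

In this sketch six operators collapse into three nodes, which mirrors the "number of operators" column: a three-operator pattern and a two-operator pattern each become one fused node.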

Mixed data type fusion

Stock TensorFlow operators require the input and output to have the same data type, so BF16 inference and training insert `Cast` nodes into the graph. These nodes can break existing fusion patterns and hurt performance.

Intel® Extension for TensorFlow* provides mixed data type fusion, which removes the additional data type conversions at the graph level.

The supported mixed data type fusions are listed below. The implementation details that follow use MatMul as an example.

| Pattern | Fused operator | Input data type | Output data type | oneDNN FP32 math mode |
| --- | --- | --- | --- | --- |
| `MatMul` + `Cast` | `AccMatMul` | `BF16` | `FP32` | N/A |
| `FusedMatMul` + `Cast` | `FusedAccMatMul` | `BF16` | `FP32` | N/A |
| `AccMatMul` + any MatMul fusion | `FusedAccMatMul` | `BF16` | `FP32` | N/A |
| `Cast` + `MatMul` + `Cast` | `AccMatMul` | `FP32` | `FP32` | `BF16` |
| `Cast` + `FusedMatMul` + `Cast` | `FusedAccMatMul` | `FP32` | `FP32` | `BF16` |

Implementation details

The `Cast` + `(Fused)MatMul` + `Cast` pattern is handled by the pattern matcher; the remaining patterns are handled by remapper fusion. The new kernels (`AccMatMul` and `FusedAccMatMul(WithSum)`) are implemented as extensions of the original MatMul, with the following new attributes:

`Tout`: Output data type ∈ {float32}.
`Tpost`: Post-op data type ∈ {bfloat16, float32}.
`is_bf16_math_mode`: A Boolean indicating whether to use oneDNN BF16 math mode when both the input and the output are FP32.
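
The constraints on the attributes listed above can be sketched as a small validated record. This is a hypothetical model, not the kernel's registration code: the class `AccMatMulAttrs`, the field `input_dtype`, the two helper functions, and the default `Tpost` value are assumptions made for illustration, as is mapping the table's "N/A" math mode to `False`.

```python
from dataclasses import dataclass

# Allowed values taken from the attribute list above.
ALLOWED_TOUT = {"float32"}
ALLOWED_TPOST = {"bfloat16", "float32"}


@dataclass(frozen=True)
class AccMatMulAttrs:
    """Hypothetical container for the new MatMul attributes."""
    input_dtype: str          # dtype of the MatMul inputs (assumed field name)
    Tout: str                 # output data type
    Tpost: str                # post-op data type
    is_bf16_math_mode: bool   # use oneDNN BF16 math mode for FP32 in/out

    def __post_init__(self):
        if self.Tout not in ALLOWED_TOUT:
            raise ValueError(f"unsupported Tout: {self.Tout}")
        if self.Tpost not in ALLOWED_TPOST:
            raise ValueError(f"unsupported Tpost: {self.Tpost}")
        # BF16 math mode is only meaningful for FP32 input and FP32 output.
        if self.is_bf16_math_mode and self.input_dtype != "float32":
            raise ValueError("BF16 math mode only applies to FP32 input")


def attrs_for_matmul_cast(tpost="float32"):
    """Attributes for the MatMul + Cast row: BF16 input, FP32 output."""
    return AccMatMulAttrs("bfloat16", "float32", tpost, False)


def attrs_for_cast_matmul_cast(tpost="float32"):
    """Attributes for the Cast + MatMul + Cast row: FP32 input and output."""
    return AccMatMulAttrs("float32", "float32", tpost, True)


print(attrs_for_matmul_cast())
print(attrs_for_cast_matmul_cast())
```

The two helpers correspond to the first and fourth rows of the mixed data type table; the other rows would differ only in the fused operator name.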

Generic layout optimizer

Stock TensorFlow does not support the `channels_first` format on CPU, so it inserts `Transpose` nodes before and after `Conv3D`/`MaxPool3D` nodes. GPU devices do not have this limitation. To avoid unnecessary layout transformations when running on a GPU device, Intel® Extension for TensorFlow* adds a separate layout optimizer.

| Pattern | Fused operator | Conv data format (before optimization) | Conv data format (after optimization) |
| --- | --- | --- | --- |
| `Transpose` + `Conv3D` + `Transpose` | `Conv3D` | `NDHWC` | `NCDHW` |
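
The rewrite in this table can be sketched as follows, assuming the surrounding transposes are the usual `NCDHW` to `NDHWC` permutation and its inverse. The tuple-based node format and the function names `compose` and `optimize_layout` are hypothetical; the sketch only shows why the two transposes cancel once the convolution runs in `NCDHW` directly.

```python
# Axis permutations in the style of a transpose `perm` argument.
NCDHW_TO_NDHWC = (0, 2, 3, 4, 1)
NDHWC_TO_NCDHW = (0, 4, 1, 2, 3)


def compose(first, second):
    """Permutation equivalent to transposing by `first`, then by `second`."""
    # After transposing by `first`, output axis i holds input axis first[i].
    return tuple(first[axis] for axis in second)


def optimize_layout(nodes):
    """Collapse Transpose + Conv3D(NDHWC) + Transpose into Conv3D(NCDHW).

    `nodes` is a hypothetical linear chain of (op_name, attribute) tuples.
    """
    out = []
    i = 0
    while i < len(nodes):
        window = nodes[i:i + 3]
        if (len(window) == 3
                and window[0] == ("Transpose", NCDHW_TO_NDHWC)
                and window[1] == ("Conv3D", "NDHWC")
                and window[2] == ("Transpose", NDHWC_TO_NCDHW)):
            # The two transposes are inverses, so only the Conv3D remains.
            out.append(("Conv3D", "NCDHW"))
            i += 3
        else:
            out.append(nodes[i])
            i += 1
    return out


identity = compose(NCDHW_TO_NDHWC, NDHWC_TO_NCDHW)
graph = [
    ("Relu", None),
    ("Transpose", NCDHW_TO_NDHWC),
    ("Conv3D", "NDHWC"),
    ("Transpose", NDHWC_TO_NCDHW),
    ("Relu", None),
]
optimized = optimize_layout(graph)
print(identity)
print(optimized)
```

Because the composed permutation is the identity, removing both `Transpose` nodes and switching the convolution's data format leaves the tensor layout seen by neighboring operators unchanged.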