# Graph fusion Intel® Extension for TensorFlow\* provides graph optimization to fuse specified operator patterns into a new single operator for better performance. ## Basic fusion The basic list of supported fusions is shown below. These fusions require input and output of the same data type. | Pattern | Operator number | | -- | -- | | (`Equal`, `NotEqual`, `GreaterEqual`, `Greater`, `LessEqual`, `Less`)+Cast | 2 | | `L2loss`+`AddN` | 2 | | `BatchMatMul`+`Mul` | 2 | | `Mul`+`AddN`+`TrainingOp` | 3 | | `Conv`+`Bias` | 2 | | `Conv`+`Bias`+(`Relu`, `Relu6`, `Elu`, `LeakyRelu`, `Gelu_erf`, `Gelu_tanh`, `Tanh`, `Sigmoid`) | 3 | | `MatMul`+`Bias` | 2 | | `MatMul`+`Bias`+(`Relu`, `Relu6`, `Elu`, `Gelu_erf`, `Gelu_tanh`, `Tanh`, `Sigmoid`) | 3 | | `FusedBatchNorm+Relu` | 2 | | `FusedBatchNormGrad+ReluGrad` | 2 | | `Conv+Bias+Add` | 3 | | `Conv`+`Bias`+`Add`+(`Relu`, `Relu6`, `Elu`, `LeakyRelu`, `Gelu_erf`, `Gelu_tanh`, `Tanh`, `Sigmoid`) | 4 | | `MatMul`+`Bias`+`Add` | 3 | | `MatMul`+`Bias`+`Add`+(`Relu`, `Relu6`, `Elu`, `Gelu_erf`, `Gelu_tanh`, `Tanh`, `Sigmoid`) | 4 | | `MatMul+BiasAddGrad` | 2 | | `ConvGradFilter`+`BiasAddGrad` | 2 | | `Pad`+`Conv` | 2 | | `BatchMatMul` with variable post-op | 2+ | | `Swish` | 2 | | `LayerNorm` | 3+ | ## Mixed data type fusion As stock TensorFlow only supports same-data-type input and output, inserting a cast node during `BF16` inference and training may break the existing fusion pattern and impact performance. Intel® Extension for TensorFlow\* provides mixed data type fusion, which removes the additional data type conversions on the graph level. Here is the list of supported mixed data type fusions, and we'll take a closer look at `MatMul` as an example. | Pattern | Fused operator | Input data type | Output data type| oneDNN `FP32` Math mode | | -- | -- | -- | -- | -- | | `MatMul + Cast` | `AccMatMul` | `BF16` | `FP32` | N/A | | `FusedMatMul + Cast` | `FusedAccMatMul` | `BF16` | `FP32` | N/A | | `AccMatMul + any MatMul Fusion` | `FusedAccMatMul` | `BF16` | `FP32` | N/A | | `Cast + MatMul + Cast` | `AccMatMul` | `FP32` | `FP32` | `BF16` | | `Cast + FusedMatMul + Cast` | `FusedAccMatMul` | `FP32` | `FP32` | `BF16` | #### Implementation Details The `Cast + (Fused)MatMul + Cast` pattern is covered by pattern matcher; the rest is covered by remapper fusion. The new kernels are implemented(`AccMatMul` and `FusedAccMatMul(WithSum)`)as an extension of original `MatMul` with the following new attributes: - `Tout`: Output data type ∈ {`float32`}. - `Tpost`: Post op data type ∈ {`bfloat16`, `float32`}. - `is_bf16_math_mode`: A Boolean to indicate whether to use oneDNN `BF16` math mode if `FP32` input, `FP32` output. ## Generic layout optimizer As the channels_first format is not supported by stock TensorFlow on CPU, it inserts transpose nodes before and after the Conv3D/MaxPool3D nodes. However, this problem does not exist in GPU device. To avoid unnecessary layout transformation when running on a GPU device, Intel® Extension for TensorFlow\* adds a separate layout optimizer. | Pattern | Fused operator | Conv data format (before optimization) | Conv data format (after optimization)| | -- | -- | -- | -- | | `Transpose + Conv3D + Transpose` | `Conv3D` | `NDHWC` | `NCDHW` |