# Graph fusion

Intel® Extension for TensorFlow* provides graph optimization that fuses specified operator patterns into a single new operator for better performance.

## Basic fusion

The supported basic fusions are listed below. These fusions require the input and output to have the same data type.
| Pattern | Operator number |
|---|---|
| (Equal, NotEqual, GreaterEqual, Greater, LessEqual, Less) + Cast | 2 |
| L2loss + AddN | 2 |
| BatchMatMul + Mul | 2 |
| Mul + AddN + TrainingOp | 3 |
| Conv + Bias | 2 |
| Conv + Bias + (Relu, Relu6, Elu, LeakyRelu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 3 |
| MatMul + Bias | 2 |
| MatMul + Bias + (Relu, Relu6, Elu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 3 |
| FusedBatchNorm + Relu | 2 |
| FusedBatchNormGrad + ReluGrad | 2 |
| Conv + Bias + Add | 3 |
| Conv + Bias + Add + (Relu, Relu6, Elu, LeakyRelu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 4 |
| MatMul + Bias + Add | 3 |
| MatMul + Bias + Add + (Relu, Relu6, Elu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 4 |
| MatMul + BiasAddGrad | 2 |
| ConvGradFilter + BiasAddGrad | 2 |
| Pad + Conv | 2 |
| BatchMatMul with variable post-op | 2+ |
| Swish | 2 |
| LayerNorm | 3+ |
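As a rough mental model of how such fusions apply, here is a minimal sketch of a remapper-style pass over a linearized op list. The linear-list representation and the greedy matching are simplifications, and the fused-op names (`_FusedMatMul`, `_FusedConv2D`, and so on) are illustrative assumptions, not the extension's actual kernels.

```python
# Toy remapper sketch: greedily fuse known operator sequences in a
# linearized graph. Real graph fusion works on a dataflow graph; this
# linear list and the fused-op names are illustrative only.
FUSION_PATTERNS = [
    # (pattern, fused op) -- longer patterns listed first so that e.g.
    # MatMul+BiasAdd+Add+Relu wins over MatMul+BiasAdd.
    (("MatMul", "BiasAdd", "Add", "Relu"), "_FusedMatMulWithSum"),
    (("MatMul", "BiasAdd", "Relu"), "_FusedMatMul"),
    (("MatMul", "BiasAdd"), "_FusedMatMul"),
    (("Conv2D", "BiasAdd", "Relu"), "_FusedConv2D"),
    (("FusedBatchNorm", "Relu"), "_FusedBatchNormEx"),
]

def fuse(ops):
    """Replace every matched operator sequence with its fused operator."""
    out, i = [], 0
    while i < len(ops):
        for pattern, fused in FUSION_PATTERNS:
            if tuple(ops[i:i + len(pattern)]) == pattern:
                out.append(fused)
                i += len(pattern)
                break
        else:
            out.append(ops[i])
            i += 1
    return out

print(fuse(["MatMul", "BiasAdd", "Relu", "Softmax"]))
# -> ['_FusedMatMul', 'Softmax']
```

The "Operator number" column in the table above corresponds to the length of each matched pattern: a 3-operator fusion replaces three nodes with one.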
## Mixed data type fusion

Because stock TensorFlow only supports input and output of the same data type, inserting a cast node during BF16 inference and training may break an existing fusion pattern and hurt performance. Intel® Extension for TensorFlow* provides mixed data type fusion, which removes these additional data type conversions at the graph level. The supported mixed data type fusions are listed below, taking MatMul as an example.
| Pattern | Fused operator | Input data type | Output data type | oneDNN FP32 math mode |
|---|---|---|---|---|
| MatMul + Cast | AccMatMul | BF16 | FP32 | N/A |
| FusedMatMul + Cast | FusedAccMatMul | BF16 | FP32 | N/A |
| AccMatMul + any MatMul fusion | FusedAccMatMul | BF16 | FP32 | N/A |
| Cast + MatMul + Cast | AccMatMul | FP32 | FP32 | BF16 |
| Cast + FusedMatMul + Cast | FusedAccMatMul | FP32 | FP32 | BF16 |
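To illustrate the last two rows, here is a minimal sketch of collapsing a Cast + MatMul + Cast chain into a single FP32-in/FP32-out node. The `(op, dtype)` pair representation and the rewrite logic are simplifications for illustration, not the extension's actual pattern matcher.

```python
# Toy mixed-precision rewrite sketch (illustrative only): collapse
# Cast + MatMul + Cast into one node that takes and produces FP32
# while using BF16 math internally, mirroring the table above.
def rewrite_cast_matmul_cast(nodes):
    """nodes: list of (op, output dtype) pairs in execution order."""
    out, i = [], 0
    while i < len(nodes):
        window = [op for op, _ in nodes[i:i + 3]]
        if window == ["Cast", "MatMul", "Cast"]:
            # FP32 -> BF16 cast, BF16 matmul, BF16 -> FP32 cast becomes
            # a single FP32-in/FP32-out node with BF16 math mode on.
            out.append(("AccMatMul", {"Tout": "float32",
                                      "is_bf16_math_mode": True}))
            i += 3
        else:
            out.append(nodes[i])
            i += 1
    return out

graph = [("Cast", "bfloat16"), ("MatMul", "bfloat16"), ("Cast", "float32")]
print(rewrite_cast_matmul_cast(graph))
# -> [('AccMatMul', {'Tout': 'float32', 'is_bf16_math_mode': True})]
```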
### Implementation Details

The Cast + (Fused)MatMul + Cast pattern is handled by the pattern matcher; the rest are handled by remapper fusion. The new kernels (AccMatMul and FusedAccMatMul(WithSum)) are implemented as extensions of the original MatMul with the following new attributes:

- `Tout`: output data type ∈ {`float32`}.
- `Tpost`: post-op data type ∈ {`bfloat16`, `float32`}.
- `is_bf16_math_mode`: a Boolean indicating whether to use oneDNN `BF16` math mode for `FP32` input and `FP32` output.
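A sketch of how these attributes might be chosen from the fusion's input data type, following the table above. The selection logic and the `Tpost` values per branch are assumptions for illustration; only the attribute names come from this document.

```python
# Illustrative mapping from the fused pattern's input data type to the
# AccMatMul attributes described above (sketch, not the real kernel logic).
def acc_matmul_attrs(input_dtype):
    if input_dtype == "bfloat16":
        # MatMul + Cast case: BF16 inputs, FP32 output, no math-mode flag.
        return {"Tout": "float32", "Tpost": "bfloat16",
                "is_bf16_math_mode": False}
    if input_dtype == "float32":
        # Cast + MatMul + Cast case: FP32 in/out with oneDNN BF16 math mode.
        return {"Tout": "float32", "Tpost": "float32",
                "is_bf16_math_mode": True}
    raise ValueError(f"unsupported input dtype: {input_dtype}")

print(acc_matmul_attrs("float32"))
# -> {'Tout': 'float32', 'Tpost': 'float32', 'is_bf16_math_mode': True}
```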
## Generic layout optimizer

Because stock TensorFlow does not support the channels_first format on CPU, it inserts Transpose nodes before and after Conv3D/MaxPool3D nodes. This problem does not exist on GPU devices, so to avoid unnecessary layout transformations when running on a GPU device, Intel® Extension for TensorFlow* adds a separate layout optimizer.
| Pattern | Fused operator | Conv data format (before optimization) | Conv data format (after optimization) |
|---|---|---|---|
| Transpose + Conv3D + Transpose | Conv3D | NDHWC | NCDHW |
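A minimal sketch of this rewrite: on GPU, the Transpose pair around a Conv3D is dropped and the convolution runs directly in the channels_first (NCDHW) layout. The `(op, data_format)` pair representation is an illustrative simplification, not the optimizer's actual graph representation.

```python
# Toy layout-optimizer sketch (illustrative only): on GPU, remove the
# Transpose pair stock TensorFlow inserts around Conv3D and run the
# convolution directly in the channels_first (NCDHW) layout.
def optimize_layout(nodes, device="GPU"):
    """nodes: list of (op, data_format) pairs in execution order."""
    out, i = [], 0
    while i < len(nodes):
        window = [op for op, _ in nodes[i:i + 3]]
        if device == "GPU" and window == ["Transpose", "Conv3D", "Transpose"]:
            # NCDHW -> NDHWC transpose, NDHWC conv, NDHWC -> NCDHW transpose
            # collapses to a single channels_first convolution.
            out.append(("Conv3D", "NCDHW"))
            i += 3
        else:
            out.append(nodes[i])
            i += 1
    return out

graph = [("Transpose", None), ("Conv3D", "NDHWC"), ("Transpose", None)]
print(optimize_layout(graph))
# -> [('Conv3D', 'NCDHW')]
```

On CPU (or any non-GPU device) the sketch leaves the graph unchanged, matching the motivation above: the transposes are only redundant where channels_first convolution is supported.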