Deep Neural Network Library (DNNL)  1.2.0
Performance library for Deep Learning
Convolution

API Reference

The convolution primitive computes forward, backward, or weight update for a batched convolution operation on 1D, 2D, or 3D spatial data with bias.

The convolution operation is defined by the following formulas. We show formulas only for 2D spatial data which are straightforward to generalize to cases of higher and lower dimensions. Variable names follow the standard Naming Conventions.

Let $$src$$, $$weights$$ and $$dst$$ be $$N \times IC \times IH \times IW$$, $$OC \times IC \times KH \times KW$$, and $$N \times OC \times OH \times OW$$ tensors respectively. Let $$bias$$ be a 1D tensor with $$OC$$ elements.

Furthermore, let the remaining convolution parameters be:

| Parameter | Depth | Height | Width | Comment |
| --- | --- | --- | --- | --- |
| Padding: front, top, and left | $$PD_L$$ | $$PH_L$$ | $$PW_L$$ | In the API we use padding_l to indicate the corresponding vector of paddings (_l in the name stands for left) |
| Padding: back, bottom, and right | $$PD_R$$ | $$PH_R$$ | $$PW_R$$ | In the API we use padding_r to indicate the corresponding vector of paddings (_r in the name stands for right) |
| Stride | $$SD$$ | $$SH$$ | $$SW$$ | Non-strided convolution should have the stride parameters equal to 1 |
| Dilation | $$DD$$ | $$DH$$ | $$DW$$ | Dilation starts with 0, so non-dilated convolution should have the dilation parameters equal to 0 |

The following formulas show how DNNL computes convolutions. They are broken down into several types to simplify the exposition, but in reality the convolution types can be combined.

To further simplify the formulas, we assume that $$src(n, ic, ih, iw) = 0$$ if $$ih < 0$$, or $$ih \geq IH$$, or $$iw < 0$$, or $$iw \geq IW$$.

### Forward

#### Regular Convolution

$dst(n, oc, oh, ow) = bias(oc) + \\ + \sum_{ic=0}^{IC-1}\sum_{kh=0}^{KH-1}\sum_{kw=0}^{KW-1} src(n, ic, oh \cdot SH + kh - PH_L, ow \cdot SW + kw - PW_L) \cdot weights(oc, ic, kh, kw).$

Here:

• $$OH = \left\lfloor{\frac{IH - KH + PH_L + PH_R}{SH}} \right\rfloor + 1,$$
• $$OW = \left\lfloor{\frac{IW - KW + PW_L + PW_R}{SW}} \right\rfloor + 1.$$
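As an illustration of the regular forward convolution, the following is a minimal C++ reference sketch. It uses the output extent floor((I - K + P_L + P_R) / S) + 1, which is what the dilated formulas in this document reduce to for zero dilation. The function names `conv_out_size` and `conv2d_ref` and the dense row-major layouts are assumptions made for this example; this is not how DNNL computes convolutions internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Output extent along one spatial axis.
// i: input size, k: kernel size, pl/pr: left/right padding, s: stride.
inline int conv_out_size(int i, int k, int pl, int pr, int s) {
    assert(s >= 1 && i + pl + pr >= k);
    return (i - k + pl + pr) / s + 1; // numerator is non-negative, so / is a floor
}

// Naive reference for the regular forward convolution (illustrative only).
// src is N x IC x IH x IW, weights is OC x IC x KH x KW, both dense row-major.
std::vector<float> conv2d_ref(const std::vector<float> &src,
        const std::vector<float> &weights, const std::vector<float> &bias,
        int N, int IC, int IH, int IW, int OC, int KH, int KW,
        int SH, int SW, int PH_L, int PH_R, int PW_L, int PW_R) {
    const int OH = conv_out_size(IH, KH, PH_L, PH_R, SH);
    const int OW = conv_out_size(IW, KW, PW_L, PW_R, SW);
    std::vector<float> dst((std::size_t)N * OC * OH * OW);
    for (int n = 0; n < N; ++n)
    for (int oc = 0; oc < OC; ++oc)
    for (int oh = 0; oh < OH; ++oh)
    for (int ow = 0; ow < OW; ++ow) {
        float acc = bias[oc];
        for (int ic = 0; ic < IC; ++ic)
        for (int kh = 0; kh < KH; ++kh)
        for (int kw = 0; kw < KW; ++kw) {
            const int ih = oh * SH + kh - PH_L;
            const int iw = ow * SW + kw - PW_L;
            // Source elements outside the image are treated as zero.
            if (ih < 0 || ih >= IH || iw < 0 || iw >= IW) continue;
            acc += src[((std::size_t)(n * IC + ic) * IH + ih) * IW + iw]
                    * weights[((std::size_t)(oc * IC + ic) * KH + kh) * KW + kw];
        }
        dst[((std::size_t)(n * OC + oc) * OH + oh) * OW + ow] = acc;
    }
    return dst;
}
```

For example, a 224-wide input with a 7-wide kernel, padding 3 on each side, and stride 2 gives `conv_out_size(224, 7, 3, 3, 2) == 112`.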

#### Convolution with Groups

In the API, DNNL adds a separate groups dimension to memory objects representing weights tensors and represents weights as $$G \times OC_G \times IC_G \times KH \times KW$$ 5D tensors for 2D convolutions with groups.

$dst(n, g \cdot OC_G + oc_g, oh, ow) = bias(g \cdot OC_G + oc_g) + \\ + \sum_{ic_g=0}^{IC_G-1}\sum_{kh=0}^{KH-1}\sum_{kw=0}^{KW-1} src(n, g \cdot IC_G + ic_g, oh + kh - PH_L, ow + kw - PW_L) \cdot weights(g, oc_g, ic_g, kh, kw),$

where

• $$IC_G = \frac{IC}{G}$$,
• $$OC_G = \frac{OC}{G}$$, and
• $$oc_g \in [0, OC_G).$$

The case when $$OC_G = IC_G = 1$$ is also known as a depthwise convolution.
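The channel bookkeeping in the grouped formula can be sketched as follows. This is a naive reference under the same simplifications as the formula (unit stride, no dilation); the name `grouped_conv2d_ref` and the dense row-major layouts are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive grouped forward convolution (unit stride, no dilation).
// src is N x IC x IH x IW; weights is G x OC_G x IC_G x KH x KW; dense row-major.
std::vector<float> grouped_conv2d_ref(const std::vector<float> &src,
        const std::vector<float> &weights, const std::vector<float> &bias,
        int N, int IC, int IH, int IW, int OC, int G, int KH, int KW,
        int PH_L, int PH_R, int PW_L, int PW_R) {
    assert(G >= 1 && IC % G == 0 && OC % G == 0);
    const int IC_G = IC / G, OC_G = OC / G;
    const int OH = IH - KH + PH_L + PH_R + 1;
    const int OW = IW - KW + PW_L + PW_R + 1;
    assert(OH > 0 && OW > 0);
    std::vector<float> dst((std::size_t)N * OC * OH * OW);
    for (int n = 0; n < N; ++n)
    for (int g = 0; g < G; ++g)
    for (int oc_g = 0; oc_g < OC_G; ++oc_g)
    for (int oh = 0; oh < OH; ++oh)
    for (int ow = 0; ow < OW; ++ow) {
        const int oc = g * OC_G + oc_g; // destination channel of this group
        float acc = bias[oc];
        for (int ic_g = 0; ic_g < IC_G; ++ic_g)
        for (int kh = 0; kh < KH; ++kh)
        for (int kw = 0; kw < KW; ++kw) {
            const int ic = g * IC_G + ic_g; // only channels of the same group
            const int ih = oh + kh - PH_L, iw = ow + kw - PW_L;
            if (ih < 0 || ih >= IH || iw < 0 || iw >= IW) continue;
            const std::size_t w_off =
                    ((((std::size_t)g * OC_G + oc_g) * IC_G + ic_g) * KH + kh) * KW + kw;
            acc += src[((std::size_t)(n * IC + ic) * IH + ih) * IW + iw] * weights[w_off];
        }
        dst[((std::size_t)(n * OC + oc) * OH + oh) * OW + ow] = acc;
    }
    return dst;
}
```

With `G == IC == OC`, each output channel reads exactly one input channel, which is the depthwise case.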

#### Convolution with Dilation

$dst(n, oc, oh, ow) = bias(oc) + \\ + \sum_{ic=0}^{IC-1}\sum_{kh=0}^{KH-1}\sum_{kw=0}^{KW-1} src(n, ic, oh + kh \cdot (DH + 1) - PH_L, ow + kw \cdot (DW + 1) - PW_L) \cdot weights(oc, ic, kh, kw).$

Here:

• $$OH = \left\lfloor{\frac{IH - DKH + PH_L + PH_R}{SH}} \right\rfloor + 1,$$ where $$DKH = 1 + (KH - 1) \cdot (DH + 1)$$, and
• $$OW = \left\lfloor{\frac{IW - DKW + PW_L + PW_R}{SW}} \right\rfloor + 1,$$ where $$DKW = 1 + (KW - 1) \cdot (DW + 1)$$.
Note
In DNNL, a dilation parameter equal to 0 means no dilation, that is, a regular convolution. Other libraries might use a different convention in which a dilation parameter equal to 1 indicates no dilation.
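The dilated kernel extent and the resulting output size can be sketched as below. The helper names are hypothetical, and `to_dnnl_dilation` simply assumes the other convention is the one-based one described in the note.

```cpp
#include <cassert>

// Effective kernel extent: DK = 1 + (K - 1) * (D + 1), where D = 0 means no dilation.
inline int dilated_kernel_size(int k, int d) {
    assert(k >= 1 && d >= 0);
    return 1 + (k - 1) * (d + 1);
}

// Output extent: floor((I - DK + P_L + P_R) / S) + 1.
inline int dilated_conv_out_size(int i, int k, int pl, int pr, int s, int d) {
    const int dk = dilated_kernel_size(k, d);
    assert(s >= 1 && i + pl + pr >= dk);
    return (i - dk + pl + pr) / s + 1;
}

// Hypothetical helper: converts a dilation expressed in the "1 means no
// dilation" convention to the zero-based convention.
inline int to_dnnl_dilation(int one_based_dilation) {
    assert(one_based_dilation >= 1);
    return one_based_dilation - 1;
}
```

For instance, a 3-wide kernel with a zero-based dilation of 1 covers 5 input elements, so a 7-wide input without padding and with unit stride yields 3 outputs.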

#### Deconvolution (Transposed Convolution)

Deconvolutions (also called fractionally strided convolutions or transposed convolutions) work by swapping the forward and backward passes of a convolution. One way to put it is to note that the weights define a convolution, but whether it is a direct convolution or a transposed convolution is determined by how the forward and backward passes are computed.
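One way to see the swap of passes is a single-channel 1D sketch: the transposed convolution walks the same index pattern as the forward convolution but scatters instead of gathering, which makes it the adjoint of the forward operation. The names `conv1d`, `deconv1d`, and `dot` are made up for this example, and padding is omitted to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Forward 1D convolution, no padding: y[o] = sum_k x[o * s + k] * w[k].
std::vector<double> conv1d(const std::vector<double> &x,
        const std::vector<double> &w, std::size_t s) {
    assert(s >= 1 && !w.empty() && x.size() >= w.size());
    const std::size_t O = (x.size() - w.size()) / s + 1;
    std::vector<double> y(O, 0.0);
    for (std::size_t o = 0; o < O; ++o)
        for (std::size_t k = 0; k < w.size(); ++k)
            y[o] += x[o * s + k] * w[k];
    return y;
}

// Transposed convolution with the same weights: each input element is
// scattered into the output. The output length here is (O - 1) * s + K.
std::vector<double> deconv1d(const std::vector<double> &y,
        const std::vector<double> &w, std::size_t s) {
    assert(s >= 1 && !y.empty() && !w.empty());
    std::vector<double> x((y.size() - 1) * s + w.size(), 0.0);
    for (std::size_t o = 0; o < y.size(); ++o)
        for (std::size_t k = 0; k < w.size(); ++k)
            x[o * s + k] += y[o] * w[k];
    return x;
}

// Inner product, used to check the adjoint relation <conv(x), y> == <x, deconv(y)>.
inline double dot(const std::vector<double> &a, const std::vector<double> &b) {
    assert(a.size() == b.size());
    double r = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) r += a[i] * b[i];
    return r;
}
```

The adjoint relation holds whenever the sizes line up, for example for a 5-element `x`, a 3-element `w`, and stride 2.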

#### Difference Between Forward Training and Forward Inference

There is no difference between the dnnl_forward_training and dnnl_forward_inference propagation kinds.

### Backward

The backward propagation computes $$diff\_src$$ based on $$diff\_dst$$ and $$weights$$.

The weights update computes $$diff\_weights$$ and $$diff\_bias$$ based on $$diff\_dst$$ and $$src$$.
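For intuition, the weights update can be sketched for a single-channel 1D convolution without padding, y[o] = bias + sum_k src[o * s + k] * w[k]. Differentiating that expression gives the sums below. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct WeightsGrad {
    std::vector<double> diff_weights;
    double diff_bias;
};

// diff_weights[k] = sum_o diff_dst[o] * src[o * s + k]
// diff_bias       = sum_o diff_dst[o]
WeightsGrad conv1d_weights_update(const std::vector<double> &src,
        const std::vector<double> &diff_dst, std::size_t K, std::size_t s) {
    assert(K >= 1 && s >= 1 && !diff_dst.empty());
    assert((diff_dst.size() - 1) * s + K <= src.size());
    WeightsGrad g{std::vector<double>(K, 0.0), 0.0};
    for (std::size_t o = 0; o < diff_dst.size(); ++o) {
        g.diff_bias += diff_dst[o];
        for (std::size_t k = 0; k < K; ++k)
            g.diff_weights[k] += diff_dst[o] * src[o * s + k];
    }
    return g;
}
```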

Note
The optimized memory formats for $$src$$ and $$weights$$ might differ between forward propagation, backward propagation, and weights update.

## Implementation Details

N/A.

### Data Types

Convolution primitive supports the following combination of data types for source, destination, and weights memory objects:

| Propagation | Source | Weights | Destination | Bias |
| --- | --- | --- | --- | --- |
| forward / backward | f32 | f32 | f32 | f32 |
| forward | f16 | f16 | f16 | f16 |
| forward | u8, s8 | s8 | u8, s8, s32, f32 | u8, s8, s32, f32 |
| forward | bf16 | bf16 | f32, bf16 | f32, bf16 |
| backward | f32, bf16 | bf16 | bf16 | |
| weights update | bf16 | f32, bf16 | bf16 | f32, bf16 |
Warning
There might be hardware-specific and/or implementation-specific restrictions. See the Implementation Limitations section below.

### Data Representation

Like other CNN primitives, the convolution primitive expects the following tensors:

| Spatial | Source / Destination | Weights |
| --- | --- | --- |
| 1D | $$N \times C \times W$$ | $$[G \times ] OC \times IC \times KW$$ |
| 2D | $$N \times C \times H \times W$$ | $$[G \times ] OC \times IC \times KH \times KW$$ |
| 3D | $$N \times C \times D \times H \times W$$ | $$[G \times ] OC \times IC \times KD \times KH \times KW$$ |

Physical format of data and weights memory objects is critical for convolution primitive performance. In the DNNL programming model, convolution is one of the few primitives that support the placeholder memory format tag dnnl::memory::format_tag::any (shortened to any from now on) and can define data and weight memory objects format based on the primitive parameters. When using any it is necessary to first create a convolution primitive descriptor and then query it for the actual data and weight memory objects formats.

While convolution primitives can be created with memory formats specified explicitly, the performance is likely to be suboptimal.

The table below shows the combinations of plain memory formats for which the convolution primitive is optimized.

| Spatial | Convolution Type | Data / Weights logical tensor | Implementation optimized for memory formats |
| --- | --- | --- | --- |
| 1D, 2D, 3D | | any | optimized |
| 1D | f32, bf16 | NCW / OIW, GOIW | dnnl_ncw (dnnl_abc) / dnnl_oiw (dnnl_abc), dnnl_goiw (dnnl_abcd) |
| 1D | int8 | NCW / OIW | dnnl_nwc (dnnl_acb) / dnnl_wio (dnnl_cba) |
| 2D | f32, bf16 | NCHW / OIHW, GOIHW | dnnl_nchw (dnnl_abcd) / dnnl_oihw (dnnl_abcd), dnnl_goihw (dnnl_abcde) |
| 2D | int8 | NCHW / OIHW, GOIHW | dnnl_nhwc (dnnl_acdb) / dnnl_hwio (dnnl_cdba), dnnl_hwigo (dnnl_decab) |
| 3D | f32, bf16 | NCDHW / OIDHW, GOIDHW | dnnl_ncdhw (dnnl_abcde) / dnnl_oidhw (dnnl_abcde), dnnl_goidhw (dnnl_abcdef) |
| 3D | int8 | NCDHW / OIDHW | dnnl_ndhwc (dnnl_acdeb) / dnnl_dhwio (dnnl_cdeba) |
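To make the difference between two of the plain formats concrete, the sketch below computes where the same logical element (n, c, h, w) lands in a dense nchw (abcd) buffer and in a dense nhwc (acdb) buffer. The function names are illustrative; the letters name the logical dimensions in the order they are laid out in memory, outermost first.

```cpp
#include <cassert>
#include <cstddef>

// nchw (abcd): width is the innermost, fastest-varying dimension.
inline std::size_t offset_nchw(std::size_t n, std::size_t c, std::size_t h,
        std::size_t w, std::size_t C, std::size_t H, std::size_t W) {
    assert(c < C && h < H && w < W);
    return ((n * C + c) * H + h) * W + w;
}

// nhwc (acdb): channels are the innermost, fastest-varying dimension.
inline std::size_t offset_nhwc(std::size_t n, std::size_t c, std::size_t h,
        std::size_t w, std::size_t C, std::size_t H, std::size_t W) {
    assert(c < C && h < H && w < W);
    return ((n * H + h) * W + w) * C + c;
}
```

The logical tensor is the same in both cases; only the physical placement of its elements changes.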

### Post-ops and Attributes

Post-ops and attributes enable you to modify the behavior of the convolution primitive by applying the output scale to the result of the primitive and by chaining certain operations after the primitive. The following attributes and post-ops are supported:

| Propagation | Type | Operation | Restrictions | Description |
| --- | --- | --- | --- | --- |
| forward | attribute | Output scale | int8 convolutions only | Scales the result of convolution by given scale factor(s) |
| forward | post-op | eltwise | | Applies an Eltwise operation to the result |
| forward | post-op | sum | | Adds the operation result to the destination tensor instead of overwriting it |
Note
The library doesn't prevent using post-ops in training, but note that not all post-ops are feasible for training usage. For instance, using ReLU with non-zero negative slope parameter as a post-op would not produce an additional output workspace that is required to compute backward propagation correctly. Hence, in this particular case one should use separate convolution and eltwise primitives for training.

The following post-ops chaining is supported by the library:

| Type of convolutions | Post-ops sequence supported |
| --- | --- |
| f32 and bf16 convolution | eltwise, sum, sum -> eltwise |
| int8 convolution | eltwise, sum, sum -> eltwise, eltwise -> sum |

The attributes and post-ops take effect in the following sequence:

• Output scale attribute,
• Post-ops, in the order they were attached.

Attributes and post-ops are applied in single-precision floating point. The result is converted to the destination data type just before it is stored.
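The sketch below follows one output element of an int8 convolution through that sequence: the accumulator is scaled in f32, a ReLU post-op is applied in f32, and only then is the value converted to u8. The rounding and saturation policy in `store_as_u8` is an assumption made for this example, not a statement about the library, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed conversion policy for this sketch: round to the nearest integer,
// then saturate to the u8 range [0, 255].
inline std::uint8_t store_as_u8(float v) {
    const float r = std::nearbyint(v);
    if (!(r > 0.0f)) return 0; // negatives, zero, and NaN all map to 0
    if (r >= 255.0f) return 255;
    return static_cast<std::uint8_t>(r);
}

// One output element: output scale, then the post-op, then a single conversion.
inline std::uint8_t finalize_u8(std::int32_t acc, float output_scale) {
    assert(std::isfinite(output_scale));
    float v = static_cast<float>(acc) * output_scale; // output scale attribute
    v = v > 0.0f ? v : 0.0f;                          // eltwise (ReLU) post-op in f32
    return store_as_u8(v);                            // conversion just before storing
}
```

Keeping the intermediate value in f32 means there is a single rounding step at the end rather than one per operation.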

#### Example 1

Consider the following pseudo code:

attribute attr;
attr.set_output_scale(alpha);
attr.set_post_ops({
        { sum={scale=beta} },
        { eltwise={scale=gamma, type=tanh, alpha=ignored, beta=ignored} }
    });
convolution_forward(src, weights, dst, attr);

That would lead to the following:

$dst(\overline{x}) = \gamma \cdot \tanh \left( \alpha \cdot conv(src, weights) + \beta \cdot dst(\overline{x}) \right)$

#### Example 2

The following pseudo code:

attribute attr;
attr.set_output_scale(alpha);
attr.set_post_ops({
        { eltwise={scale=gamma, type=relu, alpha=eta, beta=ignored} },
        { sum={scale=beta} }
    });
convolution_forward(src, weights, dst, attr);

That would lead to the following:

$dst(\overline{x}) = \beta \cdot dst(\overline{x}) + \gamma \cdot ReLU \left( \alpha \cdot conv(src, weights), \eta \right)$
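The two examples differ only in the order of the post-ops, which a small scalar model makes easy to check. This is an illustrative sketch, not the DNNL API: `PostOp` and `apply_chain` are hypothetical names, and ReLU with negative slope eta is taken to be x for x > 0 and eta * x otherwise.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct PostOp {
    enum Kind { sum, eltwise_tanh, eltwise_relu };
    Kind kind;
    float scale; // beta for sum, gamma for eltwise
    float alpha; // negative slope for ReLU, ignored otherwise
};

// conv: raw convolution result for one element; dst_old: value already in dst.
inline float apply_chain(float conv, float dst_old, float output_scale,
        const std::vector<PostOp> &ops) {
    assert(ops.size() <= 2); // the supported chains listed above have at most two post-ops
    float v = output_scale * conv; // the output scale is applied first
    for (const PostOp &op : ops) { // then post-ops, in attachment order
        switch (op.kind) {
            case PostOp::sum: v += op.scale * dst_old; break;
            case PostOp::eltwise_tanh: v = op.scale * std::tanh(v); break;
            case PostOp::eltwise_relu:
                v = op.scale * (v > 0.0f ? v : op.alpha * v);
                break;
        }
    }
    return v;
}
```

With conv = -2, dst = 4, alpha = 1, beta = 0.5, gamma = 2, and eta = 0.25, the chain eltwise -> sum from Example 2 gives 1, while attaching the same two post-ops as sum -> eltwise gives 0.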

## Algorithms

DNNL implements convolution primitives using several different algorithms:

• Direct. The convolution operation is computed directly using SIMD instructions. This is the algorithm used for most shapes and supports int8, f32 and bf16 data types.
• Winograd. This algorithm reduces computational complexity of convolution at the expense of accuracy loss and additional memory operations. The implementation is based on the Fast Algorithms for Convolutional Neural Networks by A. Lavin and S. Gray. The Winograd algorithm often results in the best performance, but it is applicable only to particular shapes. Moreover, Winograd only supports int8 and f32 data types.
• Implicit GEMM. The convolution operation is reinterpreted in terms of matrix-matrix multiplication by rearranging the source data into a scratchpad memory. This is a fallback algorithm that is dispatched automatically when other implementations are not available. GEMM convolution supports the int8, f32, and bf16 data types.
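The GEMM reinterpretation can be illustrated with an im2col-style rearrangement followed by a plain matrix multiplication. This is a textbook sketch for one image with unit stride and no padding; the names `im2col` and `gemm` are chosen for this example, and the library's actual scratchpad layout is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rearranges one image (IC x IH x IW) into an (IC*KH*KW) x (OH*OW) matrix.
// In this sketch the returned buffer plays the role of the scratchpad.
std::vector<float> im2col(const std::vector<float> &src,
        int IC, int IH, int IW, int KH, int KW) {
    assert(IH >= KH && IW >= KW);
    assert(src.size() == (std::size_t)IC * IH * IW);
    const int OH = IH - KH + 1, OW = IW - KW + 1;
    std::vector<float> col((std::size_t)IC * KH * KW * OH * OW);
    std::size_t row = 0;
    for (int ic = 0; ic < IC; ++ic)
    for (int kh = 0; kh < KH; ++kh)
    for (int kw = 0; kw < KW; ++kw, ++row)
        for (int oh = 0; oh < OH; ++oh)
        for (int ow = 0; ow < OW; ++ow)
            col[(row * OH + oh) * OW + ow]
                    = src[((std::size_t)ic * IH + oh + kh) * IW + ow + kw];
    return col;
}

// c (M x N) = a (M x K) * b (K x N), all dense row-major.
// For convolution: M = OC, K = IC*KH*KW, N = OH*OW.
std::vector<float> gemm(const std::vector<float> &a, const std::vector<float> &b,
        std::size_t M, std::size_t K, std::size_t N) {
    assert(a.size() == M * K && b.size() == K * N);
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t n = 0; n < N; ++n)
                c[m * N + n] += a[m * K + k] * b[k * N + n];
    return c;
}
```

Multiplying the OC x (IC*KH*KW) weights matrix by the column matrix yields the OC x (OH*OW) destination, at the cost of the extra buffer.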

#### Direct Algorithm

DNNL supports the direct convolution algorithm on all supported platforms for the following conditions:

• Data and weights memory formats are defined by the convolution primitive (user passes any).
• The number of channels per group is a multiple of SIMD width for grouped convolutions.
• For each spatial direction padding does not exceed one half of the corresponding dimension of the weights tensor.
• Weights tensor width does not exceed 14.

In case any of these constraints are not met, the implementation will silently fall back to an explicit GEMM algorithm.
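The conditions listed for the direct algorithm can be written down as a simple predicate. This is only a restatement of the list for illustration, not the library's dispatch logic: `DirectQuery` and `direct_conditions_met` are hypothetical, the SIMD width is supplied by the caller, and "one half" is interpreted here as 2 * padding <= kernel size.

```cpp
#include <cassert>

struct DirectQuery {
    bool formats_any;       // data and weights were created with the any format tag
    int groups;             // 1 means no groups
    int channels_per_group; // checked only when groups > 1
    int simd_width;         // assumption: number of vector lanes, given by the caller
    int kernel[3];          // KD, KH, KW (use 1 for unused axes)
    int padding[3];         // largest of the left and right padding per axis
};

inline bool direct_conditions_met(const DirectQuery &q) {
    assert(q.simd_width > 0 && q.groups >= 1);
    if (!q.formats_any) return false;
    if (q.groups > 1 && q.channels_per_group % q.simd_width != 0) return false;
    for (int d = 0; d < 3; ++d)
        if (2 * q.padding[d] > q.kernel[d]) return false; // padding above half the kernel
    return q.kernel[2] <= 14; // weights tensor width limit
}
```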

#### Winograd Algorithm

DNNL supports the Winograd convolution algorithm on systems with Intel(R) AVX-512 support and above under the following conditions:

• Data and weights memory formats are defined by the convolution primitive (user passes any as the data format).
• The spatial domain is two-dimensional.
• The weights shape is 3x3, there are no groups, dilation or strides ( $$KH = KW = 3$$, $$SH = SW = 1$$, and $$DH = DW = 0$$).
• The data type is either int8 or f32.

In case any of these constraints is not met, the implementation will silently fall back to the direct algorithm.
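The Winograd conditions can likewise be restated as a predicate over the convolution parameters. The types and names below are hypothetical and the hardware requirement is left out; the sketch only encodes the shape and data type conditions from the list.

```cpp
#include <cassert>

enum class DataType { f32, bf16, s8, u8 };

struct WinogradQuery {
    bool formats_any; // data and weights were created with the any format tag
    int spatial_dims; // 1, 2, or 3
    int groups;       // 1 means no groups
    int KH, KW, SH, SW, DH, DW;
    DataType src_type;
};

inline bool winograd_conditions_met(const WinogradQuery &q) {
    assert(q.spatial_dims >= 1 && q.spatial_dims <= 3);
    const bool int8_or_f32 = q.src_type == DataType::f32
            || q.src_type == DataType::s8 || q.src_type == DataType::u8;
    return q.formats_any && q.spatial_dims == 2 && q.groups == 1
            && q.KH == 3 && q.KW == 3  // 3x3 weights
            && q.SH == 1 && q.SW == 1  // no strides
            && q.DH == 0 && q.DW == 0  // no dilation (zero-based convention)
            && int8_or_f32;
}
```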

The Winograd convolution algorithm implementation additionally chooses tile size based on the problem shape and propagation kind:

• For forward_inference DNNL supports $$F(2 \times 2, 3 \times 3)$$ or $$F(4 \times 4, 3 \times 3)$$
• DNNL supports only $$F(4 \times 4, 3 \times 3)$$ Winograd for all the training propagation kinds.
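The tile sizes determine how much arithmetic Winograd saves. Following Lavin and Gray, F(m x m, r x r) needs (m + r - 1)^2 multiplications per output tile, compared with m^2 * r^2 for the direct computation. The helper names below are made up for this worked example, and the counts ignore the cost of the input, weights, and output transforms.

```cpp
#include <cassert>

// Multiplications per m x m output tile with Winograd F(m x m, r x r).
inline int winograd_mults(int m, int r) {
    assert(m >= 1 && r >= 1);
    const int t = m + r - 1; // side of the transformed input tile
    return t * t;
}

// Multiplications per m x m output tile with the direct computation.
inline int direct_mults(int m, int r) {
    assert(m >= 1 && r >= 1);
    return m * m * r * r;
}

// Ratio of direct to Winograd multiplications.
inline double mult_reduction(int m, int r) {
    return static_cast<double>(direct_mults(m, r)) / winograd_mults(m, r);
}
```

For F(2 x 2, 3 x 3) this is 36 versus 16 multiplications (a 2.25x reduction), and for F(4 x 4, 3 x 3) it is 144 versus 36 (a 4x reduction).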

The following side effects should be weighed against the (potential) performance boost achieved from using the Winograd algorithm:

• Memory consumption. Winograd implementation in DNNL requires additional scratchpad memory to store intermediate results. As more convolutions using Winograd are added to the topology, the amount of memory required can grow significantly. This growth can be controlled if the scratchpad memory can be reused across multiple primitives. See Primitive Attributes: Scratchpad for more details.
• Accuracy. In some cases Winograd convolution produces results that are significantly less accurate than results from the direct convolution.

Create a Winograd convolution by creating a convolution descriptor (step 6 in the simple network example) that specifies the Winograd algorithm. The rest of the steps are exactly the same.

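auto conv1_desc = convolution_forward::desc(
        prop_kind::forward_inference, algorithm::convolution_winograd,
        conv1_src_md, conv1_weights_md, conv1_bias_md, conv1_dst_md,
        conv1_strides, conv1_padding_l, conv1_padding_r); // stride and padding variable names are assumed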

#### Automatic Algorithm Selection

DNNL supports the dnnl::algorithm::convolution_auto algorithm, which instructs the library to automatically select the best algorithm based on heuristics that take into account tensor shapes and the number of logical processors available. (For automatic selection to work as intended, use the same thread affinity settings when creating the convolution as when executing the convolution.)

## Implementation Limitations

1. Refer to Data Types for limitations related to data types support.
2. CPU
• Winograd convolutions are implemented only for the Intel(R) AVX-512 or Intel(R) AVX512-DL Boost instruction sets
3. GPU
• No support for Winograd algorithm

## Performance Tips

• Use dnnl::memory::format_tag::any for the source, weights, and destination memory format tags when creating a convolution primitive to allow the library to choose the most appropriate memory format.