Split SGD
Intel's optimization focus is not limited to inference workloads; training workloads are also within scope. As part of this effort, optimizing the training optimizer functions is an important aspect. These optimizations are implemented as a mechanism called Split SGD, which takes advantage of the BFloat16 data type and operator fusion. The adagrad, lamb and sgd optimizers are supported.
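As a usage sketch, these training optimizations are typically enabled through the ipex.optimize frontend; the model and optimizer below are illustrative placeholders, not part of the original text.

import torch
import intel_extension_for_pytorch as ipex

# Placeholder model and optimizer for illustration only
model = torch.nn.Linear(64, 16)
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Apply the BFloat16 and fused/split optimizer optimizations for training
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)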
BFloat16
The figure below shows the definition of the Float32 (top) and BFloat16 (bottom) data types. Compared to Float32, BFloat16 is only half as long and thus saves half the memory. It is supported natively at the instruction set level to boost deep learning workloads starting from the 3rd Generation of Xeon Scalable Processors. It is highly compatible with Float32: both have the same bit length for the "sign" and "exponent" parts, but BFloat16 has only a 7-bit "mantissa" part while Float32 has 23 bits. This gives BFloat16 the same capacity to represent "digit ranges" as Float32, but a shorter "precision" part.

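A quick way to see the two layouts side by side is to print the raw bit patterns with PyTorch; the value 3.14159265 below is arbitrary and only for illustration.

import torch

x = torch.tensor([3.14159265], dtype=torch.float32)
fp32_bits = x.view(torch.int32).item() & 0xFFFFFFFF
bf16_bits = x.to(torch.bfloat16).view(torch.int16).item() & 0xFFFF

print(f"float32:  {fp32_bits:032b}")   # 1 sign bit | 8 exponent bits | 23 mantissa bits
print(f"bfloat16: {bf16_bits:016b}")   # 1 sign bit | 8 exponent bits |  7 mantissa bits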
The advantage of BFloat16 is that it saves memory and reduces computation workload, but the fewer mantissa bits bring negative effects as well. Let's use an "ADD" operation as an example to explain the disadvantage. To add two floating point numbers, we need to shift the mantissa part of one of them left or right to align their exponent parts. Since BFloat16 has a shorter mantissa part, it is much more likely than Float32 to lose mantissa bits after the shifting, which causes accuracy issues.
Let's use the following two decimal numbers x and y as an example. We first do the calculation in a high precision data type (10 valid digits after the decimal point).
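(The exact values of the original example are not reproduced here; the illustrative pair below, with x five orders of magnitude larger than y, behaves the same way.)

x     = 0.1234500000 * 10^10
y     = 0.1234500000 * 10^5  = 0.0000012345 * 10^10 (y shifted right by 5 digits to align exponents)
x + y = 0.1234512345 * 10^10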
This makes sense because after shifting y right by 5 digits, the fraction part is still there.
Then, let's do the calculation in a low precision data type (5 valid digits after the decimal point).
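(Continuing with the same illustrative values:)

x     = 0.12345 * 10^10
y     = 0.12345 * 10^5  = 0.0000012345 * 10^10, which truncates to 0.00000 * 10^10 with only 5 valid digits
x + y = 0.12345 * 10^10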
Since the data type has only 5 digits for the fraction part, after shifting y by 5 digits, its fraction part is fully removed. This brings accuracy loss, a drawback that lower precision data types have by their nature.
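The same loss shows up with BFloat16's short mantissa. Here is a minimal PyTorch check with arbitrarily chosen values (both exactly representable in BFloat16):

import torch

x = torch.tensor(1024.0)   # 2**10
y = torch.tensor(0.0625)   # 2**-4, far below the BFloat16 mantissa spacing at 1024

print((x + y).item())                                 # Float32 keeps the small addend: 1024.0625
print((x.bfloat16() + y.bfloat16()).float().item())   # BFloat16 loses it after alignment: 1024.0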
Stochastic Gradient Descent (SGD)
Basically, training involves 3 steps:
Forward propagation: Perform inference once and compare the result with the ground truth to get a loss number.
Backward propagation: Use the chain rule to calculate gradients of the parameters based on the loss number.
Parameter update: Update the values of the parameters using the calculated gradients.
Training is actually a loop of these 3 steps in sequence until the loss number meets the requirement or a predetermined timeout duration is reached. Stochastic Gradient Descent (SGD) is the most widely used optimizer for the 3rd step, the parameter update. In its simplest form, this step can be described by the following formula:

W = W - α * gW

where W denotes the parameters to be updated, gW denotes the gradients of the parameters obtained from backward propagation, and α denotes the learning rate.
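For reference, here is a minimal sketch of one iteration of this loop in plain PyTorch, with a hand-written update matching the formula above; model, loss_fn, data, target and lr are assumed placeholders.

import torch

output = model(data)              # 1. forward propagation
loss = loss_fn(output, target)    #    compare with ground truth to get the loss number
loss.backward()                   # 2. backward propagation: chain rule produces w.grad for each parameter
with torch.no_grad():
    for w in model.parameters():  # 3. parameter update: W = W - α * gW
        w -= lr * w.grad
        w.grad = None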
Split SGD
Since the addition applied in SGD is repeated again and again, the drawback of low precision data types described above applies: if both the parameters and the gradients are stored in BFloat16, the small gradient updates are very likely to be rounded away, making the training results inaccurate. Split SGD is designed to avoid this while still benefiting from BFloat16.
The idea is to “split” a 32-bit floating point number into 2 parts:
Top half: First 16 bits can be viewed as exactly a BFloat16 number.
Bottom half: Last 16 bits are still kept to avoid accuracy loss.
FP32 parameters are split into a "Top half" and a "Bottom half". When performing forward and backward propagation, the Top halves are used to benefit from Intel's native BFloat16 support. When performing the parameter update with SGD, we concatenate the Top half and the Bottom half to recover the parameters back to FP32 and then perform the regular SGD operations.
It is a common practice to use FP32 master parameters in order to avoid round-off errors in the BF16 parameter update. Split SGD is an optimization that stores these FP32 master parameters with a reduced memory footprint.
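An illustrative per-parameter accounting, assuming the usual scheme keeps a separate FP32 master copy next to a BF16 working copy:

# Bytes per parameter (illustrative)
fp32_master_plus_bf16_copy = 4 + 2   # separate FP32 master weight + BF16 working copy
split_top_plus_trail       = 2 + 2   # BF16 top half (also the working copy) + 16-bit trail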

The following pseudo code illustrates the process of Split SGD.
# Recover the FP32 master weight from its BF16 top half and the 16-bit trail
fp32_w = concat_fp32_from_bf16(bf16_w, trail)
# Promote the BF16 gradient to FP32 for the update
fp32_gw = bf16_gw.float()
# Plain SGD step (weight_decay and momentum omitted)
fp32_w -= α * fp32_gw
# Split the updated FP32 master weight back into the BF16 top half and the trail
bf16_w, trail = split_bf16_from_fp32(fp32_w)
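The helper names above come from the pseudo code; the implementation below is only an illustrative sketch of what such a split and concatenation could look like in plain PyTorch (assuming a little-endian host and dtype-reinterpreting Tensor.view), not the actual fused kernels.

import torch

def split_bf16_from_fp32(fp32_w: torch.Tensor):
    # View each 32-bit float as two 16-bit halves; on a little-endian host
    # the low 16 bits come first in memory, the high 16 bits second.
    halves = fp32_w.view(torch.int16).reshape(*fp32_w.shape, 2)
    trail = halves[..., 0].contiguous()                         # bottom half: low 16 bits
    bf16_w = halves[..., 1].contiguous().view(torch.bfloat16)   # top half: the BF16 bit pattern
    return bf16_w, trail

def concat_fp32_from_bf16(bf16_w: torch.Tensor, trail: torch.Tensor):
    # Interleave (trail, top) pairs and reinterpret each pair as one FP32 value
    halves = torch.stack((trail, bf16_w.view(torch.int16)), dim=-1)
    return halves.view(torch.float32).reshape(bf16_w.shape)

# The round trip is lossless: the two halves together hold the full FP32 bit pattern.
w = torch.randn(4)
bf16_w, trail = split_bf16_from_fp32(w)
assert torch.equal(concat_fp32_from_bf16(bf16_w, trail), w)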