Optimizer Fusion on GPU

Introduction

As with TorchScript, operation fusion reduces the number of operators to be executed and the overhead that comes with launching each of them. Intel® Extension for PyTorch* applies the same methodology to optimizer optimization: at the current stage, SGD/Adam/AdamW/Lamb/Lars fusion is supported for both FP32 and BF16.
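
In practice, the fused optimizers are enabled through the ipex.optimize frontend rather than being called directly. The snippet below is a minimal sketch of that flow under common assumptions (a model named MyModel is made up here, and exact arguments can differ between releases); it is not a definitive recipe.

    import torch
    import intel_extension_for_pytorch as ipex

    model = MyModel().to("xpu")  # hypothetical model, moved to the XPU device
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # ipex.optimize returns an optimized model/optimizer pair; with a BF16
    # data type the fused BF16 optimizer path can be used under the hood.
    model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)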

Let’s examine the SGD update as an example.


    # original (unfused) version of the SGD update
    if weight_decay != 0:
        grad = grad.add(param, alpha=weight_decay)
    if momentum != 0:
        buf = momentum_buffer_list[i]
        if buf is None:
            buf = torch.clone(grad).detach()
            momentum_buffer_list[i] = buf
        else:
            buf.mul_(momentum).add_(grad, alpha=1 - dampening)
        if nesterov:
            grad = grad.add(buf, alpha=momentum)
        else:
            grad = buf

    param.add_(grad, alpha=-lr)

Operation Fusion

One problem with the native implementation above is that the storages of grad, param, and buf are accessed several times. Each element-wise clause is a separate pass over parameter-sized tensors, and for large topologies grad and param may not fit in cache. When a later clause touches grad again, the processor has to read the data back from slow main memory instead of the much faster cache. This memory-bound bottleneck prevents good performance.

Operation fusion is a way to solve this problem. The clauses in the pseudo code above are all element-wise operations, so they can be fused into a single operation, as in the pseudo code below.

    # fused version
    sgd_fused_step(param, grad, buf, ...(other args))

After fusion, the single operation sgd_fused_step provides equivalent functionality with much better performance than the original version of the SGD update.
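
To make the memory-access argument concrete, the sketch below spells out the fused step as a single element-wise pass in plain Python. It is only an illustration under simplifying assumptions (contiguous tensors, a momentum buffer that is already initialized, and a made-up name sgd_fused_step_sketch); the actual sgd_fused_step is a native fused kernel, not a Python loop.

    import torch

    def sgd_fused_step_sketch(param, grad, buf, *, weight_decay, momentum,
                              dampening, nesterov, lr):
        # One traversal of the flattened storages: each element of grad/param/buf
        # is read once and each element of param/buf is written once, which is the
        # access pattern the fused kernel realizes with vectorized device code.
        p, g, b = param.view(-1), grad.view(-1), buf.view(-1)
        for i in range(p.numel()):
            d = g[i] + weight_decay * p[i] if weight_decay != 0 else g[i]
            if momentum != 0:
                b[i] = momentum * b[i] + (1 - dampening) * d
                d = d + momentum * b[i] if nesterov else b[i]
            p[i] = p[i] - lr * d

Written this way, it is clear that the data only needs to travel between memory and the compute units once per optimizer step, which is exactly the benefit the fused kernel delivers.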