Horovod with PyTorch (Experimental)
Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. Horovod's core principles are based on MPI concepts such as size, rank, local rank, allreduce, allgather, broadcast, and alltoall. To use Horovod with PyTorch, first install Horovod with PyTorch support, then make the Horovod-specific changes to your training script described below.
Install Horovod with PyTorch
You can use the standard pip command to install Intel® Optimization for Horovod*:
python -m pip install intel-optimization-for-horovod
Note: Make sure the Intel® oneAPI Base Toolkit is already installed. You need to activate its environment before using Horovod:
source ${HOME}/intel/oneapi/ccl/latest/env/vars.sh
Horovod with PyTorch Usage
To use Horovod with PyTorch on the XPU backend, make the following modifications to your training script:
Initialize Horovod.
import torch
import intel_extension_for_pytorch
import horovod.torch as hvd

hvd.init()
Pin each GPU to a single process.
With the typical setup of one GPU per process, set this to local rank. The first process on the server will be allocated the first GPU, the second process will be allocated the second GPU, and so forth.
devid = hvd.local_rank()
torch.xpu.set_device(devid)
Scale the learning rate by the number of workers.
Effective batch size in synchronous distributed training is scaled by the number of workers. An increase in learning rate compensates for the increased batch size.
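For example, a common pattern is to multiply the single-worker base learning rate by hvd.size() when constructing the optimizer. This is a minimal sketch; the base value of 0.01 is illustrative (not prescribed by Horovod), and it assumes the model is already built and torch.optim is imported as optim:

base_lr = 0.01  # learning rate you would use for single-worker training (example value)
optimizer = optim.SGD(model.parameters(), lr=base_lr * hvd.size())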
Wrap the optimizer in hvd.DistributedOptimizer. The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies those averaged gradients.
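For instance, assuming optimizer and model are already defined as in the full example below:

optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())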
Broadcast the initial variable states from rank 0 to all other processes:
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.
Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.
Accomplish this by guarding the model checkpointing code with hvd.rank() != 0, i.e. skip saving on every worker other than worker 0.
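A minimal sketch of such a guard (the checkpoint file name is an illustrative placeholder):

if hvd.rank() == 0:
    # Only worker 0 writes the checkpoint; all other workers skip this block.
    torch.save(model.state_dict(), "checkpoint.pt")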
Example:
import torch
import intel_extension_for_pytorch
import torch.optim as optim
import torch.nn.functional as F
import horovod.torch as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used to process local rank (one GPU per process)
devid = hvd.local_rank()
torch.xpu.set_device(devid)
device = "xpu:{}".format(devid)
# Define dataset...
train_dataset = ...
# Partition dataset among workers using DistributedSampler
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=..., sampler=train_sampler)
# Build model...
model = ...
model.to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # example base LR of 0.01, scaled by the number of workers
# Add Horovod Distributed Optimizer
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
# Broadcast parameters and optimizer state from rank 0 to all other processes.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
for epoch in range(100):
    # Ensure each epoch uses a different shuffling order across workers.
    train_sampler.set_epoch(epoch)
    for batch_idx, (data, target) in enumerate(train_loader):
        # Move the batch to the local XPU device before the forward pass.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{}]\tLoss: {}'.format(
                epoch, batch_idx * len(data), len(train_sampler), loss.item()))
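To run the training across multiple processes, you can use the horovodrun launcher that ships with Horovod. A minimal sketch that launches 2 processes on the local machine, assuming the script above is saved as train.py:

horovodrun -np 2 python train.py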