State-of-the-Art BERT Fine-Tune Training and Inference

Overview

Driven by real-world use cases ranging from medical diagnostics to financial fraud detection, deep learning's neural networks are growing in size and complexity as more and more data becomes available to consume. This influx of data enables more accurate models, but it also drives up the compute required. To help keep up with the computational power needed to run deep learning workloads, Google introduced the BFLOAT16 format to increase performance on its Tensor Processing Units (TPUs).

BFLOAT16 uses one bit for the sign, eight for the exponent, and seven for the mantissa. Because it keeps the same 8-bit exponent, and therefore the same dynamic range, as FP32, it can represent gradients directly without the need for loss scaling. BFLOAT16 has been shown to train models to the same accuracy as FP32 while delivering higher performance and lower memory usage.
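
To make the layout concrete, the short Python sketch below (an illustration only, not how any framework implements the conversion) shows that a BFLOAT16 value is essentially the upper 16 bits of the corresponding FP32 value: the sign and 8-bit exponent are preserved, so the dynamic range matches FP32, while mantissa precision is reduced.

    import struct

    def bf16_bits(value):
        """Return the 16 BFLOAT16 bits (1 sign, 8 exponent, 7 mantissa) of a float.

        Truncation is used here for clarity; real hardware typically rounds to nearest-even.
        """
        fp32_bits = struct.unpack("<I", struct.pack("<f", value))[0]
        return fp32_bits >> 16  # keep the top 16 of the 32 FP32 bits

    for x in (3.140625, 1e-30, 3.0e38):
        print(f"{x:>12}: {bf16_bits(x):016b}")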

Intel’s 3rd Gen Intel® Xeon® Scalable processor, featuring Intel® Deep Learning Boost, is the first general-purpose x86 CPU to support the BFLOAT16 format.

The latest Deep Learning Reference Stack (DLRS) 7.0 release integrates Intel-optimized TensorFlow, which enables BFLOAT16 support on servers with the 3rd Gen Intel® Xeon® Scalable processor. AI developers can now quickly develop, iterate, and run BFLOAT16 models directly by using the DLRS stack.

In this guide we walk through setting up your infrastructure and deploying a Bidirectional Encoder Representations from Transformers (BERT) fine-tune training and inference workload using the DLRS containers from Intel.

Required Software

  • Install Ubuntu* 20.04 Linux OS on your host system

  • Install Docker* Engine

Steps

  1. Download the DLRS v0.7.0 Image and launch it in interactive mode

    docker pull sysstacks/dlrs-tensorflow2-ubuntu:v0.7.0-intel-tf
    docker run -it --shm-size 8g --security-opt seccomp=unconfined sysstacks/dlrs-tensorflow2-ubuntu:v0.7.0-intel-tf bash
    
  2. At the container shell prompt, run the following commands to install the required tools

    apt-get update
    apt-get install wget unzip git
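
     Optionally, confirm that Python, TensorFlow, and the bfloat16 dtype are usable inside the container before moving on. A minimal check, run with python3 at the container prompt:

    import tensorflow as tf

    print("TensorFlow version:", tf.__version__)
    # A simple cast to bfloat16; this confirms the dtype is available in this build
    print(tf.cast(tf.constant([1.5, 2.5]), tf.bfloat16))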
    

Run BERT Fine-Tune Training with the SQuAD 1.1 Dataset

  1. Download wwm_cased_L-24_H-1024_A-16.zip, which contains a pre-trained BERT model with 24 layers, 1024 hidden units, and 16 attention heads. The model was pre-trained by Google using Whole Word Masking with WordPiece tokenization.

    wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip -P /tmp
    unzip /tmp/wwm_cased_L-24_H-1024_A-16.zip -d /tmp
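
     The archive unpacks into /tmp/wwm_cased_L-24_H-1024_A-16. A quick, optional check that the files the training command below expects (vocab.txt, bert_config.json, and the bert_model.ckpt checkpoint files) are in place:

    import os

    model_dir = "/tmp/wwm_cased_L-24_H-1024_A-16"
    # Expect vocab.txt, bert_config.json, and bert_model.ckpt.* in this listing
    print(sorted(os.listdir(model_dir)))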
    
  2. Download the SQuAD 1.1 dataset and the official evaluation script. This reading comprehension dataset consists of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding passage. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.

    wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json -P /tmp
    wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -P /tmp
    wget https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py -P /tmp
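
     The SQuAD files are plain JSON with a top-level "data" list of articles, each holding "paragraphs" whose "qas" entries are the question-answer pairs. If you want to sanity-check the download, a small script such as the following prints the article and question counts:

    import json

    def summarize(path):
        with open(path) as f:
            squad = json.load(f)
        articles = squad["data"]
        questions = sum(len(p["qas"]) for a in articles for p in a["paragraphs"])
        print(f"{path}: {len(articles)} articles, {questions} questions")

    summarize("/tmp/train-v1.1.json")
    summarize("/tmp/dev-v1.1.json")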
    
  3. Download the Model Zoo for Intel® Architecture. The Model Zoo contains links to pre-trained models, sample scripts, best practices, and step-by-step tutorials for many popular open-source machine learning models optimized by Intel to run on Intel® Xeon® Scalable processors. We use its launch_benchmark.py script to run BERT fine-tune training and inference.

    git clone -b v1.6.1 https://github.com/IntelAI/models /tmp/models
    
  4. Add the Model Zoo benchmarks directory to PYTHONPATH

    export PYTHONPATH=$PYTHONPATH:/tmp/models/benchmarks
    
  5. Run BERT fine-tune training with the SQuAD 1.1 dataset. In the command below, the arguments after the bare -- are passed through to the model script; the MPI, thread, and batch-size settings shown here may need to be adjusted to match your hardware.

    python3 /tmp/models/benchmarks/launch_benchmark.py \
    --model-name=bert_large \
    --precision=bfloat16 \
    --mode=training \
    --mpi_num_processes=4 \
    --mpi_num_processes_per_socket=1 \
    --num-intra-threads=22 \
    --num-inter-threads=1 \
    --output-dir=/tmp/SQuAD_1.1_fine_tune \
    --framework=tensorflow \
    --batch-size=24 \
    -- train_option=SQuAD \
    vocab_file=/tmp/wwm_cased_L-24_H-1024_A-16/vocab.txt \
    config_file=/tmp/wwm_cased_L-24_H-1024_A-16/bert_config.json \
    init_checkpoint=/tmp/wwm_cased_L-24_H-1024_A-16/bert_model.ckpt \
    do_train=True \
    train_file=/tmp/train-v1.1.json \
    do_predict=True \
    predict_file=/tmp/dev-v1.1.json \
    learning_rate=1.5e-5  \
    num_train_epochs=2 \
    warmup-steps=0 \
    doc_stride=128 \
    do_lower_case=False \
    max_seq_length=384 \
    experimental_gelu=True \
    optimized-softmax=True \
    mpi_workers_sync_gradients=True
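
     When training completes, do_predict=True makes the script generate answers for the dev set, which the evaluate-v1.1.py script downloaded in step 2 can score for exact match and F1. The prediction file path below is an assumption (run_squad-style scripts typically write predictions.json under their output directory); adjust it to the location reported in your training log:

    python3 /tmp/evaluate-v1.1.py /tmp/dev-v1.1.json /tmp/SQuAD_1.1_fine_tune/predictions.json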
    
  6. Run BERT inference with the SQuAD 1.1 dataset

    • Download the pre-trained BERT Large Model

      wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip -P /tmp
      unzip /tmp/wwm_uncased_L-24_H-1024_A-16.zip -d /tmp
      
    • For BERT inference, we use the Intel-optimized checkpoint directly

      wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6_1/bert_large_checkpoints.zip -P /tmp
      unzip /tmp/bert_large_checkpoints.zip -d /tmp
      
  7. Download the SQuAD 1.1 dev set into the BERT Large model directory

    wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -P /tmp/wwm_uncased_L-24_H-1024_A-16
    
  8. Run BERT inference with the following command. The numactl binding (--physcpubind=0-27) assumes 28 physical cores on one socket; adjust the range to match your system.

    numactl --localalloc --physcpubind=0-27 python3 /tmp/models/benchmarks/launch_benchmark.py \
    --model-name=bert_large \
    --precision=bfloat16 \
    --mode=inference \
    --framework=tensorflow \
    --num-inter-threads=1 \
    --num-intra-threads=28 \
    --batch-size=32 \
    --data-location=/tmp/wwm_uncased_L-24_H-1024_A-16 \
    --checkpoint=/tmp/bert_large_checkpoints \
    --output-dir=/tmp/SQuAD_1.1_inference \
    --benchmark-only
    

NOTICES AND DISCLAIMERS

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.