State-of-the-Art BERT Fine-Tune Training and Inference
Overview
Driven by real-life use cases ranging from medical diagnostics to financial fraud detection, deep learning’s neural networks are growing in size and complexity as more and more data becomes available to be consumed. This influx of data allows for more accurate data scoring, which can result in better AI models, but it also presents challenges to compute performance. To help keep up with the computational power required to run deep learning workloads, Google developed the BFLOAT16 format to increase performance on its Tensor Processing Units (TPUs).
BFLOAT16 uses one bit for the sign, eight for the exponent, and seven for the mantissa. Because it keeps the same eight-bit exponent, and therefore the same dynamic range, as FP32, it can represent gradients directly without the need for loss scaling. BFLOAT16 has been shown to work as well as FP32 while delivering increased performance and reduced memory usage.
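To make the trade-off concrete, here is a minimal Python sketch (assuming NumPy is available; it is not otherwise used in this guide) that truncates a float32 value to its top 16 bits, which is exactly the bfloat16 representation: the sign bit and eight-bit exponent are preserved and only mantissa precision is lost.

# Illustrative only: a bfloat16 value is a float32 with the low 16 mantissa bits dropped.
import numpy as np

x = np.array([3.1415926], dtype=np.float32)
bits = x.view(np.uint32) & np.uint32(0xFFFF0000)  # keep sign (1) + exponent (8) + mantissa (7)
x_bf16 = bits.view(np.float32)
print(x[0], "->", x_bf16[0])  # 3.1415925 -> 3.140625: same range, fewer significant digits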
Intel’s 3rd Gen Intel® Xeon® Scalable processor, featuring Intel® Deep Learning Boost, is the first general-purpose x86 CPU to support the BFLOAT16 format.
The latest Deep Learning Reference Stack (DLRS) 7.0 release integrates Intel-optimized TensorFlow, which enables BFLOAT16 support on servers with the 3rd Gen Intel® Xeon® Scalable processor. AI developers can now quickly develop, iterate on, and run BFLOAT16 models directly using the DLRS stack.
In this guide, we walk through setting up your infrastructure and deploying a Bidirectional Encoder Representations from Transformers (BERT) fine-tune training and inference workload using the DLRS containers from Intel.
Recommended Hardware
We recommend a 3rd Gen Intel® Xeon® Scalable processor to get optimal performance and to take advantage of the built-in Intel® Deep Learning Boost (Intel® DL Boost) and BFLOAT16 (BF16) extension functionality.
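Before installing anything, you can optionally confirm from Linux that the host CPU advertises the AVX-512 BF16 instructions used by Intel DL Boost; the short Python check below simply searches /proc/cpuinfo for the relevant flag and assumes Python 3 is available on the host.

# Optional host check: look for the avx512_bf16 CPU flag exposed by the Linux kernel.
with open("/proc/cpuinfo") as f:
    flags = f.read()
print("avx512_bf16 supported:", "avx512_bf16" in flags)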
Required Software
Install Ubuntu* 20.04 Linux OS on your host system
Install Docker* Engine
Steps
Download the DLRS v0.7.0 image and launch it in interactive mode
docker pull sysstacks/dlrs-tensorflow2-ubuntu:v0.7.0-intel-tf
docker run -it --shm-size 8g --security-opt seccomp=unconfined sysstacks/dlrs-tensorflow2-ubuntu:v0.7.0-intel-tf bash
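Once inside the container, you can optionally run a quick Python sanity check (for example by starting python3 at the container prompt) to confirm the TensorFlow build is present and that bfloat16 operations execute on the CPU; this is only a sketch and is not required for the steps that follow.

# Optional in-container check of the TensorFlow build and bfloat16 support.
import tensorflow as tf

print(tf.__version__)  # TensorFlow 2.x build shipped with DLRS v0.7.0
x = tf.cast(tf.ones([2, 2]), tf.bfloat16)
print(tf.matmul(x, x).dtype)  # expected: <dtype: 'bfloat16'>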
At the container shell prompt, run the following commands to install the required tools
apt-get update
apt-get install wget unzip git
Run BERT Fine-tune training with the SQuAD 1.1 data set
Download wwm_cased_L-24_H-1024_A-16.zip, which contains a pre-trained BERT model with 24 layers, 1024 hidden units, and 16 attention heads. The model was pre-trained by Google with whole word masking using WordPiece tokenization.
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_cased_L-24_H-1024_A-16.zip -P /tmp
unzip /tmp/wwm_cased_L-24_H-1024_A-16.zip -d /tmp
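If you want to verify the download, a couple of lines of Python can confirm that the unzipped model matches the description above (the path is taken from the unzip command just run):

# Optional: read the BERT config that ships with the pre-trained model.
import json

with open("/tmp/wwm_cased_L-24_H-1024_A-16/bert_config.json") as f:
    cfg = json.load(f)
print(cfg["num_hidden_layers"], cfg["hidden_size"], cfg["num_attention_heads"])  # expected: 24 1024 16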
Download the SQuAD 1.1 data set. This reading comprehension data set consists of questions posed on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding passage. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json -P /tmp
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -P /tmp
wget https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py -P /tmp
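As an optional sanity check on the downloaded files, the short Python snippet below (paths taken from the wget commands above) loads the training set and counts the articles and question-answer pairs described earlier.

# Optional: inspect the SQuAD 1.1 JSON layout and count articles and questions.
import json

with open("/tmp/train-v1.1.json") as f:
    squad = json.load(f)

articles = squad["data"]
n_questions = sum(len(p["qas"]) for a in articles for p in a["paragraphs"])
print(len(articles), "articles,", n_questions, "training questions")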
Download the Model Zoo for Intel® Architecture. The Model Zoo contains links to pre-trained models, sample scripts, best practices, and step-by-step tutorials for many popular open-source machine learning models optimized by Intel to run on Intel® Xeon® Scalable processors. We use the Model Zoo scripts to run BERT fine-tune training and inference.
git clone -b v1.6.1 https://github.com/IntelAI/models /tmp/models
Update PYTHONPATH to include the Model Zoo benchmarks directory
export PYTHONPATH=$PYTHONPATH:/tmp/models/benchmarks
Run BERT fine-tune training with the SQuAD 1.1 data set. Note that these parameters may need to be adjusted for your hardware configuration.
python3 /tmp/models/benchmarks/launch_benchmark.py \
    --model-name=bert_large \
    --precision=bfloat16 \
    --mode=training \
    --mpi_num_processes=4 \
    --mpi_num_processes_per_socket=1 \
    --num-intra-threads=22 \
    --num-inter-threads=1 \
    --output-dir=/tmp/SQuAD_1.1_fine_tune \
    --framework=tensorflow \
    --batch-size=24 \
    -- train_option=SQuAD \
    vocab_file=/tmp/wwm_cased_L-24_H-1024_A-16/vocab.txt \
    config_file=/tmp/wwm_cased_L-24_H-1024_A-16/bert_config.json \
    init_checkpoint=/tmp/wwm_cased_L-24_H-1024_A-16/bert_model.ckpt \
    do_train=True \
    train_file=/tmp/train-v1.1.json \
    do_predict=True \
    predict_file=/tmp/dev-v1.1.json \
    learning_rate=1.5e-5 \
    num_train_epochs=2 \
    warmup-steps=0 \
    doc_stride=128 \
    do_lower_case=False \
    max_seq_length=384 \
    experimental_gelu=True \
    optimized-softmax=True \
    mpi_workers_sync_gradients=True
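Because the command above runs with do_predict=True, it also produces SQuAD predictions that can be scored with the evaluate-v1.1.py script downloaded earlier. The sketch below assumes the predictions are written to the output directory as predictions.json; the exact filename and location can differ between Model Zoo versions, so adjust the paths to match your run.

# Hypothetical scoring step: run the official SQuAD 1.1 evaluator on the model's predictions.
import subprocess

subprocess.run(
    [
        "python3", "/tmp/evaluate-v1.1.py",
        "/tmp/dev-v1.1.json",                         # ground-truth dev set
        "/tmp/SQuAD_1.1_fine_tune/predictions.json",  # assumed predictions path; adjust if needed
    ],
    check=True,
)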
Run BERT Inference with the SQuAD 1.1 data set
Download the pre-trained BERT Large Model
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip -P /tmp
unzip /tmp/wwm_uncased_L-24_H-1024_A-16.zip -d /tmp
For BERT inference, we use the Intel-optimized checkpoint directly
wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6_1/bert_large_checkpoints.zip -P /tmp
unzip /tmp/bert_large_checkpoints.zip -d /tmp
Download the SQuAD 1.1 data set to the BERT Large model directory
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -P /tmp/wwm_uncased_L-24_H-1024_A-16
Run BERT inference with the following command
numactl --localalloc --physcpubind=0-27 python3 /tmp/models/benchmarks/launch_benchmark.py \
    --model-name=bert_large \
    --precision=bfloat16 \
    --mode=inference \
    --framework=tensorflow \
    --num-inter-threads 1 \
    --num-intra-threads 28 \
    --batch-size=32 \
    --data-location /tmp/wwm_uncased_L-24_H-1024_A-16 \
    --checkpoint /tmp/bert_large_checkpoints \
    --output-dir /tmp/SQuAD_1.1_inference \
    --benchmark-only
NOTICES AND DISCLAIMERS
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. © Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.