Text Classification fine-tuning using TensorFlow and the Intel® Transfer Learning Tool API

This notebook uses the tlt library to fine-tune a TensorFlow pretrained model from Hugging Face for text classification.

1. Import dependencies and setup parameters

This notebook assumes that you have already followed the instructions to set up a TensorFlow environment with all of the dependencies required to run the notebook.
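
For reference, installing the tool itself typically looks like the command below. The package name is an assumption here; check the project's installation instructions for the exact command and any framework-specific extras.

pip install intel-transfer-learning-tool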

[ ]:
import numpy as np
import os
import pandas as pd
import tensorflow as tf

# tlt imports
from tlt.datasets import dataset_factory
from tlt.models import model_factory
from tlt.utils.file_utils import download_and_extract_zip_file

# Specify a directory for the dataset to be downloaded
dataset_dir = os.environ["DATASET_DIR"] if "DATASET_DIR" in os.environ else \
    os.path.join(os.environ["HOME"], "dataset")

# Specify a directory for output
output_dir = os.environ["OUTPUT_DIR"] if "OUTPUT_DIR" in os.environ else \
    os.path.join(os.environ["HOME"], "output")

print("Dataset directory:", dataset_dir)
print("Output directory:", output_dir)

2. Get the model

In this step, we call the Intel Transfer Learning Tool model factory to list supported TensorFlow text classification models. This is a list of pretrained models from Hugging Face that we have tested with our API. Optionally, the verbose=True argument can be added to the print_supported_models function call to get more information about each model (such as the model hub, the original dataset, etc.).

[ ]:
# See a list of available text classification models
model_factory.print_supported_models(use_case='text_classification', framework='tensorflow')

Use the Intel Transfer Learning Tool model factory to get one of the models listed in the previous cell. The get_model function returns a TLT model object that will later be used for training.

[ ]:
model_name = "google_bert_uncased_L-2_H-128_A-2"
framework = "tensorflow"

model = model_factory.get_model(model_name, framework)

print("Model name:", model.model_name)
print("Framework:", model.framework)
print("Use case:", model.use_case)

3. Get the dataset

Option A: Use your own dataset

This option allows you to use your own text classification dataset from a .csv file. The dataset factory expects text classification .csv files to have two columns, where the first column is the label and the second column is the text/sentence to classify.

For example, the contents of a comma-separated value file should look similar to this:

<label>,<text>
<label>,<text>
<label>,<text>
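
For instance, a two-class sentiment file could contain rows like these (illustrative only, not taken from a real dataset):

positive,A beautifully shot and moving film
negative,The plot was predictable and the pacing dull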

If the .csv has more columns, the select_cols or exclude_cols parameters can be used to control which columns are parsed.
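
For example, here is a sketch of loading a csv file that has an extra id column. The directory, file name, and class names below are hypothetical, and it is assumed that select_cols takes a list of column indices; check the TLT API documentation for the exact semantics.

# Hypothetical sketch: parse only the label (column 0) and text (column 2),
# skipping an extra id column at index 1
my_dataset = dataset_factory.load_dataset(my_data_directory, "text_classification", "tensorflow",
                                          csv_file_name="my_data.csv", class_names=["negative", "positive"],
                                          select_cols=[0, 2], delimiter=",", header=True)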

This example downloads the SMS Spam Collection dataset, which is distributed as a tab-separated value file inside a .zip file. The dataset contains SMS text messages that are labeled as either ham or spam. The first column in the data file has the label (ham or spam) and the second column is the text of the SMS message. (Note: Please see this dataset’s applicable license for terms and conditions. Intel Corporation does not own the rights to this data set and does not confer any rights to it.)

When using your own dataset, update the path to your dataset directory, as well as the other variables that describe the dataset: the csv file name, class names, delimiter, header, and the map function (if string labels need to be translated into numerical values).

[ ]:
zip_file_url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
sms_data_directory = os.path.join(dataset_dir, "sms_spam_collection")
csv_file_name = "SMSSpamCollection"

# If the SMS Spam collection csv file is not found, download and extract the file:
if not os.path.exists(os.path.join(sms_data_directory, csv_file_name)):
    # Download the zip file with the SMS Spam collection dataset
    download_and_extract_zip_file(zip_file_url, sms_data_directory)

    # Print list of files that we have in our dataset directory
    print(os.listdir(sms_data_directory))

# Specify the class names for the dataset being used
class_names = ["ham", "spam"]

# Specify the delimiter for the csv file
delimiter = "\t"

# Specify if the csv file has a header row that should be skipped when parsing the dataset
header = False

# Function to map the string label from the dataset to a numerical value
def label_map_func(x):
    return int(x == "spam")

After the dataset has been downloaded and extracted, use the dataset factory to load the dataset. The load_dataset method takes the dataset directory, the use case, and the framework, along with the csv parsing parameters defined above.

[ ]:
dataset = dataset_factory.load_dataset(sms_data_directory, "text_classification", "tensorflow",
                                       csv_file_name=csv_file_name, class_names=class_names,
                                       label_map_func=label_map_func, delimiter=delimiter, header=header)

print(dataset.info)
print("\nClass names:", str(dataset.class_names))

Skip ahead to step 4, Prepare the dataset, to continue using your own dataset.

Option B: Use the TensorFlow datasets catalog

Option B allows for using a dataset from the TensorFlow Datasets (TFDS) catalog. The dataset factory currently supports the following TFDS text classification datasets: imdb_reviews, glue/sst2, glue/cola, and ag_news_subset.

[ ]:
# Supported datasets: imdb_reviews, glue/sst2, glue/cola, ag_news_subset
dataset_name = "ag_news_subset"
dataset = dataset_factory.get_dataset(dataset_dir, model.use_case, model.framework, dataset_name,
                                      dataset_catalog="tf_datasets", shuffle_files=True)

print(dataset.info)
print("\nClass names:", str(dataset.class_names))

4. Prepare the dataset

Once you have your dataset from Option A or Option B above, use the following cell to split and preprocess the data. We split the dataset into training and validation subsets and then batch the examples.

[ ]:
# Create splits for training and validation and batch the dataset
dataset.shuffle_split(train_pct=0.75, val_pct=0.25)
dataset.preprocess(batch_size=32)

5. Fine-tuning

The TLT model’s train function is called with the dataset that was just prepared, along with an output directory for checkpoints, and the number of training epochs.

Mixed precision uses both 16-bit and 32-bit floating point types to make training run faster and use less memory. It is recommended to enable auto mixed precision training when running on platforms that support bfloat16 (Intel third or fourth generation Xeon processors). If it is enabled on a platform that does not support bfloat16, it can be detrimental to the training performance.

With the do_eval parameter set to True by default, this step will also show how the model can be evaluated. The model’s evaluate function returns a list of metrics calculated from the dataset’s validation subset.

Arguments

Required

  • dataset (TextClassificationDataset): Dataset to use when training the model

  • output_dir (str): Path to a writeable directory for checkpoint files

  • epochs (int): Number of epochs to train the model (default: 1)

Optional

  • initial_checkpoints (str): Path to checkpoint weights to load. If the path provided is a directory, the latest checkpoint will be used.

  • early_stopping (bool): Enable early stopping if convergence is reached while training at the end of each epoch. (default: False)

  • lr_decay (bool): If lr_decay is True and do_eval is True, learning rate decay on the validation loss is applied at the end of each epoch.

  • enable_auto_mixed_precision (bool or None): Enable auto mixed precision for training. Mixed precision uses both 16-bit and 32-bit floating point types to make training run faster and use less memory. It is recommended to enable auto mixed precision training when running on platforms that support bfloat16 (Intel third or fourth generation Xeon processors). If it is enabled on a platform that does not support bfloat16, it can be detrimental to the training performance. If enable_auto_mixed_precision is set to None, auto mixed precision will be automatically enabled when running with Intel fourth generation Xeon processors, and disabled for other platforms.

  • extra_layers (list[int]): Optionally insert additional dense layers between the base model and output layer. This can help increase accuracy when fine-tuning a pretrained model. The input should be a list of integers representing the number and size of the layers, for example [1024, 512] will insert two dense layers, the first with 1024 neurons and the second with 512 neurons.

Note: refer to the release documentation for an up-to-date list of train arguments and their current descriptions.
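
For example, here is a sketch of a train call that combines several of the optional arguments above (the epoch count and layer sizes are illustrative, not tuned recommendations):

# Illustrative sketch: train with early stopping enabled and two extra
# dense layers inserted between the base model and the output layer
history = model.train(dataset, output_dir, epochs=3,
                      early_stopping=True,
                      extra_layers=[1024, 512])

The cell below runs the actual training for this notebook: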

[ ]:
# If enable_auto_mixed_precision is set to None, auto mixed precision will be automatically enabled when running
# with Intel fourth generation Xeon processors, and disabled for other platforms.
enable_auto_mixed_precision = None

history = model.train(dataset, output_dir, epochs=1, enable_auto_mixed_precision=enable_auto_mixed_precision)

Evaluate the trained model:

[ ]:
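# Get the metric names from the underlying Keras model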
metrics_names = model._model.metrics_names
metrics = model.evaluate(dataset, enable_auto_mixed_precision)

for metric_name, metric_value in zip(metrics_names, metrics):
    print("{}: {}".format(metric_name, metric_value))

6. Predict

The model’s predict function can be called with a batch of data from the dataset.

[ ]:
# Get a single batch from the dataset object
data_batch, labels = dataset.get_batch()

# Call predict using the batch
batch_predictions = model.predict(data_batch, enable_auto_mixed_precision=enable_auto_mixed_precision)

# Maximum number of rows to show in the data frame
max_items = 10
num_classes = len(dataset.class_names)
# Collect the sentence text, score, and actual label for the batch
prediction_list = []
for i, (text, actual_label) in enumerate(zip(data_batch, labels)):
    sentence = text.numpy().decode('utf-8')
    score = batch_predictions[i]
    if num_classes == 2:
        prediction = float(score)
    else:
        prediction = float(np.argmax(score))

    prediction_list.append([sentence,
                            max(tf.get_static_value(score)),
                            dataset.get_str_label(prediction),
                            dataset.get_str_label(int(actual_label.numpy()))])
    if i + 1 >= max_items:
        break

# Display the results using a data frame
result_df = pd.DataFrame(prediction_list, columns=["Input Text", "Prediction Score", "Prediction", "Actual Label"])
# Center the column headers and hide the index
result_df.style.set_table_styles([{'selector': 'th', 'props': [('text-align', 'center')]}]).hide(axis="index")

Raw text can also be passed to the predict function.

[ ]:
score = model.predict("Awesome movie!")

if num_classes == 2:
    result = float(score)
else:
    result = float(np.argmax(score))

print("Predicted score:", np.max(score))
print("Predicted label:", dataset.get_str_label(float(result)))

7. Export the saved model

Next, we can call the Intel Transfer Learning Tool model export function to generate a saved_model.pb. The model is saved in a format that is ready to use with TensorFlow Serving. Each time the model is exported, a new numbered directory is created, which allows serving to pick up the latest model.

[ ]:
saved_model_dir = model.export(output_dir)
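
As a quick sanity check, the exported model can be reloaded with the standard TensorFlow SavedModel API (this uses plain TensorFlow, not the TLT API):

# Reload the exported SavedModel and list its serving signatures
reloaded = tf.saved_model.load(saved_model_dir)
print(reloaded.signatures)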

8. Quantization

In this section, the Intel Transfer Learning Tool API uses Intel® Neural Compressor (INC) to quantize the model to get optimal inference performance.

We use Intel Neural Compressor to benchmark the full precision model to see how it performs, as our baseline.

Note: there is a known issue when running Intel Neural Compressor from a notebook where you may sometimes see the error zmq.error.ZMQError: Address already in use. If you see this error, rerun the cell.

[ ]:
results = model.benchmark(dataset)

Next, we use Intel Neural Compressor to automatically search for the optimal quantization recipe for low-precision model inference. Running post-training quantization may take several minutes.

[ ]:
inc_output_dir = os.path.join(output_dir, 'quantized_models', model.model_name,
                              os.path.basename(saved_model_dir))
model.quantize(inc_output_dir, dataset)

Let’s benchmark using the quantized model, so that we can compare the performance to the full precision model that was originally benchmarked.

[ ]:
quantized_results = model.benchmark(dataset=dataset, saved_model_dir=inc_output_dir)

Let’s also inspect the disk size of the pre- and post-quantization model files:

[ ]:
print('The size of the un-compressed model:')
!du -h {saved_model_dir}
[ ]:
print('The size of the compressed model:')
!du -h {inc_output_dir}

Citations

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

@misc{zhang2015characterlevel,
    title={Character-level Convolutional Networks for Text Classification},
    author={Xiang Zhang and Junbo Zhao and Yann LeCun},
    year={2015},
    eprint={1509.01626},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

@misc{misc_sms_spam_collection_228,
  author       = {Almeida, Tiago},
  title        = {{SMS Spam Collection}},
  year         = {2012},
  howpublished = {UCI Machine Learning Repository}
}

Please see this dataset’s applicable license for terms and conditions. Intel Corporation does not own the rights to this data set and does not confer any rights to it.