Text Classification fine-tuning using PyTorch and the Intel® Transfer Learning Tool API

This notebook uses the tlt library to fine-tune a Hugging Face pretrained model for text classification.

1. Import dependencies and setup parameters

This notebook assumes that you have already followed the instructions to set up a PyTorch environment with all the dependencies required to run the notebook.
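
If your environment is not yet set up, a minimal installation sketch is shown below. It assumes the tool is published on PyPI under the package name intel-transfer-learning-tool; refer to the project's installation instructions for the authoritative steps.

[ ]:
# Minimal setup sketch; the package name is an assumption, see the install docs
!pip install intel-transfer-learning-tool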

[ ]:
import numpy as np
import os
import pandas as pd

# tlt imports
from tlt.datasets import dataset_factory
from tlt.models import model_factory
from tlt.utils.file_utils import download_and_extract_zip_file

# Specify a directory for the dataset to be downloaded
dataset_dir = os.environ.get("DATASET_DIR", os.path.join(os.environ["HOME"], "dataset"))

# Specify a directory for output
output_dir = os.environ.get("OUTPUT_DIR", os.path.join(os.environ["HOME"], "output"))

print("Dataset directory:", dataset_dir)
print("Output directory:", output_dir)

2. Get the model

In this step, we call the Intel Transfer Learning Tool model factory to list supported Hugging Face text classification models. This is a list of pretrained models from Hugging Face that we have tested with our API. Optionally, the verbose=True argument can be added to the print_supported_models() function call to get more information about each model (such as links to Hugging Face, the original dataset, etc.).

[ ]:
# See a list of available text classification models
model_factory.print_supported_models(use_case='text_classification', framework='pytorch')
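
As described above, passing verbose=True prints additional details for each model:

[ ]:
# Same listing, with extra information about each model
model_factory.print_supported_models(use_case='text_classification', framework='pytorch', verbose=True)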

Use the Intel Transfer Learning Tool model factory to get one of the models listed in the previous cell. The get_model function returns a TLT model object that will later be used for training.

[ ]:
model_name = "bert-base-cased"
framework = "pytorch"

model = model_factory.get_model(model_name, framework)

print("Model name:", model.model_name)
print("Framework:", model.framework)
print("Use case:", model.use_case)

3. Get the dataset

Option A: Use your own dataset

This option allows for using your own text classification dataset from a .csv file. The dataset factory will expect text classification .csv files to have two columns where the first column is the label and the second column is the text/sentence to classify.

For example, the contents of a comma separated value file should look similar to this:

<label>,<text>
<label>,<text>
<label>,<text>

If the .csv has more columns, the select_cols or exclude_cols parameters can be used to control which columns are parsed.
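
For example, if a .csv also contained an id column, a call like the sketch below could skip it when parsing. The exclude_cols parameter comes from the description above, but the file name, class names, and column layout here are hypothetical:

[ ]:
# Hypothetical: a csv with an extra "id" column that is excluded from parsing
# dataset = dataset_factory.load_dataset(dataset_dir=dataset_dir, use_case="text_classification",
#                                        framework="pytorch", csv_file_name="my_data.csv",
#                                        class_names=["neg", "pos"], delimiter=",", header=None,
#                                        column_names=["id", "label", "text"], exclude_cols=["id"])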

This example downloads the SMS Spam Collection dataset, which contains a tab separated value file inside the .zip file. The dataset consists of SMS text messages labeled as either ham or spam. The first column in the data file has the label (ham or spam) and the second column is the text of the SMS message. (Note: Please see this dataset’s applicable license for terms and conditions. Intel Corporation does not own the rights to this data set and does not confer any rights to it.)

When using your own dataset, update the path to your dataset directory, as well as the other variables with properties of the dataset, such as the csv file name, class names, delimiter, header, and the map function (if string labels need to be translated into numerical values).
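
As an illustration of the map function mentioned above, translating string labels into numerical values could look like the sketch below. The label_map_func parameter name is an assumption; check the load_dataset documentation for the exact argument:

[ ]:
# Hypothetical map function: translate string labels into integers
# (the label_map_func argument name is an assumption)
def label_map_func(label):
    return int(label == "spam")  # "ham" -> 0, "spam" -> 1

# It could then be passed along: dataset_factory.load_dataset(..., label_map_func=label_map_func)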

[ ]:
# Modify the variables below to use a different dataset or a csv file on your local system.
# The csv_name variable should point to a csv file with 2 columns (the label and the text)
dataset_url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
dataset_dir = os.path.join(dataset_dir, "sms_spam_collection")
csv_name = "SMSSpamCollection"
delimiter = "\t"
label_names = ["ham", "spam"]

# Rename the file to include the csv extension so that the dataset API knows how to load the file
renamed_csv = "{}.csv".format(csv_name)
print(renamed_csv)

# If we don't already have the csv file, download and extract the zip file to get it.
if not os.path.exists(os.path.join(dataset_dir, csv_name)) and \
        not os.path.exists(os.path.join(dataset_dir, renamed_csv)):
    download_and_extract_zip_file(dataset_url, dataset_dir)

if not os.path.exists(os.path.join(dataset_dir, renamed_csv)):
    os.rename(os.path.join(dataset_dir, csv_name), os.path.join(dataset_dir, renamed_csv))
[ ]:
dataset = dataset_factory.load_dataset(dataset_dir=dataset_dir, use_case="text_classification",
                                       framework="pytorch", csv_file_name=renamed_csv, class_names=label_names,
                                       column_names=["label", "text"], delimiter=delimiter, header=None)

print(dataset.info)
print("\nClass names:", str(dataset.class_names))

To continue using your own dataset, skip ahead to step 4. Prepare the dataset.

Option B: Use the Hugging Face catalog

Option B allows for using a dataset from the Hugging Face datasets catalog. Currently supported datasets:

  • imdb

  • tweet_eval (emoji, emotion, hate, irony, offensive, sentiment, stance_abortion, stance_atheism, stance_climate, stance_feminist, stance_hillary)

  • rotten_tomatoes

  • ag_news

  • sst2

[ ]:
dataset_name = "tweet_eval/sentiment"
dataset = dataset_factory.get_dataset(dataset_dir, model.use_case, model.framework, dataset_name,
                                      dataset_catalog="huggingface", shuffle_files=True)

print(dataset.info)
print("\nClass names:", str(dataset.class_names))

4. Prepare the dataset

Once you have your dataset from Option A or Option B above, use the following cell to preprocess the dataset. The dataset is batched and then split into subsets for training and validation.

[ ]:
# Batch the dataset and create splits for training and validation
dataset.preprocess(model_name, batch_size=32)
dataset.shuffle_split(train_pct=0.75, val_pct=0.25)
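
If a held-out test subset is also needed, the percentages can be adjusted. The sketch below assumes shuffle_split also accepts a test_pct argument; verify against the API documentation:

[ ]:
# Hypothetical three-way split; the test_pct argument is an assumption
# dataset.shuffle_split(train_pct=0.70, val_pct=0.15, test_pct=0.15)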

5. Fine-tuning

The Intel Transfer Learning Tool model’s train function is called with the dataset that was just prepared, along with an output directory for checkpoints, and the number of training epochs.

With the do_eval parameter set to True by default, this step will also show how the model can be evaluated. The model’s evaluate function returns a list of metrics calculated from the dataset’s validation subset.

Arguments

Required

  • dataset (TextClassificationDataset, required): Dataset to use when training the model

  • output_dir (str): Path to a writeable directory for checkpoint files

  • epochs (int): Number of epochs to train the model (default: 1)

Optional

  • initial_checkpoints (str): Path to checkpoint weights to load. If the path provided is a directory, the latest checkpoint will be used.

  • early_stopping (bool): Enable early stopping if convergence is reached while training at the end of each epoch. (default: False)

  • lr_decay (bool): If lr_decay is True and do_eval is True, learning rate decay on the validation loss is applied at the end of each epoch.

  • extra_layers (list[int]): Optionally insert additional dense layers between the base model and output layer. This can help increase accuracy when fine-tuning. The input should be a list of integers representing the number and size of the layers; for example, [1024, 512] will insert two dense layers, the first with 1024 neurons and the second with 512 neurons.

  • use_trainer (bool): If use_trainer is True, the model training is done using the Hugging Face Trainer; if use_trainer is False, the model training is done using a native PyTorch training loop.

  • enable_auto_mixed_precision (bool or None): Enable auto mixed precision for training. Mixed precision uses both 16-bit and 32-bit floating point types to make training run faster and use less memory. It is recommended to enable auto mixed precision training when running on platforms that support bfloat16 (Intel third or fourth generation Xeon processors). If it is enabled on a platform that does not support bfloat16, it can be detrimental to the training performance. If enable_auto_mixed_precision is set to None, auto mixed precision will be automatically enabled when running with Intel fourth generation Xeon processors, and disabled for other platforms. Defaults to None.

Note: refer to release documentation for an up-to-date list of train arguments and their current descriptions
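
For reference, here is a sketch that combines several of the optional arguments described above; the values are illustrative, not recommendations:

[ ]:
# Illustrative combination of optional train arguments (values are arbitrary)
# history = model.train(dataset, output_dir, epochs=3,
#                       early_stopping=True,       # stop early if validation converges
#                       extra_layers=[1024, 512],  # insert two extra dense layers
#                       use_trainer=False)         # use the native PyTorch training loop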

[ ]:
enable_auto_mixed_precision = None

history = model.train(dataset, output_dir, epochs=1, ipex_optimize=True, use_trainer=False,
                      enable_auto_mixed_precision=enable_auto_mixed_precision)

A complete model summary can be printed for all modules in case any need to be unfrozen:

[ ]:
model.list_layers(verbose=True)

Layers can be unfrozen by passing their string names, such as the following:

[ ]:
model.unfreeze_layer("features") # Unfreezes the features layers
model.list_layers(verbose=True)
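
If the API provides a matching freeze_layer method (an assumption; only unfreeze_layer is shown above), layers could be frozen again the same way:

[ ]:
# Hypothetical counterpart to unfreeze_layer; verify that it exists in your version
# model.freeze_layer("features")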

6. Predict

The model’s predict function can be called with a batch of data from the dataset.

[ ]:
# Get a single batch from the dataset object
data_batch = dataset.get_batch()

# Call predict using the batch
batch_predictions = model.predict(data_batch, enable_auto_mixed_precision=enable_auto_mixed_precision)

# Maximum number of rows to show in the data frame
max_items = 10
# Collect the sentence text, score, and actual label for the batch
prediction_list = []

for i, tensor in enumerate(data_batch['input_ids']):
    sentence = dataset.get_text(tensor)[0]
    score = batch_predictions[i]
    prediction_list.append([sentence,
                            dataset.get_str_label(float(score)),
                            dataset.get_str_label(float(data_batch['label'][i]))])
    if i + 1 >= max_items:
        break

# Display the results using a data frame
result_df = pd.DataFrame(prediction_list, columns=["Input Text", "Predicted Label", "Actual Label"])
result_df.style.hide(axis="index")

Predict on Text

Raw text can also be passed to the predict function.

[ ]:
result = model.predict("Good movie")

print("Predicted score:", float(result))
print("Predicted label:", dataset.get_str_label(float(result)))

7. Export the saved model

Lastly, we can call the model export function to save the trained model to disk. Each time the model is exported, a new numbered directory is created, which allows serving to pick up the latest model.

[ ]:
saved_model_dir = model.export(output_dir)
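
The exported model can later be reloaded for further use. The sketch below assumes the model factory exposes a load_model function with these arguments; check the API reference for the exact signature:

[ ]:
# Hypothetical reload of the exported model; the signature is an assumption
# reloaded_model = model_factory.load_model(model_name=model.model_name, model=saved_model_dir,
#                                           framework="pytorch", use_case="text_classification")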

8. Quantization

In this section, the Intel Transfer Learning Tool API uses Intel® Neural Compressor (INC) to quantize the model to get optimal inference performance.

We use the Intel Neural Compressor config to benchmark the full precision model to see how it performs, as our baseline.

Note that there is a known issue when running Intel Neural Compressor from a notebook: you may sometimes see the error zmq.error.ZMQError: Address already in use. If you see this error, rerun the cell.

[ ]:
result = model.benchmark(dataset)

Next we use Intel Neural Compressor to automatically search for the optimal quantization recipe for low-precision model inference within the accuracy loss constraints defined in the config. Running post-training quantization may take several minutes, depending on your hardware and the exit policy (timeout and max trials).

[ ]:
inc_output_dir = os.path.join(output_dir, 'quantized_models', model.model_name,
                              os.path.basename(saved_model_dir))
model.quantize(inc_output_dir, dataset)

Let’s benchmark using the quantized model, so that we can compare the performance to the full precision model that was originally benchmarked.

[ ]:
quantized_result = model.benchmark(dataset=dataset, saved_model_dir=inc_output_dir)
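
The baseline and quantized results can then be printed side by side for a quick comparison (the structure of the returned objects depends on the Intel Neural Compressor version):

[ ]:
# Quick comparison of the FP32 baseline and the INT8 quantized benchmark results
print("FP32 baseline: ", result)
print("INT8 quantized:", quantized_result)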

You can inspect the disk size of the pre- and post-quantization model files:

[ ]:
print('The size of the un-compressed model:')
!du -h {saved_model_dir}
[ ]:
print('The size of the compressed model:')
!du -h {inc_output_dir}

Citations

@inproceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

@inproceedings{rosenthal2017semeval,
  title={SemEval-2017 task 4: Sentiment analysis in Twitter},
  author={Rosenthal, Sara and Farra, Noura and Nakov, Preslav},
  booktitle={Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017)},
  pages={502--518},
  year={2017}
}

@misc{misc_sms_spam_collection_228,
  author       = {Almeida, Tiago},
  title        = {{SMS Spam Collection}},
  year         = {2012},
  howpublished = {UCI Machine Learning Repository}
}

Please see this dataset’s applicable license for terms and conditions. Intel Corporation does not own the rights to this data set and does not confer any rights to it.