Data Augmentation
============

1. [Introduction](#introduction)

2. [Getting Started](#getting-started)

    2.1. [Install Dependency](#install-dependency)

    2.2. [Install Intel Extension for Transformers](#install-intel-extension-for-transformers)

3. [Data Augmentation](#data-augmentation)

    3.1. [Script](#script)

    3.2. [Parameters of Data Augmentation](#parameters-of-data-augmentation)

    3.3. [Supported Augmenter](#supported-augmenter)

    3.4. [Text Generation Augmenter](#text-generation-augmenter)

    3.5. [Augmenter Arguments](#augmenter-arguments)

## Introduction
Data Augmentation is a tool for augmenting NLP datasets in machine learning projects. It integrates [nlpaug](https://github.com/makcedward/nlpaug) with other methods from Intel Labs.

## Getting Started
### Install Dependency
```bash
pip install nlpaug
pip install transformers
```

### Install Intel Extension for Transformers
```bash
git clone https://github.com/intel/intel-extension-for-transformers.git itrex
cd itrex
pip install -r requirements.txt
pip install -v .
```

## Data Augmentation
### Script
For a complete example, see [tests/test_data_augmentation.py](tests/test_data_augmentation.py).
```python
import os

from datasets import load_dataset
from intel_extension_for_transformers.utils.data_augmentation import DataAugmentation

result_path = "./results"  # directory for the augmented file; adjust as needed
os.makedirs(result_path, exist_ok=True)

aug = DataAugmentation(augmenter_type="TextGenerationAug")
aug.input_dataset = "dev.csv"
aug.output_path = os.path.join(result_path, "test1.csv")
aug.augmenter_arguments = {'model_name_or_path': 'gpt2-medium'}
aug.data_augment()
# The augmented file is tab-delimited; load it back to inspect the result.
raw_datasets = load_dataset("csv", data_files=aug.output_path, delimiter="\t", split="train")
print(len(raw_datasets))
```
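
The script above reads `dev.csv`, so such a file has to exist first. The snippet below is a minimal sketch of creating one with the standard library. It assumes that the input file is tab-delimited (matching the `delimiter="\t"` used to load the output) and that it has a `label` and a `sentence` column (`sentence` is the default of `column_names`). The sample rows and the `write_input_file` helper are hypothetical.

```python
import csv

# Hypothetical toy samples: (label, sentence) pairs.
samples = [
    (1, "a charming and often affecting journey"),
    (0, "the plot is thin and the pacing drags"),
]


def write_input_file(path, rows):
    """Write rows as a tab-delimited file with `label` and `sentence` columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["label", "sentence"])  # header row (assumed layout)
        writer.writerows(rows)


write_input_file("dev.csv", samples)
```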

### Parameters of Data Augmentation
|Parameter |Type |Description |Default value |
|:---------|:----|:-----------|:-------------|
|augmenter_type|String|Augmentation type|NA|
|input_dataset|String|Dataset name, or path to a csv or json file|None|
|output_path|String|Path and file name of the saved augmented data|"save_path/augmented_dataset.csv"|
|data_config_or_task_name|String|Task name of a GLUE dataset, or data configuration name|None|
|augmenter_arguments|Dict|Parameters for the augmenter. Each augmenter has its own parameters|None|
|column_names|String|The column to augment; used when the data is loaded with the Python package datasets|"sentence"|
|split|String|Dataset split to augment, e.g. 'validation' or 'train'|"validation"|
|num_samples|Integer|The number of augmented samples to generate|1|
|device|String|Deployment device, "cuda" or "cpu"|1|

### Supported Augmenter
|augmenter_type |augmenter_arguments |Default value |
|:--------------|:-------------------|:-------------|
|"TextGenerationAug"|Refer to the [Text Generation Augmenter](#text-generation-augmenter) section in this document|NA|
|"KeyboardAug"|Refer to ["KeyboardAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/keyboard.py#L46)|NA|
|"OcrAug"|Refer to ["OcrAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/char/ocr.py#L38)|NA|
|"SpellingAug"|Refer to ["SpellingAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/word/spelling.py#L49)|NA|
|"ContextualWordEmbsForSentenceAug"|Refer to ["ContextualWordEmbsForSentenceAug"](https://github.com/makcedward/nlpaug/blob/40794970124c26ce2e587e567738247bf20ebcad/nlpaug/augmenter/sentence/context_word_embs_sentence.py#L77)|NA|

### Text Generation Augmenter
The text generation augmenter provides a recipe for data augmentation based on conditional text generation with auto-regressive transformer models (such as GPT, GPT-2, Transformer-XL, XLNet, and CTRL), in order to automatically generate labeled data.
Our approach follows the algorithms described in [Not Enough Data? Deep Learning to the Rescue!](https://arxiv.org/abs/1911.03118) and [Natural Language Generation for Effective Knowledge Distillation](https://www.aclweb.org/anthology/D19-6122.pdf).

- First, we fine-tune an auto-regressive model on the training set. Each sample contains both a label and a sentence.
    - Prepare datasets:
        ```python
        import os

        from datasets import load_dataset
        from intel_extension_for_transformers.utils.utils import EOS

        os.makedirs('SST-2', exist_ok=True)  # output directory must exist
        for split in ('train', 'validation'):
            dataset = load_dataset('glue', 'sst2', split=split)
            # One sample per line: label, tab, sentence, stop token.
            with open(os.path.join('SST-2', split + '.txt'), 'w') as fw:
                for d in dataset:
                    fw.write(str(d['label']) + '\t' + d['sentence'] + EOS + '\n')
        ```

    - Fine-tune Causal Language Model

        You can use the [run_clm.py](https://github.com/huggingface/transformers/tree/v4.6.1/examples/pytorch/language-modeling/run_clm.py) script from the transformers examples to fine-tune GPT-2 (gpt2-medium) on the SST-2 task. Training uses the causal language modeling loss.

        ```shell
        DATASET=SST-2
        TRAIN_FILE=$DATASET/train.txt
        VALIDATION_FILE=$DATASET/validation.txt
        MODEL=gpt2-medium
        MODEL_DIR=model/$MODEL-$DATASET

        python3 transformers/examples/pytorch/language-modeling/run_clm.py \
            --model_name_or_path $MODEL \
            --train_file $TRAIN_FILE \
            --validation_file $VALIDATION_FILE \
            --do_train \
            --do_eval \
            --output_dir $MODEL_DIR \
            --overwrite_output_dir
        ```

- Second, we generate labeled data. Given class labels sampled from the training set, we use the fine-tuned language model to generate sentences with the script below:
    ```python
    from intel_extension_for_transformers.utils.data_augmentation import DataAugmentation

    aug = DataAugmentation(augmenter_type="TextGenerationAug")
    aug.input_dataset = "/your/original/training_set.csv"
    aug.output_path = "/your/augmented/dataset.csv"
    aug.augmenter_arguments = {'model_name_or_path': '/your/fine-tuned/model'}
    aug.data_augment()
    ```

This data augmentation algorithm can be used in several scenarios, like model distillation.
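
To make the label-conditioned format concrete, the sketch below builds one training line in the `label<TAB>sentence<EOS>` layout used when preparing the datasets, and splits a generated string back into a label and a sentence by cutting at the first stop token. It is an illustration only, not the library's code: the helpers `format_sample` and `parse_generated` are hypothetical, and the value `"</s>"` is an assumed stand-in for the library's `EOS` constant.

```python
EOS = "</s>"  # assumed stop token; the library defines its own EOS constant


def format_sample(label, sentence, eos=EOS):
    """Build one training line: label, tab, sentence, stop token."""
    return f"{label}\t{sentence}{eos}"


def parse_generated(text, eos=EOS):
    """Split generated text into (label, sentence), cutting at the first stop token."""
    text = text.split(eos, 1)[0]  # drop everything after the stop token
    label, _, sentence = text.partition("\t")
    return label.strip(), sentence.strip()


line = format_sample(1, "a charming journey")
print(repr(line))
# Anything the model emits after the stop token is discarded.
print(parse_generated(line + " trailing tokens"))
```

Because the label comes first on every line, a model fine-tuned on this format can be prompted with a label followed by a tab and asked to continue with a matching sentence.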

### Augmenter Arguments
|Parameter |Type|Description |Default value |
|:---------|:---|:-----------|:-------------|
|"model_name_or_path"|String|Language model used to generate data, refer to [line](intel_extension_for_transformers/preprocessing/data_augmentation.py#L181)|NA|
|"stop_token"|String|Stop token used in the input data file|[EOS](intel_extension_for_transformers/preprocessing/utils.py#L7)|
|"num_return_sentences"|Integer|Total number of samples to generate; -1 means the same number as the input samples|-1|
|"temperature"|Float|Sampling temperature of the CLM model|1.0|
|"k"|Float|Top-k sampling parameter|0.0|
|"p"|Float|Top-p (nucleus) sampling parameter|0.9|
|"repetition_penalty"|Float|Repetition penalty applied during generation|1.0|
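
The `temperature`, `k`, and `p` arguments are common decoding controls for sampling from a language model. The plain-Python sketch below shows how they are typically combined on a single next-token distribution: logits are divided by the temperature, optionally truncated to the `k` most likely tokens, then truncated to the smallest set whose cumulative probability reaches `p`. This illustrates the general technique and is not the library's implementation; `filter_probs` is a hypothetical helper, treating `k = 0` as "top-k disabled" is an assumption of this sketch, and `repetition_penalty` is left out for brevity.

```python
import math


def filter_probs(logits, temperature=1.0, k=0, p=0.9):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if k > 0:
        order = order[:int(k)]  # top-k: keep only the k most likely tokens

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]
print(sorted(filter_probs(logits)))                   # keeps tokens 0, 1, 2
print(sorted(filter_probs(logits, k=2)))              # keeps tokens 0, 1
print(sorted(filter_probs(logits, temperature=0.5)))  # keeps tokens 0, 1
```

A lower temperature concentrates probability on the most likely tokens, so fewer tokens are needed to reach the same `p` threshold.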