tlt.datasets.text_generation.hf_custom_text_generation_dataset.HFCustomTextGenerationDataset

class tlt.datasets.text_generation.hf_custom_text_generation_dataset.HFCustomTextGenerationDataset(dataset_dir, dataset_name: Optional[str], dataset_file: str, validation_file: Optional[str] = None, num_workers: int = 0, shuffle_files: bool = True, seed: Optional[int] = None)[source]

A custom text generation dataset that can be used with Transformer models.

__init__(dataset_dir, dataset_name: Optional[str], dataset_file: str, validation_file: Optional[str] = None, num_workers: int = 0, shuffle_files: bool = True, seed: Optional[int] = None)[source]

A custom text generation dataset that can be used with Transformer models. Note that this dataset class expects a .json, .txt, or .csv file with records that contain up to three keys, such as "instruction", "input", and "output".

For example, a JSON-formatted file will look similar to the snippet below:

[
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow."
    },
    {
        "instruction": "Identify the odd one out.",
        "input": "Twitter, Instagram, Telegram",
        "output": "Telegram"
    }
]
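A file in this format can be produced with Python's standard library. The sketch below writes the example records above to a temporary directory; the directory and file names are placeholders:

```python
import json
import os
import tempfile

# Records with the three expected keys: "instruction", "input", "output"
records = [
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow.",
    },
    {
        "instruction": "Identify the odd one out.",
        "input": "Twitter, Instagram, Telegram",
        "output": "Telegram",
    },
]

# Write the records to <dataset_dir>/dataset.json (placeholder names)
dataset_dir = tempfile.mkdtemp()
dataset_file = os.path.join(dataset_dir, "dataset.json")
with open(dataset_file, "w") as f:
    json.dump(records, f, indent=4)
```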

Parameters
  • dataset_dir (str) – Directory containing the dataset

  • dataset_name (str) – Optional; name of the dataset. If no dataset name is given, the dataset_dir folder name will be used as the dataset name.

  • dataset_file (str) – Name of the training file to load from the dataset directory; must be a .json, .txt, or .csv file

  • validation_file (str) – Optional; name of the validation file to load from the dataset directory; must be a .json, .txt, or .csv file

  • num_workers (int) – Number of workers to pass into a DataLoader. Defaults to 0.

  • shuffle_files (bool) – Optional; whether to shuffle the data. Defaults to True.

  • seed (int) – Optional; random seed for shuffling. Defaults to None.

Raises

FileNotFoundError – if the file is not found in the dataset directory
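As a hedged sketch of constructing the class (assuming tlt is installed and a dataset.json file exists in the dataset directory), the import is guarded so the snippet degrades gracefully when tlt is unavailable; the dataset name and record contents are placeholders:

```python
import json
import os
import tempfile

# Create a minimal dataset directory with one placeholder record
dataset_dir = tempfile.mkdtemp()
with open(os.path.join(dataset_dir, "dataset.json"), "w") as f:
    json.dump(
        [{"instruction": "Say hello.", "input": "", "output": "Hello!"}], f
    )

try:
    # Class path as documented above
    from tlt.datasets.text_generation.hf_custom_text_generation_dataset import (
        HFCustomTextGenerationDataset,
    )

    dataset = HFCustomTextGenerationDataset(
        dataset_dir,
        dataset_name="my_dataset",  # falls back to the folder name if None
        dataset_file="dataset.json",
        shuffle_files=True,
        seed=10,
    )
except ImportError:
    dataset = None  # tlt is not installed in this environment
```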

Methods

__init__(dataset_dir, dataset_name, dataset_file)

A custom text generation dataset that can be used with Transformer models.

get_batch([subset])

Get a single batch of data and labels from the dataset.

get_inc_dataloaders()

get_text(input_ids)

Helper function to decode the input_ids to text.

preprocess(model_name[, batch_size, ...])

Preprocess the text dataset by applying padding and truncation and tokenizing the records.

shuffle_split([train_pct, val_pct, ...])

Randomly split the dataset into train, validation, and test subsets with a pseudo-random seed option.
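The method table above suggests a typical flow: tokenize with preprocess(), split with shuffle_split(), then sample with get_batch(). A minimal sketch, assuming a constructed HFCustomTextGenerationDataset object; the model name, batch size, and split percentages are illustrative placeholders, not library defaults:

```python
def prepare_for_training(dataset, model_name="distilgpt2", batch_size=8):
    """Sketch of the preprocess -> shuffle_split -> get_batch sequence.

    `dataset` is assumed to be an HFCustomTextGenerationDataset; the
    model name and percentages here are illustrative placeholders.
    """
    # Tokenize, pad, and truncate the records for the named model
    dataset.preprocess(model_name, batch_size=batch_size)

    # Randomly split into train/validation subsets
    dataset.shuffle_split(train_pct=0.75, val_pct=0.25)

    # Pull a single batch; the optional `subset` argument of get_batch()
    # selects which split to sample from
    return dataset.get_batch()
```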

Attributes

dataset

The framework dataset object

dataset_catalog

The string name of the dataset catalog (or None)

dataset_dir

Host directory containing the dataset files

dataset_name

Name of the dataset

info

test_loader

test_subset

A subset of the dataset held out for final testing/evaluation

train_loader

train_subset

A subset of the dataset used for training

validation_loader

validation_subset

A subset of the dataset used for validation/evaluation