tlt.datasets.text_classification.hf_text_classification_dataset.HFTextClassificationDataset

class tlt.datasets.text_classification.hf_text_classification_dataset.HFTextClassificationDataset(dataset_dir, dataset_name, split=['train'], num_workers=0, shuffle_files=True, distributed=False)

A text classification dataset from the Hugging Face datasets catalog

__init__(dataset_dir, dataset_name, split=['train'], num_workers=0, shuffle_files=True, distributed=False)

Class constructor
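A minimal sketch of how the constructor might be called, using only the arguments shown in the signature above. The helper name build_text_dataset and the example dataset name "imdb" are assumptions made for illustration and are not taken from this page. The import is deferred so the function can be defined without the tlt package installed.

```python
def build_text_dataset(dataset_dir, dataset_name="imdb"):
    """Create the dataset object with the documented constructor arguments.

    build_text_dataset is a hypothetical helper, and "imdb" is an assumed
    example dataset name; substitute a name your catalog actually supports.
    """
    # Deferred import: tlt is only required when the function is called.
    from tlt.datasets.text_classification.hf_text_classification_dataset import (
        HFTextClassificationDataset,
    )

    return HFTextClassificationDataset(
        dataset_dir=dataset_dir,    # host directory containing the dataset files
        dataset_name=dataset_name,  # name of the dataset in the catalog
        split=["train"],            # default shown in the signature
        num_workers=0,              # default shown in the signature
        shuffle_files=True,         # default shown in the signature
        distributed=False,          # default shown in the signature
    )
```

Calling build_text_dataset("/tmp/data") would need the tlt package and, most likely, network access to fetch the dataset.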

Methods

__init__(dataset_dir, dataset_name[, split, ...])

Class constructor

get_batch([subset])

Get a single batch of text samples and labels from the dataset.

get_inc_dataloaders()

get_str_label(numerical_value)

Returns the string label (class name) associated with the specified numerical value.

get_text(input_ids)

Helper function that decodes input_ids back to text.

load_hf_dataset(dataset_name, split)

Helper function that loads the dataset from the Hugging Face catalog.

preprocess(model_name[, batch_size, ...])

Preprocess the text dataset by tokenizing it with padding and truncation.

shuffle_split([train_pct, val_pct, ...])

Randomly split the dataset into train, validation, and test subsets with a pseudo-random seed option.
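The idea behind shuffle_split can be sketched in plain Python as a seeded shuffle of the sample indices followed by slicing. This is an illustration of the technique, not the library's implementation. The function name shuffle_split_indices, the test_pct and seed parameter names, and all default values are assumptions, since the signature above elides them.

```python
import random


def shuffle_split_indices(n, train_pct=0.75, val_pct=0.25, test_pct=0.0, seed=None):
    """Split range(n) into disjoint train/validation/test index lists.

    Hypothetical helper: parameter names beyond train_pct and val_pct, and
    all default values, are assumptions made for this sketch.
    """
    pcts = (train_pct, val_pct, test_pct)
    if any(p < 0 for p in pcts):
        raise ValueError("Percentages must be non-negative")
    if sum(pcts) > 1.0 + 1e-9:
        raise ValueError("Percentages must not sum to more than 1.0")

    indices = list(range(n))
    # A dedicated Random instance makes the shuffle reproducible for a given
    # seed without touching the global random state.
    random.Random(seed).shuffle(indices)

    # The small epsilon guards against float products such as 28.999999...
    n_train = int(n * train_pct + 1e-9)
    n_val = int(n * val_pct + 1e-9)
    n_test = int(n * test_pct + 1e-9)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:n_train + n_val + n_test]
    return train, val, test
```

Passing the same seed twice yields the same three index lists, which is what makes a split repeatable across runs.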

Attributes

class_names

Returns a list of class labels

dataset

Returns the underlying datasets.arrow_dataset.Dataset object

dataset_catalog

The string name of the dataset catalog (or None)

dataset_dir

Host directory containing the dataset files

dataset_name

Name of the dataset

info

Returns a dictionary of information about the dataset

test_loader

test_subset

A subset of the dataset held out for final testing/evaluation

train_loader

train_subset

A subset of the dataset used for training

validation_loader

validation_subset

A subset of the dataset used for validation/evaluation
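One way to picture how class_names relates to get_str_label is as a plain index lookup. The sketch below assumes that numerical labels are zero-based positions in the class_names list, which this page does not confirm. The function name str_label and the example labels are hypothetical.

```python
def str_label(class_names, numerical_value):
    """Return the class name at position numerical_value.

    Hypothetical stand-in for get_str_label; assumes labels are zero-based
    indices into class_names.
    """
    index = int(numerical_value)
    if not 0 <= index < len(class_names):
        raise ValueError(f"No class name for label {numerical_value!r}")
    return class_names[index]


# Assumed example labels for a binary sentiment dataset.
example_class_names = ["negative", "positive"]
example_label = str_label(example_class_names, 1)
```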