tlt.datasets.text_classification.hf_text_classification_dataset.HFTextClassificationDataset

class tlt.datasets.text_classification.hf_text_classification_dataset.HFTextClassificationDataset(dataset_dir, dataset_name, split=['train'], num_workers=0, shuffle_files=True, distributed=False)

A text classification dataset from the Hugging Face datasets catalog

__init__(dataset_dir, dataset_name, split=['train'], num_workers=0, shuffle_files=True, distributed=False): Class constructor
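
A minimal usage sketch, built only from the signatures listed on this page. The dataset name `imdb`, the model name `bert-base-uncased`, the directory `/tmp/data`, and the argument values are assumptions chosen for illustration, and `load_example_dataset` is a hypothetical helper, not part of the library. The import is deferred inside the function so the snippet can be defined without the package installed.

```python
def load_example_dataset(dataset_dir="/tmp/data"):
    """Sketch: load, tokenize, and split a Hugging Face text classification dataset."""
    # Deferred import: calling this function requires the tlt package.
    from tlt.datasets.text_classification.hf_text_classification_dataset import (
        HFTextClassificationDataset,
    )

    # "imdb" is an assumed example name from the Hugging Face datasets catalog.
    dataset = HFTextClassificationDataset(
        dataset_dir=dataset_dir,
        dataset_name="imdb",
        split=["train"],
    )
    # "bert-base-uncased" is an assumed model name used to pick the tokenizer.
    dataset.preprocess("bert-base-uncased", batch_size=32)
    # Illustrative percentages: 75% training, 25% validation.
    dataset.shuffle_split(train_pct=0.75, val_pct=0.25)
    return dataset
```

Calling `load_example_dataset()` would download the dataset into `dataset_dir` if it is not already present, so it is best run in an environment with network access.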

Methods

`__init__`(dataset_dir, dataset_name[, split, ...])	Class constructor
`get_batch`([subset])	Get a single batch of text inputs and labels from the dataset.
`get_inc_dataloaders`()
`get_str_label`(numerical_value)	Returns the string label (class name) associated with the specified numerical value.
`get_text`(input_ids)	Helper function to decode the input_ids to text
`load_hf_dataset`(dataset_name, split)	Helper function to load the dataset from the Hugging Face catalog
`preprocess`(model_name[, batch_size, ...])	Tokenize the text dataset, applying padding and truncation.
`shuffle_split`([train_pct, val_pct, ...])	Randomly split the dataset into train, validation, and test subsets with a pseudo-random seed option.
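
The idea behind `shuffle_split` can be illustrated with a small standalone sketch. This is not the library's implementation: the function name `shuffle_split_indices` and the `test_pct` and `seed` parameters are hypothetical, chosen to mirror the description above. It shuffles the example indices with a seeded pseudo-random generator and slices them into three disjoint subsets.

```python
import random


def shuffle_split_indices(n, train_pct=0.75, val_pct=0.25, test_pct=0.0, seed=None):
    """Return (train, val, test) index lists for a dataset of n examples."""
    if min(train_pct, val_pct, test_pct) < 0:
        raise ValueError("Percentages must be non-negative")
    if train_pct + val_pct + test_pct > 1.0 + 1e-9:
        raise ValueError("Percentages must sum to at most 1.0")

    # A dedicated generator keeps the shuffle reproducible for a given seed
    # without touching the global random state.
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)

    # Slice the shuffled indices into consecutive, non-overlapping ranges.
    train_end = int(n * train_pct)
    val_end = train_end + int(n * val_pct)
    test_end = val_end + int(n * test_pct)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:test_end]


train_idx, val_idx, test_idx = shuffle_split_indices(100, 0.5, 0.25, 0.25, seed=10)
```

Passing the same `seed` yields the same split on every call, which is what makes experiments repeatable; with `seed=None` the split differs between runs.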

Attributes

`class_names`	Returns a list of class labels
`dataset`	Returns datasets.arrow_dataset.Dataset object
`dataset_catalog`	The string name of the dataset catalog (or None)
`dataset_dir`	Host directory containing the dataset files
`dataset_name`	Name of the dataset
`info`	Returns a dictionary of information about the dataset
`test_loader`
`test_subset`	A subset of the dataset held out for final testing/evaluation
`train_loader`
`train_subset`	A subset of the dataset used for training
`validation_loader`
`validation_subset`	A subset of the dataset used for validation/evaluation