tlt.datasets.text_classification.hf_text_classification_dataset.HFTextClassificationDataset
- class tlt.datasets.text_classification.hf_text_classification_dataset.HFTextClassificationDataset(dataset_dir, dataset_name, split=['train'], num_workers=0, shuffle_files=True, distributed=False)
A text classification dataset from the Hugging Face datasets catalog
- __init__(dataset_dir, dataset_name, split=['train'], num_workers=0, shuffle_files=True, distributed=False)
Class constructor
Methods
__init__(dataset_dir, dataset_name[, split, ...]): Class constructor
get_batch([subset]): Get a single batch of text inputs and labels from the dataset.
get_inc_dataloaders()
get_str_label(numerical_value): Returns the string label (class name) associated with the specified numerical value.
get_text(input_ids): Helper function to decode the input_ids to text
load_hf_dataset(dataset_name, split): Helper function to load the dataset from the Hugging Face catalog
preprocess(model_name[, batch_size, ...]): Tokenize the text dataset, applying padding and truncation.
shuffle_split([train_pct, val_pct, ...]): Randomly split the dataset into train, validation, and test subsets, with an optional pseudo-random seed.
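The idea behind a seeded random split, as described for shuffle_split, can be illustrated with a minimal standalone sketch. This is not the library's implementation: the function name split_indices and the test_pct and seed parameter names are hypothetical, chosen only for illustration (the summary above shows train_pct and val_pct and elides the rest).

```python
import random
from typing import List, Optional, Tuple


def split_indices(
    n: int,
    train_pct: float = 0.75,
    val_pct: float = 0.25,
    test_pct: float = 0.0,      # hypothetical parameter name
    seed: Optional[int] = None,  # hypothetical parameter name
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    if min(train_pct, val_pct, test_pct) < 0:
        raise ValueError("percentages must be non-negative")
    if train_pct + val_pct + test_pct > 1.0 + 1e-9:
        raise ValueError("percentages must sum to at most 1.0")

    indices = list(range(n))
    # A dedicated Random instance makes the shuffle reproducible for a given
    # seed without touching the global random state.
    random.Random(seed).shuffle(indices)

    # Cut the shuffled indices into three consecutive, non-overlapping slices.
    train_end = int(n * train_pct)
    val_end = train_end + int(n * val_pct)
    test_end = val_end + int(n * test_pct)
    return (
        indices[:train_end],
        indices[train_end:val_end],
        indices[val_end:test_end],
    )
```

Passing the same seed twice yields the same three index lists, which is what makes an experiment repeatable; leaving seed as None gives a different split on each call.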
Attributes
class_names: Returns a list of class labels
dataset: Returns datasets.arrow_dataset.Dataset object
dataset_catalog: The string name of the dataset catalog (or None)
dataset_dir: Host directory containing the dataset files
dataset_name: Name of the dataset
info: Returns a dictionary of information about the dataset
test_loader
test_subset: A subset of the dataset held out for final testing/evaluation
train_loader
train_subset: A subset of the dataset used for training
validation_loader
validation_subset: A subset of the dataset used for validation/evaluation
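A typical workflow with this class might look like the sketch below, which uses only the constructor, methods, and attributes listed on this page. Several things are assumptions rather than documented behavior: the wrapper function load_and_split is hypothetical, the default values "imdb" and "bert-base-uncased" are illustrative placeholders, and calling preprocess before shuffle_split is an assumed ordering.

```python
def load_and_split(
    dataset_dir: str,
    dataset_name: str = "imdb",             # illustrative placeholder
    model_name: str = "bert-base-uncased",  # illustrative placeholder
    batch_size: int = 32,
):
    """Hypothetical helper: load, tokenize, and split a Hugging Face text dataset."""
    # Imported lazily so this module can be loaded without tlt installed.
    from tlt.datasets.text_classification.hf_text_classification_dataset import (
        HFTextClassificationDataset,
    )

    dataset = HFTextClassificationDataset(
        dataset_dir=dataset_dir,
        dataset_name=dataset_name,
        split=["train"],
    )
    # Tokenize with padding and truncation for the chosen model
    # (assumed to come before splitting).
    dataset.preprocess(model_name, batch_size=batch_size)
    # Split into training and validation subsets.
    dataset.shuffle_split(train_pct=0.75, val_pct=0.25)
    return dataset
```

After a call such as `ds = load_and_split("/tmp/data")`, the class_names, train_subset, and validation_subset attributes and the get_batch() method listed above would be the natural places to inspect the result.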