tlt.datasets.text_classification.hf_custom_text_classification_dataset.HFCustomTextClassificationDataset

class tlt.datasets.text_classification.hf_custom_text_classification_dataset.HFCustomTextClassificationDataset(dataset_dir, dataset_name: Optional[str], csv_file_name: str, class_names: Optional[List[str]] = None, column_names: Optional[List[str]] = None, label_map_func: Optional[callable] = None, label_col: Optional[int] = 0, delimiter: Optional[str] = ',', header: Optional[bool] = False, select_cols: Optional[List[int]] = None, exclude_cols: Optional[List[int]] = None, shuffle_files: Optional[bool] = True, num_workers: Optional[int] = 0)

A custom text classification dataset that can be used with Transformer models.

__init__(dataset_dir, dataset_name: Optional[str], csv_file_name: str, class_names: Optional[List[str]] = None, column_names: Optional[List[str]] = None, label_map_func: Optional[callable] = None, label_col: Optional[int] = 0, delimiter: Optional[str] = ',', header: Optional[bool] = False, select_cols: Optional[List[int]] = None, exclude_cols: Optional[List[int]] = None, shuffle_files: Optional[bool] = True, num_workers: Optional[int] = 0)

A custom text classification dataset that can be used with Transformer models. Note that this dataset class expects a .csv file with two columns where the first column is the label and the second column is the text/sentence to classify.

For example, a comma separated value file will look similar to the snippet below:

class_a,<text>
class_b,<text>
class_a,<text>
...

If the .csv file has more than two columns, the select_cols or exclude_cols parameter can be used to choose which columns are parsed.
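As a rough illustration of the file layout and of what column selection means, the sketch below uses only Python's standard csv module. It does not call the tlt library: the sample rows and the parse_rows helper are hypothetical, and the library's own parsing may behave differently.

```python
import csv
import io

# Hypothetical sample: label in column 0, an id in column 1, text in column 2.
SAMPLE = (
    "class_a,17,first sentence\n"
    "class_b,42,second sentence\n"
    "class_a,99,third sentence\n"
)


def parse_rows(text, delimiter=",", select_cols=None, exclude_cols=None):
    """Parse delimited text, keeping only the requested column indices.

    Hypothetical helper that mimics the documented meaning of select_cols
    and exclude_cols; it is not the library's implementation.
    """
    if select_cols is not None and exclude_cols is not None:
        raise ValueError("At most one of select_cols and exclude_cols can be specified.")
    rows = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if select_cols is not None:
            row = [row[i] for i in select_cols]
        elif exclude_cols is not None:
            row = [value for i, value in enumerate(row) if i not in exclude_cols]
        rows.append(row)
    return rows


# Both calls drop the id column and keep the label and the text.
selected = parse_rows(SAMPLE, select_cols=[0, 2])
excluded = parse_rows(SAMPLE, exclude_cols=[1])
print(selected[0])
```

In this sketch, selecting columns 0 and 2 and excluding column 1 yield the same two-column rows, which matches the label-then-text layout the class expects.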

Parameters

dataset_dir (str) – Directory containing the dataset
dataset_name (str) – Name of the dataset. If no dataset name is given, the dataset_dir folder name will be used as the dataset name.
csv_file_name (str) – Name of the file to load from the dataset directory
class_names (list(str)) – optional; List of ordered class names. If None, class_names are inferred from the label_col column.
column_names (list(str)) – optional; List of column names. If given, exactly one entry must be “label”, at the position given by the label_col argument. If None, the label_col column is named “label” and the remaining columns are named “text_1”, “text_2”, …
label_map_func (function) – optional; Function that is mapped across the label column of the dataset to transform its elements. For example, if the .csv file has string class labels instead of numerical values, you can provide a function that maps each string to a numerical value. Alternatively, specify the index of the label column to apply a default label_map_func, which assigns an integer to every unique class label, starting with 0.
label_col (int) – optional; Column index of the dataset to use as the label column. Defaults to 0.
delimiter (str) – optional; Character that separates the columns in each row. Defaults to “,”.
header (bool) – optional; Boolean indicating whether or not the csv file has a header line that should be skipped. Defaults to False.
select_cols (list) – optional; Specify a list of sorted indices for columns from the dataset file(s) that should be parsed. Defaults to parsing all columns. At most one of select_cols and exclude_cols can be specified.
exclude_cols (list) – optional; Specify a list of sorted indices for columns from the dataset file(s) that should be excluded from parsing. Defaults to parsing all columns. At most one of select_cols and exclude_cols can be specified.
shuffle_files (bool) – optional; Whether to shuffle the data. Defaults to True.
num_workers (int) – optional; Number of workers to pass into a DataLoader. Defaults to 0.
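The default naming and label mapping described for column_names and label_map_func can be sketched in plain Python. Both helpers below are hypothetical and exist only to illustrate the documented behavior. In particular, numbering the labels in order of first appearance is an assumption; the library may order classes differently.

```python
def default_column_names(num_cols, label_col=0):
    """Name the label column "label" and the others "text_1", "text_2", ..."""
    names = []
    text_index = 1
    for i in range(num_cols):
        if i == label_col:
            names.append("label")
        else:
            names.append(f"text_{text_index}")
            text_index += 1
    return names


def default_label_map(labels):
    """Assign an integer to every unique label, starting with 0.

    Assumption: integers follow the order in which labels first appear.
    """
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping


names = default_column_names(3, label_col=1)
mapping = default_label_map(["class_a", "class_b", "class_a"])
encoded = [mapping[label] for label in ["class_a", "class_b", "class_a"]]
print(names, mapping, encoded)
```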

Raises

FileNotFoundError – if the csv file is not found in the dataset directory
TypeError – if label_map_func is not callable
ValueError – if class_names list is empty
ValueError – if column_names list does not contain the value ‘label’
ValueError – if index of ‘label’ in column_names and label_col mismatch
ValueError – if the values of column_names are not strings
ValueError – if column_names contains more than one value as ‘label’
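The argument checks listed above can be approximated with a small validation function. This is a sketch under stated assumptions, not the library's code: validate_arguments is a hypothetical name, the order of the checks is a guess, and the file-existence check is omitted.

```python
def validate_arguments(column_names=None, label_col=0, class_names=None,
                       label_map_func=None):
    """Hypothetical validator mirroring the documented TypeError/ValueError cases."""
    if label_map_func is not None and not callable(label_map_func):
        raise TypeError("label_map_func must be callable")
    if class_names is not None and len(class_names) == 0:
        raise ValueError("class_names list is empty")
    if column_names is not None:
        if not all(isinstance(name, str) for name in column_names):
            raise ValueError("column_names values must be strings")
        label_count = column_names.count("label")
        if label_count == 0:
            raise ValueError("column_names must contain the value 'label'")
        if label_count > 1:
            raise ValueError("column_names contains more than one 'label'")
        if column_names.index("label") != label_col:
            raise ValueError("position of 'label' does not match label_col")


# A consistent combination passes silently.
validate_arguments(column_names=["label", "text_1"], label_col=0)
```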

Methods

`__init__`(dataset_dir, dataset_name, ...[, ...])	A custom text classification dataset that can be used with Transformer models.
`get_batch`([subset])	Get a single batch of text samples and labels from the dataset.
`get_inc_dataloaders`()
`get_str_label`(numerical_value)	Returns the string label (class name) associated with the specified numerical value.
`get_text`(input_ids)	Helper function to decode the input_ids to text
`preprocess`(model_name[, batch_size, ...])	Preprocess the textual dataset by applying tokenization, padding, and truncation.
`shuffle_split`([train_pct, val_pct, ...])	Randomly split the dataset into train, validation, and test subsets with a pseudo-random seed option.
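To make the idea behind shuffle_split concrete, the following sketch performs a seeded random split of row indices using only the standard library. It is not the library's method: shuffle_split_indices is a hypothetical helper, the seed parameter name and the default percentages are assumptions, and only train_pct and val_pct come from the summary above.

```python
import random


def shuffle_split_indices(num_items, train_pct=0.75, val_pct=0.25, seed=None):
    """Shuffle indices and cut them into train, validation, and test lists.

    Whatever remains after the train and validation shares becomes the test
    subset. The defaults here are illustrative assumptions.
    """
    if train_pct + val_pct > 1.0:
        raise ValueError("train_pct and val_pct must not sum to more than 1.0")
    indices = list(range(num_items))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    train_end = int(num_items * train_pct)
    val_end = train_end + int(num_items * val_pct)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train, val, test = shuffle_split_indices(10, train_pct=0.6, val_pct=0.2, seed=10)
print(len(train), len(val), len(test))
```

Passing the same seed again reproduces the same three subsets, which is the point of a pseudo-random seed option.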

Attributes

`class_names`
`dataset`	The framework dataset object
`dataset_catalog`	The string name of the dataset catalog (or None)
`dataset_dir`	Host directory containing the dataset files
`dataset_name`	Name of the dataset
`info`
`test_loader`
`test_subset`	A subset of the dataset held out for final testing/evaluation
`train_loader`
`train_subset`	A subset of the dataset used for training
`validation_loader`
`validation_subset`	A subset of the dataset used for validation/evaluation