tlt.datasets.text_classification.tf_custom_text_classification_dataset.TFCustomTextClassificationDataset

class tlt.datasets.text_classification.tf_custom_text_classification_dataset.TFCustomTextClassificationDataset(dataset_dir, dataset_name, csv_file_name, class_names=[], label_map_func=None, defaults=[tf.string, tf.string], delimiter=',', header=False, select_cols=None, exclude_cols=None, shuffle_files=True, seed=None, **kwargs)

A custom text classification dataset that can be used with TensorFlow models. Note that this dataset class expects a .csv file with two columns where the first column is the label and the second column is the text/sentence to classify.

For example, a comma-separated values file will look similar to the snippet below:

class_a,<text>
class_b,<text>
class_a,<text>
...

If the .csv file has more than two columns, the select_cols or exclude_cols parameter can be used to control which columns are parsed.
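The column filtering described above can be pictured with a small standard-library sketch. It writes a hypothetical three-column file (the rows, the file name samples.csv, and the extra record id column are all made up for illustration) and keeps only the label and text columns by index. It mimics the effect of select_cols and is not the library's own parsing code.

```python
import csv
import os
import tempfile

# Hypothetical rows: label, a record id the model does not need, and the text.
rows = [
    ("class_a", "1001", "first example sentence"),
    ("class_b", "1002", "second example sentence"),
    ("class_a", "1003", "third example sentence"),
]

# Write the rows to a temporary dataset directory (illustrative location).
dataset_dir = tempfile.mkdtemp()
csv_path = os.path.join(dataset_dir, "samples.csv")
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Sorted indices of the columns to keep: 0 is the label, 2 is the text.
select_cols = [0, 2]

# Re-read the file and keep only the selected columns of each row.
with open(csv_path, newline="") as f:
    parsed = [tuple(row[i] for i in select_cols) for row in csv.reader(f)]

print(parsed[0])  # ('class_a', 'first example sentence')
```

For a file laid out like this one, passing select_cols=[0, 2] or exclude_cols=[1] to the constructor would be expected to leave the same two columns, since at most one of the two parameters can be specified.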

Parameters
  • dataset_dir (str) – Directory containing the dataset

  • dataset_name (str) – Name of the dataset. If no dataset name is given, the dataset_dir folder name will be used as the dataset name.

  • csv_file_name (str) – Name of the csv file to load from the dataset directory

  • class_names (list) – List of ordered class names

  • label_map_func (function) – optional; Function that is mapped across the label column of the dataset to transform each label. For example, if the .csv file has string class labels instead of numerical values, provide a function that maps each string to a numerical value.

  • defaults (list) – optional; List of default values for the .csv file fields. Defaults to [tf.string, tf.string]

  • delimiter (str) – optional; Character that separates the label and text in each row. Defaults to ','.

  • header (bool) – optional; Boolean indicating whether or not the csv file has a header line that should be skipped. Defaults to False.

  • select_cols (list) – optional; Specify a list of sorted indices for columns from the dataset file(s) that should be parsed. Defaults to parsing all columns. At most one of select_cols and exclude_cols can be specified.

  • exclude_cols (list) – optional; Specify a list of sorted indices for columns from the dataset file(s) that should be excluded from parsing. Defaults to parsing all columns. At most one of select_cols and exclude_cols can be specified.

  • shuffle_files (bool) – optional; Whether to shuffle the data. Defaults to True.

  • seed (int) – optional; Random seed for shuffling
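As an illustration of the label_map_func parameter, the sketch below maps string labels to their position in an ordered class_names list. The class names are the placeholder values from the snippet above. This is plain Python that only shows the mapping logic; depending on how the library applies the function, it may receive a TensorFlow tensor rather than a Python string, in which case a TensorFlow-compatible version would be needed.

```python
# Ordered class names; the position of each name serves as its numerical label.
class_names = ["class_a", "class_b"]

# Lookup table from string label to integer index.
label_to_index = {name: index for index, name in enumerate(class_names)}


def label_map_func(label):
    """Map a string class label to its numerical value (plain-Python sketch)."""
    return label_to_index[label]


print([label_map_func(label) for label in ["class_a", "class_b", "class_a"]])  # [0, 1, 0]
```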

Raises
  • FileNotFoundError – if the csv file is not found in the dataset directory

  • TypeError – if the class_names parameter is not a list or the label_map_func is not callable

  • ValueError – if the class_names list is empty

__init__(dataset_dir, dataset_name, csv_file_name, class_names=[], label_map_func=None, defaults=[tf.string, tf.string], delimiter=',', header=False, select_cols=None, exclude_cols=None, shuffle_files=True, seed=None, **kwargs)

Class constructor
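A minimal sketch of calling the constructor follows. The helper names check_arguments and build_dataset are hypothetical, as is the dataset name my_text_dataset. The pre-checks mirror the documented Raises entries so that mistakes surface before TensorFlow is loaded; they are not the library's own validation, and the ValueError for passing both select_cols and exclude_cols is an assumption made for this sketch. The constructor call itself requires the tlt package and TensorFlow to be installed.

```python
def check_arguments(class_names, label_map_func=None, select_cols=None, exclude_cols=None):
    """Local pre-checks mirroring the documented constraints (not library code)."""
    if not isinstance(class_names, list):
        raise TypeError("class_names must be a list")
    if len(class_names) == 0:
        raise ValueError("class_names must not be empty")
    if label_map_func is not None and not callable(label_map_func):
        raise TypeError("label_map_func must be callable")
    # Assumption: the exception type for this case is chosen for the sketch.
    if select_cols is not None and exclude_cols is not None:
        raise ValueError("specify at most one of select_cols and exclude_cols")


def build_dataset(dataset_dir, csv_file_name, class_names, **options):
    """Validate the arguments, then construct the dataset."""
    check_arguments(
        class_names,
        options.get("label_map_func"),
        options.get("select_cols"),
        options.get("exclude_cols"),
    )
    # Imported lazily so the checks above can run without tlt installed.
    from tlt.datasets.text_classification.tf_custom_text_classification_dataset import (
        TFCustomTextClassificationDataset,
    )
    return TFCustomTextClassificationDataset(
        dataset_dir=dataset_dir,
        dataset_name="my_text_dataset",  # hypothetical name
        csv_file_name=csv_file_name,
        class_names=class_names,
        **options,
    )
```

A call might then look like build_dataset("/path/to/dataset", "samples.csv", ["class_a", "class_b"], select_cols=[0, 2]), where the directory and file name are placeholders.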

Methods

__init__(dataset_dir, dataset_name, ...[, ...])

Class constructor

get_batch([subset])

Get a single batch of text samples and labels from the dataset.

get_inc_dataloaders(hub_name, max_seq_length)

get_str_label(numerical_value)

Returns the string label (class name) associated with the specified numerical value.

preprocess(batch_size)

Batch the dataset

shuffle_split([train_pct, val_pct, ...])

Randomly split the dataset into train, validation, and test subsets with a pseudo-random seed option.
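To make the idea behind shuffle_split concrete, here is a standard-library sketch of a seeded random split over item indices. The function shuffle_split_indices is hypothetical and is not the library's implementation; it only borrows the train_pct and val_pct parameter names shown in the summary above, and the defaults used here are illustrative. Whatever is left after the train and validation portions becomes the test subset.

```python
import random


def shuffle_split_indices(num_items, train_pct=0.5, val_pct=0.25, seed=None):
    """Split range(num_items) into train, validation, and test index lists."""
    if train_pct + val_pct > 1.0:
        raise ValueError("train_pct and val_pct must not sum to more than 1.0")
    indices = list(range(num_items))
    # A dedicated Random instance makes the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    train_end = round(num_items * train_pct)
    val_end = train_end + round(num_items * val_pct)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train, val, test = shuffle_split_indices(100, train_pct=0.5, val_pct=0.25, seed=10)
print(len(train), len(val), len(test))  # 50 25 25
```

Passing the same seed again yields the same three subsets, which is the purpose of the pseudo-random seed option.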

Attributes

class_names

Returns the list of class names

dataset

Returns the framework dataset object (tf.data.Dataset)

dataset_catalog

The string name of the dataset catalog (or None)

dataset_dir

Host directory containing the dataset files

dataset_name

Name of the dataset

info

Returns a dictionary of information about the dataset

test_subset

A subset of the dataset held out for final testing/evaluation

train_subset

A subset of the dataset used for training

validation_subset

A subset of the dataset used for validation/evaluation