neural_compressor.experimental.data.datasets.bert_dataset

Built-in BERT dataset classes for multiple framework backends.

Module Contents

Classes

PytorchBertDataset

PyTorch dataset used for the BERT model.

ONNXRTBertDataset

ONNXRT dataset used for the BERT model.

InputFeatures

Single set of features of data.

TensorflowBertDataset

TensorFlow dataset used for the BERT model.

ParseDecodeBert

Helper class for TensorflowModelZooBertDataset.

TensorflowModelZooBertDataset

Tensorflow dataset for three-input Bert in tf record format.

Functions

load_and_cache_examples(data_dir, model_name_or_path, ...)

Load and cache the examples.

convert_examples_to_features(examples, tokenizer[, ...])

Convert examples to features.

class neural_compressor.experimental.data.datasets.bert_dataset.PytorchBertDataset(dataset, task, model_type='bert', transform=None, filter=None)

Bases: neural_compressor.experimental.data.datasets.dataset.Dataset

PyTorch dataset used for the BERT model.

This dataset is constructed from the BERT TensorDataset and is not a full implementation driven by a yaml config. The original repository is https://github.com/huggingface/transformers. To use this dataset, create it before you initialize your DataLoader. (TODO) Add end-to-end support for easy configuration through yaml by adding methods to load and process examples.

Args:
  • dataset (list) – List of data.

  • task (str) – The task of the model. Supported values: 'classifier', 'squad'.

  • model_type (str, default='bert') – Model type. Supported values: 'distilbert', 'bert', 'xlnet', 'xlm'.

  • transform (transform object, default=None) – Transform to process input data.

  • filter (Filter objects, default=None) – Filter out examples according to specific conditions.

Examples

    dataset = [[
        [101, 2043, 2001],
        [1, 1, 1],
        [[0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0]],
        [1, 1, 1],
        [1, 1, 1],
        [[0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0]]
    ]]
    dataset = PytorchBertDataset(dataset=dataset, task='classifier', model_type='bert',
                                 transform=preprocess, filter=filter)
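The `transform` and `filter` arguments can be easier to picture with a small, library-free sketch. The class below is a hypothetical stand-in, not the neural_compressor implementation: `ListBackedDataset` and the toy sample layout are invented for illustration. It assumes that the filter is applied once at construction and the transform on each access.

```python
from typing import Any, Callable, List, Optional


class ListBackedDataset:
    """Hypothetical stand-in for a list-backed, map-style dataset."""

    def __init__(
        self,
        dataset: List[Any],
        transform: Optional[Callable[[Any], Any]] = None,
        filter: Optional[Callable[[Any], bool]] = None,
    ) -> None:
        # Assumption: the filter runs once, up front, and keeps the samples
        # for which it returns True.
        self.samples = [s for s in dataset if filter is None or filter(s)]
        self.transform = transform

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> Any:
        sample = self.samples[index]
        # Assumption: the transform is applied lazily, on access.
        return self.transform(sample) if self.transform else sample


# Toy samples: (input_ids, attention_mask) pairs.
raw = [([101, 2043, 2001], [1, 1, 1]), ([101], [1])]
ds = ListBackedDataset(
    raw,
    transform=lambda s: {"input_ids": s[0], "attention_mask": s[1]},
    filter=lambda s: len(s[0]) > 1,  # drop very short samples
)
```

Because the object exposes `__len__` and `__getitem__`, it could be handed to a data loader once constructed, which matches the advice to create the dataset before initializing the DataLoader.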

class neural_compressor.experimental.data.datasets.bert_dataset.ONNXRTBertDataset(data_dir, model_name_or_path, max_seq_length=128, do_lower_case=True, task='mrpc', model_type='bert', dynamic_length=False, evaluate=True, transform=None, filter=None)

Bases: neural_compressor.experimental.data.datasets.dataset.Dataset

ONNXRT dataset used for the BERT model.

Args:
  • data_dir (str) – The input data dir.

  • model_name_or_path (str) – Path to pre-trained student model or shortcut name, selected in the list:

  • max_seq_length (int, default=128) – The maximum length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.

  • do_lower_case (bool, default=True) – Whether to lowercase the input when tokenizing.

  • task (str, default='mrpc') – The name of the task to fine-tune. Choices include mrpc, qqp, qnli, rte, sts-b, cola, mnli, wnli.

  • model_type (str, default='bert') – Model type. Supported values: 'distilbert', 'bert', 'mobilebert', 'roberta'.

  • dynamic_length (bool, default=False) – Whether to use fixed sequence length.

  • evaluate (bool, default=True) – Whether to do evaluation or training.

  • transform (transform object, default=None) – Transform to process input data.

  • filter (Filter objects, default=None) – Filter out examples according to specific conditions.

Examples

    dataset = ONNXRTBertDataset(data_dir=data_dir, model_name_or_path='bert-base-uncased',
                                transform=preprocess, filter=filter)
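The `max_seq_length` behaviour described above (truncate longer sequences, pad shorter ones) can be sketched in a few lines. This is an illustration only: `pad_or_truncate` is a hypothetical helper, and both right-side padding and a pad id of 0 are assumptions. A real tokenizer would also keep special tokens intact when truncating.

```python
from typing import List


def pad_or_truncate(
    token_ids: List[int], max_seq_length: int = 128, pad_token: int = 0
) -> List[int]:
    """Return token_ids cut or right-padded to exactly max_seq_length."""
    if max_seq_length <= 0:
        raise ValueError("max_seq_length must be positive")
    # Slicing is a no-op for sequences that are already short enough.
    truncated = token_ids[:max_seq_length]
    # Assumption: padding goes on the right and uses pad_token.
    return truncated + [pad_token] * (max_seq_length - len(truncated))


short = pad_or_truncate([101, 2043, 102], max_seq_length=5)
long = pad_or_truncate(list(range(10)), max_seq_length=5)
```

Both results have length 5: the short input gains two pad ids and the long one keeps only its first five ids.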

neural_compressor.experimental.data.datasets.bert_dataset.load_and_cache_examples(data_dir, model_name_or_path, max_seq_length, task, model_type, tokenizer, evaluate)

Load and cache the examples.

Helper function for ONNXRTBertDataset.

neural_compressor.experimental.data.datasets.bert_dataset.convert_examples_to_features(examples, tokenizer, max_length=128, task=None, label_list=None, output_mode='classification', pad_token=0, pad_token_segment_id=0, mask_padding_with_zero=True)

Convert examples to features.

Helper function for load_and_cache_examples.
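The `pad_token`, `pad_token_segment_id`, and `mask_padding_with_zero` parameters in the signature above suggest how padded features are assembled. The sketch below is a simplified illustration, not the library code: `build_features` is a hypothetical function that takes already-tokenized id lists, and its truncation (a plain cut of the concatenated pair) is an assumption.

```python
from typing import Dict, List


def build_features(
    ids_a: List[int],
    ids_b: List[int],
    max_length: int = 128,
    pad_token: int = 0,
    pad_token_segment_id: int = 0,
    mask_padding_with_zero: bool = True,
) -> Dict[str, List[int]]:
    """Build padded input_ids, attention_mask and token_type_ids for a pair."""
    # Simplified truncation: concatenate both segments and cut to max_length.
    input_ids = (ids_a + ids_b)[:max_length]
    token_type_ids = ([0] * len(ids_a) + [1] * len(ids_b))[:max_length]

    # Real tokens get 1 and padding gets 0 when mask_padding_with_zero is
    # True; the two values are swapped otherwise.
    real = 1 if mask_padding_with_zero else 0
    attention_mask = [real] * len(input_ids)

    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_token] * padding,
        "attention_mask": attention_mask + [1 - real] * padding,
        "token_type_ids": token_type_ids + [pad_token_segment_id] * padding,
    }


features = build_features([101, 7592, 102], [2088, 102], max_length=8)
```

With `max_length=8`, the five real tokens are followed by three padded positions in each of the three lists.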

class neural_compressor.experimental.data.datasets.bert_dataset.InputFeatures

Single set of features of data.

Property names are the same names as the corresponding inputs to a model.

Parameters:
  • input_ids – Indices of input sequence tokens in the vocabulary.

  • attention_mask – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: Usually 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.

  • token_type_ids – (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them.

  • label – (Optional) Label corresponding to the input. Int for classification problems, float for regression problems.

  • seq_length – (Optional) The length of input sequence before padding.

to_json_string()

Serialize this instance to a JSON string.
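As a rough illustration of the fields listed above and of JSON serialization, the sketch below defines a hypothetical `FeatureRecord` dataclass. It is not the library's `InputFeatures` class, and the exact JSON layout that `to_json_string()` produces there (indentation, key order, trailing newline) may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Union


@dataclass
class FeatureRecord:
    """Hypothetical stand-in mirroring the documented InputFeatures fields."""

    input_ids: List[int]
    attention_mask: Optional[List[int]] = None
    token_type_ids: Optional[List[int]] = None
    label: Optional[Union[int, float]] = None
    seq_length: Optional[int] = None

    def to_json_string(self) -> str:
        # sort_keys gives a stable field order; the real format may differ.
        return json.dumps(asdict(self), sort_keys=True)


record = FeatureRecord(input_ids=[101, 102], attention_mask=[1, 1], label=1, seq_length=2)
restored = json.loads(record.to_json_string())
```

Optional fields that were never set, such as `token_type_ids` here, come back as `None` after a round trip through JSON.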

class neural_compressor.experimental.data.datasets.bert_dataset.TensorflowBertDataset(root, label_file, task='squad', model_type='bert', transform=None, filter=None)

Bases: neural_compressor.experimental.data.datasets.dataset.Dataset

TensorFlow dataset used for the BERT model.

This dataset supports tfrecord data. Please refer to the Guide to create a tfrecord file first.

Args:
  • root (str) – Path of dataset.

  • label_file (str) – Path of label file.

  • task (str, default='squad') – Task type of model.

  • model_type (str, default='bert') – Model type. Supported values: 'bert'.

  • transform (transform object, default=None) – Transform to process input data.

  • filter (Filter objects, default=None) – Filter out examples according to specific conditions.

class neural_compressor.experimental.data.datasets.bert_dataset.ParseDecodeBert

Helper class for TensorflowModelZooBertDataset.

Parse the features from sample.

class neural_compressor.experimental.data.datasets.bert_dataset.TensorflowModelZooBertDataset(root, label_file, task='squad', model_type='bert', transform=None, filter=None, num_cores=28)

Bases: neural_compressor.experimental.data.datasets.dataset.Dataset

Tensorflow dataset for three-input Bert in tf record format.

`root` is the full path to the tfrecord file, including the file name. Please use the Resize transform when batch_size > 1.

Args:
  • root (str) – Path of dataset.

  • label_file (str) – Path of label file.

  • task (str, default='squad') – Task type of model.

  • model_type (str, default='bert') – Model type. Supported values: 'bert'.

  • transform (transform object, default=None) – Transform to process input data.

  • filter (Filter objects, default=None) – Filter out examples according to specific conditions.