neural_compressor.data.datasets.bert_dataset
Built-in BERT datasets class for multiple framework backends.
Module Contents
Classes
PytorchBertDataset | PyTorch dataset used for model Bert.
ONNXRTBertDataset | ONNXRT dataset used for model Bert.
InputFeatures | Single set of features of data.
TensorflowBertDataset | Tensorflow dataset used for model Bert.
ParseDecodeBert | Helper function for TensorflowModelZooBertDataset.
TensorflowModelZooBertDataset | Tensorflow dataset for three-input Bert in tf record format.
Functions
load_and_cache_examples | Load and cache the examples.
convert_examples_to_features | Convert examples to features.
- class neural_compressor.data.datasets.bert_dataset.PytorchBertDataset(dataset, task, model_type='bert', transform=None, filter=None)
Bases:
neural_compressor.data.datasets.dataset.Dataset
PyTorch dataset used for model Bert.
This Dataset is constructed from the Bert TensorDataset and is not a full implementation driven by yaml config. The original repo link is: https://github.com/huggingface/transformers. To use this Dataset, add it before you initialize your DataLoader. (TODO) Add end-to-end support for easy yaml configuration by adding methods to load and process examples.
- Args:
dataset (list): list of data.
task (str): the task of the model, support 'classifier', 'squad'.
model_type (str, default='bert'): model type, support 'distilbert', 'bert', 'xlnet', 'xlm'.
transform (transform object, default=None): transform to process input data.
filter (Filter objects, default=None): filter out examples according to specific conditions.
Examples
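    from neural_compressor.data.datasets.bert_dataset import PytorchBertDataset

    # preprocess and filter are user-supplied transform and Filter objects.
    dataset = [[
        [101,2043,2001], [1,1,1], [[0,0,0,0,0,0,0],
        [0,0,0,0,0,0,0], [0,0,0,0,0,0,0]],
        [1,1,1], [1,1,1], [[0,0,0,0,0,0,0],
        [0,0,0,0,0,0,0], [0,0,0,0,0,0,0]]
    ]]
    dataset = PytorchBertDataset(dataset=dataset, task='classifier', model_type='bert',
                                 transform=preprocess, filter=filter)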
- class neural_compressor.data.datasets.bert_dataset.ONNXRTBertDataset(data_dir, model_name_or_path, max_seq_length=128, do_lower_case=True, task='mrpc', model_type='bert', dynamic_length=False, evaluate=True, transform=None, filter=None)
Bases:
neural_compressor.data.datasets.dataset.Dataset
ONNXRT dataset used for model Bert.
- Args:
data_dir (str): The input data dir.
model_name_or_path (str): Path to pre-trained student model or shortcut name, selected in the list:
max_seq_length (int, default=128): The maximum length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.
do_lower_case (bool, default=True): Whether to lowercase the input when tokenizing.
task (str, default='mrpc'): The name of the task to fine-tune. Choices include mrpc, qqp, qnli, rte, sts-b, cola, mnli, wnli.
model_type (str, default='bert'): model type, support 'distilbert', 'bert', 'mobilebert', 'roberta'.
dynamic_length (bool, default=False): Whether to use fixed sequence length.
evaluate (bool, default=True): Whether to do evaluation or training.
transform (transform object, default=None): transform to process input data.
filter (Filter objects, default=None): filter out examples according to specific conditions.
Examples
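    from neural_compressor.data.datasets.bert_dataset import ONNXRTBertDataset

    # data_dir, preprocess and filter are supplied by the caller.
    dataset = ONNXRTBertDataset(data_dir=data_dir, model_name_or_path='bert-base-uncased',
                                transform=preprocess, filter=filter)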
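The max_seq_length rule described in the Args above (longer sequences are truncated, shorter ones are padded) can be illustrated with a small standalone sketch. This is not the library's code: the function name pad_or_truncate, the right-side padding, and the pad id 0 are assumptions made for illustration, and a real tokenizer would also take care of special tokens when truncating.

```python
from typing import List, Tuple


def pad_or_truncate(token_ids: List[int], max_seq_length: int = 128,
                    pad_token: int = 0) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_seq_length long.

    Hypothetical helper for illustration only; not part of neural_compressor.
    """
    # Sequences longer than the limit are cut off at the end.
    input_ids = token_ids[:max_seq_length]
    # Every real token gets mask value 1.
    attention_mask = [1] * len(input_ids)
    # Shorter sequences are right-padded; padded positions get mask value 0.
    padding = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_token] * padding
    attention_mask = attention_mask + [0] * padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 2043, 2001, 102], max_seq_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_seq_length=6)
```

Both calls return lists of length 6, so samples of different original lengths end up with the same shape.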
- neural_compressor.data.datasets.bert_dataset.load_and_cache_examples(data_dir, model_name_or_path, max_seq_length, task, model_type, tokenizer, evaluate)
Load and cache the examples.
Helper function for ONNXRTBertDataset.
- neural_compressor.data.datasets.bert_dataset.convert_examples_to_features(examples, tokenizer, max_length=128, task=None, label_list=None, output_mode='classification', pad_token=0, pad_token_segment_id=0, mask_padding_with_zero=True)
Convert examples to features.
Helper function for load_and_cache_examples.
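The signature above exposes label_list, output_mode and mask_padding_with_zero, but the page does not describe how they are used. The sketch below shows one plausible reading, modeled on the parameter names alone: labels become an index into label_list for classification and a float for regression, and mask_padding_with_zero selects which value marks padding. The helper names encode_label and build_attention_mask are hypothetical and do not exist in neural_compressor.

```python
from typing import Dict, List, Optional, Union


def encode_label(label: str, output_mode: str = "classification",
                 label_list: Optional[List[str]] = None) -> Union[int, float]:
    """Map a raw label to an int (classification) or float (regression).

    Hypothetical helper; assumed behavior, not the library's implementation.
    """
    if output_mode == "classification":
        if label_list is None:
            raise ValueError("label_list is required for classification")
        # Position in label_list becomes the integer class id.
        label_map: Dict[str, int] = {name: i for i, name in enumerate(label_list)}
        return label_map[label]
    if output_mode == "regression":
        return float(label)
    raise ValueError(f"unsupported output_mode: {output_mode}")


def build_attention_mask(num_tokens: int, max_length: int = 128,
                         mask_padding_with_zero: bool = True) -> List[int]:
    """Build a mask of max_length entries for num_tokens real tokens."""
    num_tokens = min(num_tokens, max_length)
    # With mask_padding_with_zero=True, real tokens are 1 and padding is 0;
    # otherwise the two values are swapped.
    real_value, pad_value = (1, 0) if mask_padding_with_zero else (0, 1)
    return [real_value] * num_tokens + [pad_value] * (max_length - num_tokens)


class_id = encode_label("1", "classification", label_list=["0", "1"])
score = encode_label("2.5", "regression")
mask = build_attention_mask(3, max_length=5)
inverted_mask = build_attention_mask(3, max_length=5, mask_padding_with_zero=False)
```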
- class neural_compressor.data.datasets.bert_dataset.InputFeatures
Single set of features of data.
Property names are the same names as the corresponding inputs to a model.
- Parameters:
input_ids – Indices of input sequence tokens in the vocabulary.
attention_mask – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: usually 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.
token_type_ids – (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them.
label – (Optional) Label corresponding to the input. Int for classification problems, float for regression problems.
seq_length – (Optional) The length of input sequence before padding.
- to_json_string()
Serialize this instance to a JSON string.
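To make the shape of such a feature record concrete, here is a minimal stand-in built with a dataclass. The class name InputFeaturesSketch, the constructor arguments, and the JSON formatting are assumptions for illustration; only the field names and the to_json_string method name come from the description above.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Union


@dataclass
class InputFeaturesSketch:
    """Illustrative stand-in for a single set of features (not the real class)."""

    input_ids: List[int]
    attention_mask: List[int]
    token_type_ids: Optional[List[int]] = None
    label: Optional[Union[int, float]] = None
    seq_length: Optional[int] = None

    def to_json_string(self) -> str:
        # Serialize all fields, including unset optional ones, as JSON.
        return json.dumps(asdict(self), indent=2, sort_keys=True) + "\n"


features = InputFeaturesSketch(
    input_ids=[101, 2043, 2001, 0],
    attention_mask=[1, 1, 1, 0],  # last position is padding
    label=1,                      # int label, as for a classification task
    seq_length=3,                 # length before padding
)
serialized = features.to_json_string()
```

Parsing the string back with json.loads recovers the same field values, with unset optional fields appearing as null.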
- class neural_compressor.data.datasets.bert_dataset.TensorflowBertDataset(root, label_file, task='squad', model_type='bert', transform=None, filter=None)
Bases:
neural_compressor.data.datasets.dataset.Dataset
Tensorflow dataset used for model Bert.
This dataset supports tfrecord data. Please refer to the Guide to create a tfrecord file first.
- Args:
root (str): path of dataset.
label_file (str): path of label file.
task (str, default='squad'): task type of model.
model_type (str, default='bert'): model type, support 'bert'.
transform (transform object, default=None): transform to process input data.
filter (Filter objects, default=None): filter out examples according to specific conditions.
- class neural_compressor.data.datasets.bert_dataset.ParseDecodeBert
Helper function for TensorflowModelZooBertDataset.
Parse the features from sample.
- class neural_compressor.data.datasets.bert_dataset.TensorflowModelZooBertDataset(root, label_file, task='squad', model_type='bert', transform=None, filter=None, num_cores=28)
Bases:
neural_compressor.data.datasets.dataset.Dataset
Tensorflow dataset for three-input Bert in tf record format.
Root is a full path to the tfrecord file, which contains the file name. Please use the Resize transform when batch_size > 1.
- Args:
root (str): path of dataset.
label_file (str): path of label file.
task (str, default='squad'): task type of model.
model_type (str, default='bert'): model type, support 'bert'.
transform (transform object, default=None): transform to process input data.
filter (Filter objects, default=None): filter out examples according to specific conditions.