neural_compressor.data.transforms.tokenization

Tokenization helper classes.

Classes

FullTokenizer

Run end-to-end tokenization.

BasicTokenizer

Run basic tokenization (punctuation splitting, lowercasing, etc.).

WordpieceTokenizer

Run WordPiece tokenization.

Functions

convert_to_unicode(text)

Convert text to Unicode (if it's not already), assuming UTF-8 input.

load_vocab(vocab_file)

Load a vocabulary file into a dictionary.

convert_by_vocab(vocab, items)

Convert a sequence of tokens or ids using the vocab.

whitespace_tokenize(text)

Run basic whitespace cleaning and splitting on a piece of text.

Module Contents

neural_compressor.data.transforms.tokenization.convert_to_unicode(text)[source]

Convert text to Unicode (if it's not already), assuming UTF-8 input.
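
A minimal usage sketch; the byte string below is illustrative:

    from neural_compressor.data.transforms.tokenization import convert_to_unicode

    # Bytes are decoded as UTF-8; str input is returned unchanged.
    text = convert_to_unicode(b"hello world")
    assert text == "hello world"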

neural_compressor.data.transforms.tokenization.load_vocab(vocab_file)[source]

Load a vocabulary file into a dictionary.
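
A sketch assuming a plain-text vocabulary file with one token per line (the BERT convention); the path is a placeholder:

    from neural_compressor.data.transforms.tokenization import load_vocab

    # Tokens map to their line index in the file.
    vocab = load_vocab("vocab.txt")  # placeholder path
    print(len(vocab), vocab.get("[UNK]"))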

neural_compressor.data.transforms.tokenization.convert_by_vocab(vocab, items)[source]

Convert a sequence of tokens or ids using the vocab.
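
A sketch of mapping tokens to ids and back; the tiny hand-built vocab is illustrative:

    from neural_compressor.data.transforms.tokenization import convert_by_vocab

    vocab = {"hello": 0, "world": 1}                  # illustrative token-to-id map
    inv_vocab = {v: k for k, v in vocab.items()}      # id-to-token map

    ids = convert_by_vocab(vocab, ["hello", "world"])     # [0, 1]
    tokens = convert_by_vocab(inv_vocab, ids)             # ["hello", "world"]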

neural_compressor.data.transforms.tokenization.whitespace_tokenize(text)[source]

Run basic whitespace cleaning and splitting on a piece of text.
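
For example:

    from neural_compressor.data.transforms.tokenization import whitespace_tokenize

    # Leading/trailing whitespace is stripped, then the text is split on whitespace.
    whitespace_tokenize("  a quick\t test \n")   # ["a", "quick", "test"]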

class neural_compressor.data.transforms.tokenization.FullTokenizer(vocab_file, do_lower_case=True)[source]

Run end-to-end tokenization.
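
A usage sketch; the vocab path is a placeholder, and the tokenize/convert_tokens_to_ids methods assumed here follow the original BERT tokenizer interface that this class mirrors:

    from neural_compressor.data.transforms.tokenization import FullTokenizer

    tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)  # placeholder path
    tokens = tokenizer.tokenize("Hello, world!")       # basic + WordPiece tokenization
    ids = tokenizer.convert_tokens_to_ids(tokens)      # look up ids in the vocab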

class neural_compressor.data.transforms.tokenization.BasicTokenizer(do_lower_case=True)[source]

Run basic tokenization (punctuation splitting, lowercasing, etc.).
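
A usage sketch; the output shown assumes the BERT-style behavior of lowercasing and splitting punctuation into separate tokens:

    from neural_compressor.data.transforms.tokenization import BasicTokenizer

    tokenizer = BasicTokenizer(do_lower_case=True)
    tokenizer.tokenize("Hello, World!")   # ["hello", ",", "world", "!"]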

class neural_compressor.data.transforms.tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)[source]

Run WordPiece tokenization.
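
A sketch assuming the input has already been cleaned by BasicTokenizer (WordPiece expects whitespace-separated tokens) and using an illustrative toy vocab:

    from neural_compressor.data.transforms.tokenization import WordpieceTokenizer

    vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3}   # toy vocab
    tokenizer = WordpieceTokenizer(vocab=vocab, unk_token="[UNK]")
    tokenizer.tokenize("unaffable")   # ["un", "##aff", "##able"]
    tokenizer.tokenize("xyz")         # ["[UNK]"] -- not covered by the toy vocab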