neural_compressor.data.transforms.tokenization

Tokenization helper classes.

Module Contents

Classes

FullTokenizer

Run end-to-end tokenization.

BasicTokenizer

Run basic tokenization (punctuation splitting, lower casing, etc.).

WordpieceTokenizer

Run WordPiece tokenization.

Functions

convert_to_unicode(text)

Convert text to Unicode (if it's not already), assuming UTF-8 input.

load_vocab(vocab_file)

Load a vocabulary file into a dictionary.

convert_by_vocab(vocab, items)

Convert a sequence of tokens or ids using the vocab.

whitespace_tokenize(text)

Run basic whitespace cleaning and splitting on a piece of text.

neural_compressor.data.transforms.tokenization.convert_to_unicode(text)

Convert text to Unicode (if it's not already), assuming UTF-8 input.
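
A minimal usage sketch; assuming the usual BERT-style behavior, a str passes through unchanged and UTF-8 bytes are decoded:

    from neural_compressor.data.transforms.tokenization import convert_to_unicode

    # str input is returned as-is; bytes input is decoded as UTF-8.
    convert_to_unicode("hello")         # -> "hello"
    convert_to_unicode(b"caf\xc3\xa9")  # -> "café"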

neural_compressor.data.transforms.tokenization.load_vocab(vocab_file)

Load a vocabulary file into a dictionary.
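
A usage sketch, assuming a BERT-style vocabulary file with one token per line; "vocab.txt" is a placeholder path, and each token is mapped to an integer id:

    from neural_compressor.data.transforms.tokenization import load_vocab

    # "vocab.txt" is a hypothetical path to a one-token-per-line vocabulary file.
    vocab = load_vocab("vocab.txt")
    # vocab maps token strings to integer ids, e.g. vocab.get("[UNK]")
    print(len(vocab), vocab.get("[UNK]"))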

neural_compressor.data.transforms.tokenization.convert_by_vocab(vocab, items)

Convert a sequence of tokens or ids using the vocab.
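
A sketch of the lookup behavior; because each item is simply mapped through the given dictionary, the same helper serves both directions when an inverted vocab is supplied ("vocab.txt" is a placeholder path):

    from neural_compressor.data.transforms.tokenization import convert_by_vocab, load_vocab

    vocab = load_vocab("vocab.txt")                           # token -> id
    ids = convert_by_vocab(vocab, ["un", "##aff", "##able"])  # tokens -> ids
    inv_vocab = {i: tok for tok, i in vocab.items()}          # id -> token
    tokens = convert_by_vocab(inv_vocab, ids)                 # ids -> tokens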

neural_compressor.data.transforms.tokenization.whitespace_tokenize(text)

Run basic whitespace cleaning and splitting on a piece of text.
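
A usage sketch; the text is stripped and split on runs of whitespace, and empty input yields an empty list:

    from neural_compressor.data.transforms.tokenization import whitespace_tokenize

    whitespace_tokenize("  hello   world \n")  # -> ["hello", "world"]
    whitespace_tokenize("")                    # -> []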

class neural_compressor.data.transforms.tokenization.FullTokenizer(vocab_file, do_lower_case=True)

Bases: object

Run end-to-end tokenization.

tokenize(text)

Tokenize text.

convert_tokens_to_ids(tokens)

Convert tokens to ids.

convert_ids_to_tokens(ids)

Convert ids to tokens.
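
An end-to-end usage sketch; "vocab.txt" is a placeholder path to a BERT-style vocabulary file, and the exact word pieces produced depend on that vocabulary:

    from neural_compressor.data.transforms.tokenization import FullTokenizer

    tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)
    tokens = tokenizer.tokenize("Unaffable weather!")   # basic + WordPiece split
    ids = tokenizer.convert_tokens_to_ids(tokens)
    back = tokenizer.convert_ids_to_tokens(ids)         # round-trips to the token list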

class neural_compressor.data.transforms.tokenization.BasicTokenizer(do_lower_case=True)

Bases: object

Run basic tokenization (punctuation splitting, lower casing, etc.).

tokenize(text)

Tokenize a piece of text.
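
A usage sketch; with do_lower_case=True the text is lower-cased and punctuation is split into separate tokens (the output shown is illustrative):

    from neural_compressor.data.transforms.tokenization import BasicTokenizer

    basic = BasicTokenizer(do_lower_case=True)
    basic.tokenize("Hello, world!")  # e.g. ["hello", ",", "world", "!"]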

class neural_compressor.data.transforms.tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)

Bases: object

Run WordPiece tokenization.

tokenize(text)

Tokenize a piece of text into its word pieces.

This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. For example:

input = "unaffable"
output = ["un", "##aff", "##able"]

Parameters:

text – A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.

Returns:

A list of wordpiece tokens.
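
To make the greedy longest-match-first idea concrete, here is a simplified, self-contained sketch for a single word. It is not the library's implementation (which also handles whitespace-separated input and the max_input_chars_per_word limit), but it reproduces the example above:

    def greedy_wordpiece(word, vocab, unk_token="[UNK]"):
        """Split one word into word pieces, longest vocabulary match first."""
        pieces, start = [], 0
        while start < len(word):
            end, cur = len(word), None
            # Shrink the candidate substring until it appears in the vocab.
            while start < end:
                sub = word[start:end]
                if start > 0:
                    sub = "##" + sub       # continuation pieces carry the "##" prefix
                if sub in vocab:
                    cur = sub
                    break
                end -= 1
            if cur is None:
                return [unk_token]         # nothing matched: emit the unknown token
            pieces.append(cur)
            start = end
        return pieces

    print(greedy_wordpiece("unaffable", {"un", "##aff", "##able"}))
    # -> ['un', '##aff', '##able']

In the module itself, the equivalent call would be WordpieceTokenizer(vocab).tokenize("unaffable"), with vocab loaded via load_vocab.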