neural_compressor.data.transforms.tokenization¶
Tokenization helper classes.
Module Contents¶
Classes¶
FullTokenizer | Run end-to-end tokenization. |
BasicTokenizer | Run basic tokenization (punctuation splitting, lower casing, etc.). |
WordpieceTokenizer | Run WordPiece tokenization. |
Functions¶
convert_to_unicode | Convert text to Unicode (if it's not already), assuming utf-8 input. |
load_vocab | Load a vocabulary file into a dictionary. |
convert_by_vocab | Convert a sequence of [tokens|ids] using the vocab. |
whitespace_tokenize | Run basic whitespace cleaning and splitting on a piece of text. |
- neural_compressor.data.transforms.tokenization.convert_to_unicode(text)¶
Convert text to Unicode (if it’s not already), assuming utf-8 input.
- neural_compressor.data.transforms.tokenization.load_vocab(vocab_file)¶
Load a vocabulary file into a dictionary.
- neural_compressor.data.transforms.tokenization.convert_by_vocab(vocab, items)¶
Convert a sequence of [tokens|ids] using the vocab.
- neural_compressor.data.transforms.tokenization.whitespace_tokenize(text)¶
Run basic whitespace cleaning and splitting on a piece of text.
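A minimal sketch of how these helpers fit together is shown below. The vocabulary path is a placeholder, and the snippet assumes a BERT-style plain-text vocab file with one token per line.

```python
from neural_compressor.data.transforms.tokenization import (
    convert_by_vocab,
    convert_to_unicode,
    load_vocab,
    whitespace_tokenize,
)

# "vocab.txt" is a placeholder for a BERT-style vocabulary file (one token per line).
vocab = load_vocab("vocab.txt")            # token -> id mapping

text = convert_to_unicode(b"hello world")  # bytes or str in, unicode str out (utf-8 assumed)
tokens = whitespace_tokenize(text)         # ["hello", "world"]

# convert_by_vocab looks each item up in the given mapping, so it converts
# tokens to ids here; with the inverse mapping it would convert ids to tokens.
ids = convert_by_vocab(vocab, tokens)
```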
- class neural_compressor.data.transforms.tokenization.FullTokenizer(vocab_file, do_lower_case=True)¶
Bases: object
Run end-to-end tokenization.
- tokenize(text)¶
Tokenize text.
- convert_tokens_to_ids(tokens)¶
Convert tokens to ids.
- convert_ids_to_tokens(ids)¶
Convert ids to tokens.
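As a usage sketch (the vocabulary path is again a placeholder), FullTokenizer chains basic and WordPiece tokenization and exposes the vocabulary lookups in both directions:

```python
from neural_compressor.data.transforms.tokenization import FullTokenizer

# "vocab.txt" is a placeholder for a BERT-style vocabulary file.
tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("Intel Neural Compressor quantizes models.")
ids = tokenizer.convert_tokens_to_ids(tokens)

# Tokens produced by tokenize() are always in the vocabulary (unknown pieces
# become the UNK token), so converting the ids back recovers the same tokens.
assert tokenizer.convert_ids_to_tokens(ids) == tokens
```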
- class neural_compressor.data.transforms.tokenization.BasicTokenizer(do_lower_case=True)¶
Bases: object
Run basic tokenization (punctuation splitting, lower casing, etc.).
- tokenize(text)¶
Tokenizes a piece of text.
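For illustration, with the default do_lower_case=True the tokenizer lower-cases the text and splits punctuation into separate tokens (the expected output below follows from that description rather than from a captured run):

```python
from neural_compressor.data.transforms.tokenization import BasicTokenizer

basic = BasicTokenizer(do_lower_case=True)
tokens = basic.tokenize("Hello, World! Isn't NLP fun?")
# Expected: ['hello', ',', 'world', '!', 'isn', "'", 't', 'nlp', 'fun', '?']
```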
- class neural_compressor.data.transforms.tokenization.WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)¶
Bases: object
Run WordPiece tokenization.
- tokenize(text)¶
Tokenize a piece of text into its word pieces.
This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary. For example (see the sketch after this entry):
input = "unaffable"
output = ["un", "##aff", "##able"]
- Parameters:
text – A single token or whitespace-separated tokens. This should have already been passed through BasicTokenizer.
- Returns:
A list of wordpiece tokens.
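To make the greedy longest-match-first strategy concrete, the following self-contained sketch reproduces the "unaffable" example with a tiny hand-built vocabulary. It illustrates the algorithm rather than the class implementation itself, and the max_input_chars_per_word guard is omitted for brevity.

```python
def greedy_wordpiece(token, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single token into word pieces."""
    pieces, start = [], 0
    while start < len(token):
        end, match = len(token), None
        # Try the longest remaining substring first, shrinking from the right.
        while start < end:
            piece = token[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched: treat the whole token as unknown
        pieces.append(match)
        start = end
    return pieces

# Tiny hand-picked vocabulary; a real BERT vocabulary has ~30k entries.
vocab = {"un", "##aff", "##able", "[UNK]"}
print(greedy_wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
```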