neural_compressor.data.transforms.tokenization
Tokenization helper classes.
Classes

| Class | Description |
| --- | --- |
| `FullTokenizer` | Run end-to-end tokenization. |
| `BasicTokenizer` | Run basic tokenization (punctuation splitting, lower casing, etc.). |
| `WordpieceTokenizer` | Run WordPiece tokenization. |
Functions

| Function | Description |
| --- | --- |
| `convert_to_unicode(text)` | Convert text to Unicode (if it's not already), assuming UTF-8 input. |
| `load_vocab(vocab_file)` | Load a vocabulary file into a dictionary. |
| `convert_by_vocab(vocab, items)` | Convert a sequence of tokens or ids using the vocab. |
| `whitespace_tokenize(text)` | Run basic whitespace cleaning and splitting on a piece of text. |
Module Contents
- neural_compressor.data.transforms.tokenization.convert_to_unicode(text)[source]
Convert text to Unicode (if it’s not already), assuming utf-8 input.
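A minimal usage sketch, assuming the usual BERT-style behavior this module derives from: `str` input passes through unchanged, while `bytes` input is decoded as UTF-8.

```python
from neural_compressor.data.transforms.tokenization import convert_to_unicode

# str input is returned as-is; bytes are assumed to be UTF-8 encoded.
assert convert_to_unicode("héllo") == "héllo"
assert convert_to_unicode("héllo".encode("utf-8")) == "héllo"
```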
- neural_compressor.data.transforms.tokenization.load_vocab(vocab_file)[source]
Load a vocabulary file into a dictionary.
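A short sketch of the expected input format: one token per line, with each token mapped to its line index in the returned dictionary. The temporary file created here is illustrative, not part of the API.

```python
import tempfile

from neural_compressor.data.transforms.tokenization import load_vocab

# Write a toy vocab file: one token per line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("[PAD]\n[UNK]\nhello\nworld\n")
    vocab_path = f.name

vocab = load_vocab(vocab_path)
print(vocab["hello"])  # 2 -- each token maps to its line number
```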
- neural_compressor.data.transforms.tokenization.convert_by_vocab(vocab, items)[source]
Convert a sequence of [tokens|ids] using the vocab.
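The conversion is direction-agnostic: pass a token-to-id mapping to get ids, or an id-to-token mapping to get tokens back. A minimal sketch with a toy vocab:

```python
from neural_compressor.data.transforms.tokenization import convert_by_vocab

vocab = {"[UNK]": 0, "hello": 1, "world": 2}
inv_vocab = {v: k for k, v in vocab.items()}

print(convert_by_vocab(vocab, ["hello", "world"]))  # [1, 2]
print(convert_by_vocab(inv_vocab, [1, 2]))          # ['hello', 'world']
```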
- neural_compressor.data.transforms.tokenization.whitespace_tokenize(text)[source]
Run basic whitespace cleaning and splitting on a piece of text.
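Roughly equivalent to `text.strip().split()`; a quick sketch:

```python
from neural_compressor.data.transforms.tokenization import whitespace_tokenize

print(whitespace_tokenize("  hello   world \n"))  # ['hello', 'world']
print(whitespace_tokenize(""))                    # []
```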
- class neural_compressor.data.transforms.tokenization.FullTokenizer(vocab_file, do_lower_case=True)[source]
Run end-to-end tokenization.
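An end-to-end sketch: build the tokenizer from a vocabulary file, split text into WordPiece tokens, then map tokens to ids. The toy vocabulary is illustrative, and the `tokenize`/`convert_tokens_to_ids` methods are assumed to mirror the original BERT tokenizer this module is based on.

```python
from neural_compressor.data.transforms.tokenization import FullTokenizer

# Toy vocab file for illustration; a real WordPiece vocabulary would be
# much larger (e.g. BERT's ~30k-entry vocab.txt).
with open("toy_vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "hello", ",", "world", "!"]))

tokenizer = FullTokenizer(vocab_file="toy_vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("Hello, world!")   # ['hello', ',', 'world', '!']
ids = tokenizer.convert_tokens_to_ids(tokens)  # [2, 3, 4, 5]
```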