:py:mod:`neural_compressor.data.transforms.tokenization`
=========================================================

.. py:module:: neural_compressor.data.transforms.tokenization

.. autoapi-nested-parse::

   Tokenization helper classes.


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   neural_compressor.data.transforms.tokenization.FullTokenizer
   neural_compressor.data.transforms.tokenization.BasicTokenizer
   neural_compressor.data.transforms.tokenization.WordpieceTokenizer


Functions
~~~~~~~~~

.. autoapisummary::

   neural_compressor.data.transforms.tokenization.convert_to_unicode
   neural_compressor.data.transforms.tokenization.load_vocab
   neural_compressor.data.transforms.tokenization.convert_by_vocab
   neural_compressor.data.transforms.tokenization.whitespace_tokenize


.. py:function:: convert_to_unicode(text)

   Convert `text` to Unicode (if it's not already), assuming UTF-8 input.


.. py:function:: load_vocab(vocab_file)

   Load a vocabulary file into a dictionary.


.. py:function:: convert_by_vocab(vocab, items)

   Convert a sequence of [tokens|ids] using the vocab.


.. py:function:: whitespace_tokenize(text)

   Run basic whitespace cleaning and splitting on a piece of text.


.. py:class:: FullTokenizer(vocab_file, do_lower_case=True)

   Bases: :py:obj:`object`

   Run end-to-end tokenization.

   .. py:method:: tokenize(text)

      Tokenize text.

   .. py:method:: convert_tokens_to_ids(tokens)

      Convert tokens to ids.

   .. py:method:: convert_ids_to_tokens(ids)

      Convert ids to tokens.


.. py:class:: BasicTokenizer(do_lower_case=True)

   Bases: :py:obj:`object`

   Run basic tokenization (punctuation splitting, lower casing, etc.).

   .. py:method:: tokenize(text)

      Tokenize a piece of text.


.. py:class:: WordpieceTokenizer(vocab, unk_token='[UNK]', max_input_chars_per_word=200)

   Bases: :py:obj:`object`

   Run WordPiece tokenization.

   .. py:method:: tokenize(text)

      Tokenize a piece of text into its word pieces.

      This uses a greedy longest-match-first algorithm to perform
      tokenization using the given vocabulary.

      For example:
        input = "unaffable"
        output = ["un", "##aff", "##able"]

      :param text: A single token or whitespace separated tokens. This should have
                   already been passed through `BasicTokenizer`.

      :returns: A list of wordpiece tokens.
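
The sketch below shows how the classes documented above fit together: ``FullTokenizer``
first applies ``BasicTokenizer`` (punctuation splitting, optional lower casing) and then
``WordpieceTokenizer`` (greedy longest-match-first splitting against the vocabulary).
It is an illustrative example only, not part of the module; the ``vocab.txt`` file name
and the assumption of a BERT-style vocabulary (one token per line) are not stated in this
reference.

.. code-block:: python

   # Minimal usage sketch. Assumes a BERT-style vocabulary file, "vocab.txt",
   # with one token per line; the file name and contents are illustrative only.
   from neural_compressor.data.transforms.tokenization import FullTokenizer

   tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

   # BasicTokenizer splits on whitespace/punctuation, then WordpieceTokenizer
   # applies greedy longest-match-first splitting, e.g. "unaffable" ->
   # ["un", "##aff", "##able"] when those pieces are present in the vocabulary.
   tokens = tokenizer.tokenize("The cat was unaffable.")
   ids = tokenizer.convert_tokens_to_ids(tokens)
   print(tokens, ids)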