Using NLP Operations¶

These example will show you how to use DFFML operations to clean text data and train Tensorflow DNNClassifier model and Scikit Learn Naive Bayes Classifier model using DFFML cli.

Preprocessing data and training DNNClassifier model¶

DFFML offers several Models. For this example we will be using the tensorflow DNNClassifier model (tfdnnc) which is in the dffml-model-tensorflow package.

We will use two operations remove_stopwords and get_embedding. Internally, both of these operations use spacy functions.

To install DNNClassifier model and the above mentioned operations run:

$ python -m pip install -U dffml-model-tensorflow dffml-operations-nlp

Operation remove_stopwords cleans the text by removing most commanly used words which give the text little or no information eg. but, or, yet, it, is, am, etc. These words are called StopWords. Operation get_embedding maps the tokens in the text to their corresponding word-vectors. Here we will use embeddings from en_core_web_sm spacy model. You can use other models like en_core_web_md, en_core_web_lg for better results but these are bigger in size and may take a while to download.

Let’s first download the en_core_web_sm model.

$ python -m spacy download en_core_web_sm

Warning

Spacy and aiohttp don’t play nice together. If you have aiohttp installed you’re going to get weird messages about the version of the chardet package (pkg_resources.ContextualVersionConflict). To avoid this, run the download command and include aiohttp in the list of packages to download, since spacy is passing through to pip here.

$ python -m spacy download en_core_web_sm aiohttp

Create training data:

train_data.csv

sentence,sentiment
What a pleasant morning,1
Those were bad days,0
My puppy plays all day,1
Cats are evil,0

Now we will create a dataflow to describe how the text feature (sentence) will be processed.

$ dffml dataflow create get_single remove_stopwords get_embedding \
    -inputs \
      '["embedding"]'=get_single_spec \
      "en_core_web_sm"=spacy_model_name_def \
      "<PAD>"=pad_token_def 10=max_len_def \
    -flow \
      '[{"seed": ["sentence"]}]'=remove_stopwords.inputs.text \
      '[{"seed": ["spacy_model_name_def"]}]'=get_embedding.inputs.spacy_model \
      '[{"seed": ["pad_token_def"]}]'=get_embedding.inputs.pad_token \
      '[{"seed": ["max_len_def"]}]'=get_embedding.inputs.max_len \
      '[{"remove_stopwords": "result"}]'=get_embedding.inputs.text \
      '[{"remove_stopwords": "result"}]'=get_embedding.inputs.text | \
    tee nlp_ops_dataflow.json

Operation get_embedding takes pad_token as input (here <PAD>) to append to sentences of length smaller than max_len (here 10). A sentence which has length greater than max_len is truncated to have length equal to max_len.

To visualize the dataflow run:

$ dffml dataflow diagram -stage processing -- nlp_ops_dataflow.json

Copy and pasting the output of the above code into the mermaidjs live editor results in the graph.

We can now use this dataflow to preprocess the data and make it ready to be fed into model:

$ dffml train \
    -model tfdnnc \
    -model-batchsize 100 \
    -model-hidden 5 2 \
    -model-clstype int \
    -model-predict sentiment:int:1 \
    -model-classifications 0 1 \
    -model-location tempdir \
    -model-features embedding:float:[1,10,96] \
    -sources text=dfpreprocess \
    -source-text-dataflow nlp_ops_dataflow.json \
    -source-text-features sentence:str:1 \
    -source-text-source csv \
    -source-text-source-filename train_data.csv \
    -log debug

As shown in the above command, a single input feature to model (here embedding) is of shape (1, max_len, size_of_embedding). Here we have taken max_len as 10 and the embedding size of en_core_web_sm is 96. So the resulting size of one input feature is (1,10,96).

Assess accuracy:

$ dffml accuracy \
    -scorer clf \
    -model tfdnnc \
    -model-batchsize 100 \
    -model-hidden 5 2 \
    -model-clstype int \
    -model-predict sentiment:int:1 \
    -model-classifications 0 1 \
    -model-location tempdir \
    -model-features embedding:float:[1,10,96] \
    -sources text=dfpreprocess \
    -source-text-dataflow nlp_ops_dataflow.json \
    -source-text-features sentence:str:1 \
    -source-text-source csv \
    -source-text-source-filename train_data.csv \
    -log debug
0.5

Create test data:

test_data.csv

sentence
Cats play a lot

Make prediction on test data:

$ dffml predict all \
    -model tfdnnc \
    -model-batchsize 100 \
    -model-hidden 5 2 \
    -model-clstype int \
    -model-predict sentiment:int:1 \
    -model-classifications 0 1 \
    -model-location tempdir \
    -model-features embedding:float:[1,10,96] \
    -sources text=dfpreprocess \
    -source-text-dataflow nlp_ops_dataflow.json \
    -source-text-features sentence:str:1 \
    -source-text-source csv \
    -source-text-source-filename test_data.csv \
    -pretty

        Key:    0
                                                   Record Features
+------------------------------------------------------------------------------------------------------------------------------+
|            sentence           |                                       Cats play a lot                                        |
+------------------------------------------------------------------------------------------------------------------------------+
|           embedding           |                  (0.32292864, 4.358501, 3.2268033, 1.87990 ... (length:10)                   |
+------------------------------------------------------------------------------------------------------------------------------+

                                                        Prediction
+------------------------------------------------------------------------------------------------------------------------------+
|                                                          sentiment                                                           |
+------------------------------------------------------------------------------------------------------------------------------+
|           Value:  1           |                               Confidence:   0.5122595429420471                               |
+------------------------------------------------------------------------------------------------------------------------------+

Preprocessing data and training Naive Bayes Classifier model¶

Now we will see how to use traditional ML algorithm like Naive Bayes Classifier available in dffml-model-scikit (dffml_model_scikit) for classification.

Install the Naive Bayes Classifier by installing dffml-model-scikit

$ python -m pip install -U dffml-model-scikit

Create training data:

train_data.csv

sentence,sentiment
What a pleasant morning,1
Those were bad days,0
My puppy plays all day,1
Cats are evil,0

But before we feed the data to model we need to convert it to vectors of numeric values. Here we will use tfidf_vectorizer operation (tfidf_vectorizer) which is a wrapper around sklearn TfidfVectorizer.

The dataflow will be similar to the one used above but with a slight modification. We will add an extra operation collect_output (collect_output) which will collect all the records before forwarding them to next operation. This is to ensure that tfidf_vectorizer receives a list of sentence rather than a single sentence at a time. The matrix returned by tfidf_vectorizer will be passed to extract_array_from_matrix (extract_array_from_matrix) which will return the array corresponding to each sentence.

So, Let’s modify the dataflow to use our new operations.

$ dffml dataflow create \
    -inputs \
      '["extract_array_from_matrix.outputs.result"]'=get_single_spec \
      4=source_length \
    -flow \
      '[{"seed": ["sentence"]}]'=remove_stopwords.inputs.text \
      '[{"seed": ["source_length"]}]'=collect_output.inputs.length \
      '[{"remove_stopwords": "result"}]'=collect_output.inputs.sentence \
      '[{"collect_output": "all"}]'=tfidf_vectorizer.inputs.text \
      '[{"remove_stopwords": "result"}]'=extract_array_from_matrix.inputs.single_text_example \
      '[{"collect_output": "all"}]'=extract_array_from_matrix.inputs.collected_text \
      '[{"tfidf_vectorizer": "result"}]'=extract_array_from_matrix.inputs.input_matrix \
    -- \
      get_single \
      remove_stopwords \
      collect_output \
      extract_array_from_matrix \
      tfidf_vectorizer | \
    tee nlp_ops_dataflow.json

To visualize the dataflow run:

$ dffml dataflow diagram -stage processing -- nlp_ops_dataflow.json

We can now use this dataflow to preprocess the data and make it ready to be fed into model:

$ dffml train \
    -model scikitgnb \
    -model-features extract_array_from_matrix.outputs.result:float:1 \
    -model-predict sentiment:int:1 \
    -model-location tempdir \
    -sources text=old \
    -source-text-dataflow nlp_ops_dataflow.json \
    -source-text-features sentence:str:1 \
    -source-text-source csv \
    -source-text-source-filename train_data.csv \
    -log debug

Assess accuracy:

$ dffml accuracy \
    -scorer mse \
    -model scikitgnb \
    -model-features extract_array_from_matrix.outputs.result:float:1 \
    -model-predict sentiment:int:1 \
    -model-location tempdir \
    -sources text=dfpreprocess \
    -source-text-dataflow nlp_ops_dataflow.json \
    -source-text-features sentence:str:1 \
    -source-text-source csv \
    -source-text-source-filename train_data.csv \
    -log debug
1.0

Create test data:

test_data.csv

sentence
Such a pleasant morning
Those were good days
My cat plays all day
Dogs are evil

We’re going to make predictions on test data, but first we’ll use the merge command to pre-process the data. The merge command takes data from one source and puts it in another. We can take records from the preprocessing source and put them in the JSON source as an intermediary format.

Note

Processing of sentences occurs concurrently, resulting in seemingly randomized output order.

$ dffml merge text=dfpreprocess temp=json \
    -source-text-dataflow nlp_ops_dataflow.json \
    -source-text-features sentence:str:1 \
    -source-text-source csv \
    -source-text-source-filename test_data.csv \
    -source-temp-filename test_data_preprocessed.json \
    -source-temp-allowempty \
    -source-temp-readwrite \
    -log debug
$ cat test_data_preprocessed.json | python -m json.tool

Now we can make prediction on test data:

$ dffml predict all \
    -model scikitgnb \
    -model-features extract_array_from_matrix.outputs.result:float:1 \
    -model-predict sentiment:int:1 \
    -model-location tempdir \
    -sources temp=json \
    -source-temp-filename test_data_preprocessed.json \
    -pretty

        Key:        1
                                        Record Features
+------------------------------------------------------------------------------------------------+
|        sentence        |                          Those were good days                         |
+------------------------------------------------------------------------------------------------+
|extract_array_from_matri|             0.0, 0.0, 0.7071067811865476, 0 ... (length:9)            |
+------------------------------------------------------------------------------------------------+

                                        Prediction
+------------------------------------------------------------------------------------------------+
|                                           sentiment                                            |
+------------------------------------------------------------------------------------------------+
|       Value:  1        |                           Confidence:   1.0                           |
+------------------------------------------------------------------------------------------------+

    Key:    2
                                        Record Features
+------------------------------------------------------------------------------------------------+
|        sentence        |                          My cat plays all day                         |
+------------------------------------------------------------------------------------------------+
|extract_array_from_matri|             0.5773502691896257, 0.577350269 ... (length:9)            |
+------------------------------------------------------------------------------------------------+

                                        Prediction
+------------------------------------------------------------------------------------------------+
|                                           sentiment                                            |
+------------------------------------------------------------------------------------------------+
|       Value:  0        |                           Confidence:   1.0                           |
+------------------------------------------------------------------------------------------------+

    Key:    0
                                        Record Features
+------------------------------------------------------------------------------------------------+
|        sentence        |                        Such a pleasant morning                        |
+------------------------------------------------------------------------------------------------+
|extract_array_from_matri|             0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0 ... (length:9)            |
+------------------------------------------------------------------------------------------------+

                                        Prediction
+------------------------------------------------------------------------------------------------+
|                                           sentiment                                            |
+------------------------------------------------------------------------------------------------+
|       Value:  1        |                           Confidence:   1.0                           |
+------------------------------------------------------------------------------------------------+

    Key:    3
                                        Record Features
+------------------------------------------------------------------------------------------------+
|        sentence        |                             Dogs are evil                             |
+------------------------------------------------------------------------------------------------+
|extract_array_from_matri|             0.0, 0.0, 0.0, 0.70710678118654 ... (length:9)            |
+------------------------------------------------------------------------------------------------+

                                        Prediction
+------------------------------------------------------------------------------------------------+
|                                           sentiment                                            |
+------------------------------------------------------------------------------------------------+
|       Value:  0        |                           Confidence:   1.0                           |
+------------------------------------------------------------------------------------------------+