MNIST Handwritten Digits

This example will show you how to train a model on the MNIST dataset and use the model for prediction via the DFFML CLI and HTTP API.

Download the files and verify them with sha384sum.

curl -sSLO "http://yann.lecun.com/exdb/mnist/{train-images-idx3,train-labels-idx1,t10k-images-idx3,t10k-labels-idx1}-ubyte.gz"
sha384sum -c - << EOF
1bf45877962fd391f7abb20534a30fd2203d0865309fec5f87d576dbdbefdcb16adb49220afc22a0f3478359d229449c  t10k-images-idx3-ubyte.gz
ccc1ee70f798a04e6bfeca56a4d0f0de8d8eeeca9f74641c1e1bfb00cf7cc4aa4d023f6ea1b40e79bb4707107845479d  t10k-labels-idx1-ubyte.gz
f40eb179f7c3d2637e789663bde56d444a23e4a0a14477a9e6ed88bc39c8ad6eaff68056c0cd9bb60daf0062b70dc8ee  train-images-idx3-ubyte.gz
ba9c11bf9a7f7c2c04127b8b3e568cf70dd3429d9029ca59b7650977a4ac32f8ff5041fe42bc872097487b06a6794e00  train-labels-idx1-ubyte.gz
EOF

t10k-images-idx3-ubyte.gz: OK
t10k-labels-idx1-ubyte.gz: OK
train-images-idx3-ubyte.gz: OK
train-labels-idx1-ubyte.gz: OK
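
The same kind of check can be done from Python, which may be useful where sha384sum is not available. This is a minimal sketch using only the standard library; the helper name sha384_of is hypothetical, and the demo hashes a small temporary file rather than the MNIST archives so that it runs on its own.

```python
import hashlib
import os
import tempfile


def sha384_of(path, chunk_size=1 << 20):
    """Return the hex SHA-384 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha384()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file so the snippet runs without the MNIST downloads.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example bytes")
    demo_path = tmp.name

demo_digest = sha384_of(demo_path)
os.remove(demo_path)
print(len(demo_digest))  # a SHA-384 hex digest is 96 characters long
```

To check a real download, compare the result of sha384_of("train-images-idx3-ubyte.gz") with the corresponding digest listed above.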

The model we’ll be using is part of dffml-model-tensorflow, a DFFML plugin that lets you use TensorFlow via DFFML. We’ll also be using the resize operation from dffml-operations-image, a DFFML plugin that lets you do image processing via DFFML. Both can be installed with pip.

$ pip install -U dffml-model-tensorflow dffml-operations-image

Create a dataflow config file which will be used by the DataFlowPreprocessSource to normalize the data just before feeding it to the model.

Using the dataflow create command, we generate this config file as a DataFlow consisting of two operations: multiply and associate_definition. We edit the DataFlow so that the multiplicand input of the multiply operation comes from a seed list containing only the image feature, and so that the input of the associate_definition operation comes from the output (product) of the multiply operation.

dffml dataflow create multiply associate_definition -configloader yaml \
    -inputs '{"image": "product"}'=associate_spec 0.00392156862=multiplier_def \
    -flow '[{"seed": ["image"]}]'=multiply.inputs.multiplicand |
    tee normalize.yaml
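
The multiplier 0.00392156862 is approximately 1/255, so the multiply step presumably scales 8-bit pixel values (0 to 255) into the range 0 to 1. The sketch below illustrates that arithmetic in plain Python. It is not how DFFML runs the DataFlow; normalize is a hypothetical helper and the sample pixel values are made up.

```python
# Assumption: 0.00392156862 is used as an approximation of 1/255, so that
# 8-bit pixel values (0..255) land in the range 0..1.
MULTIPLIER = 0.00392156862


def normalize(pixels, multiplier=MULTIPLIER):
    """Multiply every pixel value by `multiplier` (hypothetical helper)."""
    return [value * multiplier for value in pixels]


# Made-up pixel values covering the darkest, a middle, and the brightest level.
sample = [0, 128, 255]
scaled = normalize(sample)
print(scaled)
```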

Train the model.

dffml train \
    -model tfdnnc \
    -model-batchsize 1000 \
    -model-hidden 30 50 25 \
    -model-clstype int \
    -model-predict label:int:1 \
    -model-classifications $(seq 0 9) \
    -model-location tempdir \
    -model-features image:int:$((28 * 28)) \
    -sources images=dfpreprocess label=idx1 \
    -source-images-dataflow normalize.yaml \
    -source-images-features image:int:$((28 * 28)) \
    -source-images-source idx3 \
    -source-images-source-filename train-images-idx3-ubyte.gz \
    -source-images-source-feature image \
    -source-label-filename train-labels-idx1-ubyte.gz \
    -source-label-feature label \
    -log critical
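
The feature length image:int:$((28 * 28)) expands to 784 in the shell, one value per pixel of a 28 by 28 image. The sketch below shows where that shape comes from by reading an IDX3 header with the standard library. It assumes the commonly documented IDX layout (a big-endian magic number 0x00000803 followed by three 32-bit dimensions). read_idx3_header is a hypothetical helper, not part of DFFML, and the demo builds a tiny synthetic file instead of reading the real archive.

```python
import gzip
import io
import struct


def read_idx3_header(stream):
    """Return (count, rows, cols) from the 16-byte header of an IDX3 stream."""
    magic, count, rows, cols = struct.unpack(">IIII", stream.read(16))
    if magic != 0x00000803:
        raise ValueError(f"unexpected magic number: {magic:#010x}")
    return count, rows, cols


# Build a synthetic gzip-compressed IDX3 payload holding two blank 28x28 images.
raw = struct.pack(">IIII", 0x00000803, 2, 28, 28) + bytes(2 * 28 * 28)
compressed = gzip.compress(raw)

with gzip.open(io.BytesIO(compressed), "rb") as stream:
    count, rows, cols = read_idx3_header(stream)

print(count, rows * cols)  # 2 784
```

With the real download, gzip.open("train-images-idx3-ubyte.gz", "rb") could be passed in place of the synthetic stream.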

Assess the model’s accuracy.

dffml accuracy \
    -scorer clf \
    -model tfdnnc \
    -model-batchsize 1000 \
    -model-hidden 30 50 25 \
    -model-clstype int \
    -model-predict label:int:1 \
    -model-location tempdir \
    -model-classifications $(seq 0 9) \
    -model-features image:int:$((28 * 28)) \
    -features label:int:1 \
    -sources images=dfpreprocess label=idx1 \
    -source-images-dataflow normalize.yaml \
    -source-images-features image:int:$((28 * 28)) \
    -source-images-source idx3 \
    -source-images-source-filename t10k-images-idx3-ubyte.gz \
    -source-images-source-feature image \
    -source-label-filename t10k-labels-idx1-ubyte.gz \
    -source-label-feature label \
    -log critical

0.9591000080108643
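
As a rough illustration of how a classification accuracy of this kind is typically computed, the sketch below counts the fraction of predictions that match the true labels. This is an assumption about the general form of the metric, not a description of the clf scorer's internals, and the labels are made up for the example.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Hypothetical labels, not taken from the MNIST run above.
actual = [4, 5, 3, 2, 7]
predicted = [4, 5, 3, 2, 1]
print(accuracy(predicted, actual))  # 0.8
```

Under that reading, a score of about 0.9591 on the 10,000-image test set would correspond to roughly 9,591 correctly classified digits.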

Create an image.csv file which contains the names of the images (with their extension .png) to predict on.

cat > image.csv << EOF
key,image
four,image1.png
five,image2.png
three,image3.png
two,image4.png
EOF

Using the dataflow create command, we create a second config file as a DataFlow consisting of four operations: resize, flatten, multiply and associate_definition. We edit the DataFlow so that the src input of the resize operation comes from a seed list containing the image feature, the array input of the flatten operation comes from the output (result) of the resize operation, the multiplicand input of the multiply operation comes from the output (result) of the flatten operation, and the input of the associate_definition operation comes from the output (product) of the multiply operation.

dffml dataflow create resize flatten multiply associate_definition -configloader yaml \
    -inputs \
      '[28,28]'=resize.inputs.dsize \
      '{"image": "product"}'=associate_spec \
      0.00392156862=multiplier_def \
    -flow \
      '[{"seed": ["image"]}]'=resize.inputs.src \
      '[{"resize": "result"}]'=flatten.inputs.array \
      '[{"flatten": "result"}]'=multiply.inputs.multiplicand |
    tee resizenorm.yaml
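
The chain resize, flatten, multiply can be pictured with the plain-Python sketch below. It is only a conceptual stand-in: the nearest-neighbour resize here is an assumption chosen for simplicity and may differ from the interpolation used by the real resize operation, all three helper functions are hypothetical, and the 56 by 56 input is made up.

```python
def resize_nearest(image, new_rows, new_cols):
    """Nearest-neighbour resize of a 2-D list (a stand-in for the resize operation)."""
    rows, cols = len(image), len(image[0])
    return [
        [image[r * rows // new_rows][c * cols // new_cols] for c in range(new_cols)]
        for r in range(new_rows)
    ]


def flatten(image):
    """Turn a 2-D list of rows into one flat list, row by row."""
    return [value for row in image for value in row]


def multiply(values, multiplier):
    """Scale every value by `multiplier`."""
    return [value * multiplier for value in values]


# Hypothetical 56x56 grayscale input in which every pixel is 255.
source = [[255] * 56 for _ in range(56)]
features = multiply(flatten(resize_nearest(source, 28, 28)), 0.00392156862)
print(len(features))  # 784
```

The resulting list has 784 entries, matching the image:int:$((28 * 28)) feature the model was trained on.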

In this example, the image.csv file lists four images: image1.png, image2.png, image3.png and image4.png.

Predict with the trained model.

dffml predict all \
    -model tfdnnc \
    -model-batchsize 1000 \
    -model-hidden 30 50 25 \
    -model-clstype int \
    -model-predict label:int:1 \
    -model-location tempdir \
    -model-classifications $(seq 0 9) \
    -model-features image:int:$((28 * 28)) \
    -sources images=dfpreprocess \
    -source-images-source csv \
    -source-images-source-filename image.csv \
    -source-images-source-loadfiles image \
    -source-images-features image:int:$((28 * 28)) \
    -source-images-dataflow resizenorm.yaml \
    -log critical

Output

[
    {
        "extra": {},
        "features": {
            "image": [
                0.0,
                .
                .
                0.0
            ]
        },
        "key": "four",
        "last_updated": "2020-03-18T04:07:01Z",
        "prediction": {
            "label": {
                "confidence": 0.9977860450744629,
                "value": 4
            }
        }
    },
    {
        "extra": {},
        "features": {
            "image": [
                0.0,
                .
                .
                0.0
            ]
        },
        "key": "five",
        "last_updated": "2020-03-18T04:07:01Z",
        "prediction": {
            "label": {
                "confidence": 1.0,
                "value": 5
            }
        }
    },
    {
        "extra": {},
        "features": {
            "image": [
                0.0,
                .
                .
                0.0
            ]
        },
        "key": "three",
        "last_updated": "2020-03-18T04:07:01Z",
        "prediction": {
            "label": {
                "confidence": 0.9998736381530762,
                "value": 3
            }
        }
    },
    {
        "extra": {},
        "features": {
            "image": [
                0.0,
                .
                .
                0.0
            ]
        },
        "key": "two",
        "last_updated": "2020-03-18T04:07:01Z",
        "prediction": {
            "label": {
                "confidence": 1.0,
                "value": 2
            }
        }
    }
]
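
For scripting, a compact summary of each record's key, predicted value and confidence may be more convenient than the full output. The sketch below is one way to extract it with the standard library. The embedded JSON is a trimmed copy of the output above with the feature arrays left out; in practice the text would come from the command's actual output.

```python
import json

# Trimmed copy of the output above: feature arrays are omitted for brevity.
output = """
[
    {"key": "four", "prediction": {"label": {"confidence": 0.9977860450744629, "value": 4}}},
    {"key": "five", "prediction": {"label": {"confidence": 1.0, "value": 5}}},
    {"key": "three", "prediction": {"label": {"confidence": 0.9998736381530762, "value": 3}}},
    {"key": "two", "prediction": {"label": {"confidence": 1.0, "value": 2}}}
]
"""

records = json.loads(output)

# Map each record key to its (predicted value, confidence) pair.
summary = {
    record["key"]: (
        record["prediction"]["label"]["value"],
        record["prediction"]["label"]["confidence"],
    )
    for record in records
}

for key, (value, confidence) in summary.items():
    print(f"{key}: {value} ({confidence:.4f})")
```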