MNIST Handwritten Digits
This example will show you how to train a model on the MNIST dataset and use the model for prediction via the DFFML CLI and HTTP API.
Download the files and verify them with sha384sum.
curl -sSLO "http://yann.lecun.com/exdb/mnist/{train-images-idx3,train-labels-idx1,t10k-images-idx3,t10k-labels-idx1}-ubyte.gz"
sha384sum -c - << EOF
1bf45877962fd391f7abb20534a30fd2203d0865309fec5f87d576dbdbefdcb16adb49220afc22a0f3478359d229449c t10k-images-idx3-ubyte.gz
ccc1ee70f798a04e6bfeca56a4d0f0de8d8eeeca9f74641c1e1bfb00cf7cc4aa4d023f6ea1b40e79bb4707107845479d t10k-labels-idx1-ubyte.gz
f40eb179f7c3d2637e789663bde56d444a23e4a0a14477a9e6ed88bc39c8ad6eaff68056c0cd9bb60daf0062b70dc8ee train-images-idx3-ubyte.gz
ba9c11bf9a7f7c2c04127b8b3e568cf70dd3429d9029ca59b7650977a4ac32f8ff5041fe42bc872097487b06a6794e00 train-labels-idx1-ubyte.gz
EOF
t10k-images-idx3-ubyte.gz: OK
t10k-labels-idx1-ubyte.gz: OK
train-images-idx3-ubyte.gz: OK
train-labels-idx1-ubyte.gz: OK
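For readers unfamiliar with sha384sum -c, the following sketch runs the same kind of check on a throwaway file: a matching digest prints OK and exits with status 0, while a modified file makes the command exit non-zero. The names sample.txt and sample.sha384 are made up for illustration and are not part of the MNIST download.

```shell
# Work in a scratch directory so nothing in the tutorial directory is touched.
scratch=$(mktemp -d)

# sample.txt and sample.sha384 are hypothetical names used only for this demo.
printf 'hello\n' > "$scratch/sample.txt"
(cd "$scratch" && sha384sum sample.txt > sample.sha384)

# A matching digest: prints "sample.txt: OK" and returns exit status 0.
ok_status=0
(cd "$scratch" && sha384sum -c sample.sha384) || ok_status=$?

# Change the file and check again: the digest no longer matches,
# so sha384sum reports FAILED and returns a non-zero exit status.
printf 'tampered\n' > "$scratch/sample.txt"
bad_status=0
(cd "$scratch" && sha384sum -c sample.sha384 2>/dev/null) || bad_status=$?

echo "unchanged file: exit $ok_status, modified file: exit $bad_status"
rm -rf "$scratch"
```

If any of the four archives reports FAILED instead of OK, download it again before continuing.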
The model we’ll be using is part of dffml-model-tensorflow, a DFFML plugin that lets you use TensorFlow via DFFML.
We’ll also be using the resize operation from dffml-operations-image, a DFFML plugin that lets you do image processing via DFFML.
We can install them with pip.
pip install -U dffml-model-tensorflow dffml-operations-image
Create a dataflow config file which will be used by the DataFlowPreprocessSource to normalize the data just before feeding it to the model.
Using the dataflow create command, we create a DataFlow consisting of 2 operations: multiply and associate_definition. We edit the DataFlow so that the multiplicand input of the multiply operation comes from a seed list containing only the image feature, and so that the input of the associate_definition operation comes from the output (product) of the multiply operation.
dffml dataflow create multiply associate_definition -configloader yaml \
-inputs '{"image": "product"}'=associate_spec 0.00392156862=multiplier_def \
-flow '[{"seed": ["image"]}]'=multiply.inputs.multiplicand |
tee normalize.yaml
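The multiplier 0.00392156862 is approximately 1/255, so the multiply operation rescales each 8-bit pixel value from the range 0 to 255 into the range 0 to 1. A small sketch of that arithmetic follows; awk is used purely for illustration, since DFFML performs the multiplication itself inside the dataflow.

```shell
# The multiplier passed to the dataflow above; roughly 1/255.
multiplier=0.00392156862

# Scale a few sample 8-bit pixel values by the same factor.
normalized=$(
  for pixel in 0 128 255; do
    awk -v p="$pixel" -v m="$multiplier" 'BEGIN { printf "%d -> %.4f\n", p, p * m }'
  done
)
echo "$normalized"
# 0 -> 0.0000
# 128 -> 0.5020
# 255 -> 1.0000
```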
Train the model.
dffml train \
-model tfdnnc \
-model-batchsize 1000 \
-model-hidden 30 50 25 \
-model-clstype int \
-model-predict label:int:1 \
-model-classifications $(seq 0 9) \
-model-location tempdir \
-model-features image:int:$((28 * 28)) \
-sources images=dfpreprocess label=idx1 \
-source-images-dataflow normalize.yaml \
-source-images-features image:int:$((28 * 28)) \
-source-images-source idx3 \
-source-images-source-filename train-images-idx3-ubyte.gz \
-source-images-source-feature image \
-source-label-filename train-labels-idx1-ubyte.gz \
-source-label-feature label \
-log critical
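Two of the arguments above rely on shell expansion before dffml ever sees them. This short sketch shows what the shell substitutes, so the values can be written out literally if your shell does not support these expansions.

```shell
# $(seq 0 9) expands to the ten class labels, one per digit.
classifications=$(echo $(seq 0 9))
echo "-model-classifications $classifications"
# -model-classifications 0 1 2 3 4 5 6 7 8 9

# $((28 * 28)) is arithmetic expansion: a 28x28 image holds 784 pixel values.
feature="image:int:$((28 * 28))"
echo "-model-features $feature"
# -model-features image:int:784
```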
Assess the model’s accuracy.
dffml accuracy \
-scorer clf \
-model tfdnnc \
-model-batchsize 1000 \
-model-hidden 30 50 25 \
-model-clstype int \
-model-predict label:int:1 \
-model-location tempdir \
-model-classifications $(seq 0 9) \
-model-features image:int:$((28 * 28)) \
-features label:int:1 \
-sources images=dfpreprocess label=idx1 \
-source-images-dataflow normalize.yaml \
-source-images-features image:int:$((28 * 28)) \
-source-images-source idx3 \
-source-images-source-filename t10k-images-idx3-ubyte.gz \
-source-images-source-feature image \
-source-label-filename t10k-labels-idx1-ubyte.gz \
-source-label-feature label \
-log critical
0.9591000080108643
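Assuming the clf scorer reports the fraction of correctly classified records (an interpretation, not something stated above), the figure can be turned into counts for the 10,000 images in the MNIST test set:

```shell
accuracy=0.9591000080108643
test_images=10000   # number of images in the MNIST t10k test set

summary=$(awk -v a="$accuracy" -v n="$test_images" 'BEGIN {
  correct = int(a * n + 0.5)   # round to the nearest whole image
  printf "%d correct, %d misclassified (%.2f%%)", correct, n - correct, a * 100
}')
echo "$summary"
# 9591 correct, 409 misclassified (95.91%)
```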
Create an image.csv file which contains the names of the images (with their extension .png) to predict on.
cat > image.csv << EOF
key,image
four,image1.png
five,image2.png
three,image3.png
two,image4.png
EOF
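Before predicting, it can help to confirm that every file named in image.csv actually exists. The sketch below is a minimal check, run in a scratch directory with one empty placeholder file so that it is self-contained. It assumes the image paths are relative to the directory the check is run from; in practice you would run the loop next to your own image.csv.

```shell
scratch=$(mktemp -d)

# Recreate the CSV from above. Only image1.png is created (as an empty
# placeholder) so the check has something to report.
cat > "$scratch/image.csv" << EOF
key,image
four,image1.png
five,image2.png
three,image3.png
two,image4.png
EOF
: > "$scratch/image1.png"

# Skip the header row, then test each listed path.
missing=$(
  tail -n +2 "$scratch/image.csv" | while IFS=, read -r key image; do
    [ -f "$scratch/$image" ] || echo "missing: $image (key: $key)"
  done
)
echo "$missing"
# missing: image2.png (key: five)
# missing: image3.png (key: three)
# missing: image4.png (key: two)
rm -rf "$scratch"
```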
Using the dataflow create command, we create a DataFlow consisting of 4 operations: resize, flatten, multiply and associate_definition. We edit the DataFlow so that the src input of the resize operation comes from a seed list containing the image feature, the array input of the flatten operation comes from the output (result) of the resize operation, the multiplicand input of the multiply operation comes from the output (result) of the flatten operation, and the input of the associate_definition operation comes from the output (product) of the multiply operation.
dffml dataflow create resize flatten multiply associate_definition -configloader yaml \
-inputs \
'[28,28]'=resize.inputs.dsize \
'{"image": "product"}'=associate_spec \
0.00392156862=multiplier_def \
-flow \
'[{"seed": ["image"]}]'=resize.inputs.src \
'[{"resize": "result"}]'=flatten.inputs.array \
'[{"flatten": "result"}]'=multiply.inputs.multiplicand |
tee resizenorm.yaml
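In this example, the image.csv file contains the names of four images, each keyed by the digit it shows: image1.png (four), image2.png (five), image3.png (three) and image4.png (two).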
Predict with the trained model.
dffml predict all \
-model tfdnnc \
-model-batchsize 1000 \
-model-hidden 30 50 25 \
-model-clstype int \
-model-predict label:int:1 \
-model-location tempdir \
-model-classifications $(seq 0 9) \
-model-features image:int:$((28 * 28)) \
-sources images=dfpreprocess \
-source-images-source csv \
-source-images-source-filename image.csv \
-source-images-source-loadfiles image \
-source-images-features image:int:$((28 * 28)) \
-source-images-dataflow resizenorm.yaml \
-log critical
Output
[
    {
        "extra": {},
        "features": {
            "image": [
                0.0,
                .
                .
                0.0
            ]
        },
        "key": "four",
        "last_updated": "2020-03-18T04:07:01Z",
        "prediction": {
            "label": {
                "confidence": 0.9977860450744629,
                "value": 4
            }
        }
    },
    {
        "extra": {},
        "features": {
            "image": [
                0.0,
                .
                .
                0.0
            ]
        },
        "key": "five",
        "last_updated": "2020-03-18T04:07:01Z",
        "prediction": {
            "label": {
                "confidence": 1.0,
                "value": 5
            }
        }
    },
    {
        "extra": {},
        "features": {
            "image": [
                0.0,
                .
                .
                0.0
            ]
        },
        "key": "three",
        "last_updated": "2020-03-18T04:07:01Z",
        "prediction": {
            "label": {
                "confidence": 0.9998736381530762,
                "value": 3
            }
        }
    },
    {
        "extra": {},
        "features": {
            "image": [
                0.0,
                .
                .
                0.0
            ]
        },
        "key": "two",
        "last_updated": "2020-03-18T04:07:01Z",
        "prediction": {
            "label": {
                "confidence": 1.0,
                "value": 2
            }
        }
    }
]
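To read the result at a glance, the key, predicted value and confidence can be pulled out of the saved output. The sketch below uses awk line matching on a trimmed copy of two records from the output above (features omitted; predictions.json is a hypothetical file name). This is fragile text matching that relies on the pretty-printed layout, so a real JSON parser such as jq would be more robust for anything beyond a quick look.

```shell
scratch=$(mktemp -d)

# Trimmed copy of two records from the output above.
cat > "$scratch/predictions.json" << 'EOF'
[
    {
        "key": "four",
        "prediction": {
            "label": {
                "confidence": 0.9977860450744629,
                "value": 4
            }
        }
    },
    {
        "key": "five",
        "prediction": {
            "label": {
                "confidence": 1.0,
                "value": 5
            }
        }
    }
]
EOF

# Remember the key and confidence, then print one line per "value" entry.
summary=$(awk -F': ' '
  /"key"/        { gsub(/[",]/, "", $2); key = $2 }
  /"confidence"/ { gsub(/,/, "", $2); conf = $2 }
  /"value"/      { print key, "->", $2, "(confidence " conf ")" }
' "$scratch/predictions.json")
echo "$summary"
# four -> 4 (confidence 0.9977860450744629)
# five -> 5 (confidence 1.0)
rm -rf "$scratch"
```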