Packaging a Model¶
In the previous tutorial we created a DFFML style model, which we could use from the command line, HTTP service, etc. We’re now going to take that model and package it so that it can be published to PyPI for others to download via pip and use as they do the rest of the Models plugins.
Create the Package¶
To create a new model we first create a new Python package. DFFML has a helper to create it for you.
The helper creates a model which does Simple Linear Regression (SLR), meaning it finds the best fit line for a dataset. If you’ve done the Writing a Model tutorial, then the same code from that tutorial’s myslr.py will be present in dffml_model_myslr/myslr.py (though not your modifications, if you made any).
You may know the best fit line as y = m * x + b
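For illustration, here’s a minimal sketch of how the slope m and intercept b of the best fit line can be computed with ordinary least squares (illustrative only; the generated dffml_model_myslr/myslr.py contains the actual implementation):
def best_fit_line(points):
    # points is a list of (x, y) pairs
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Ordinary least squares: m = cov(x, y) / var(x)
    m = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
        (x - mean_x) ** 2 for x, _ in points
    )
    b = mean_y - m * mean_x
    return m, b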
$ dffml service dev create model dffml-model-myslr
Then enter the directory of the package we just created
$ cd dffml-model-myslr
Install the Package¶
If you’re planning on importing any third party packages, anything on PyPI, you’ll want to add them to the setup.cfg file first, under the install_requires section.
setup.cfg
install_requires =
    scikit-learn>=0.21.2
Any time you modify the dependencies of a package you should re-install it so the new dependencies get installed as well.
pip’s -e flag tells it we’re installing our package in development mode, which means any time Python imports our package, it’s going to use the version that we’re working on here. If you don’t pass -e, then any time you make changes in this directory, they won’t take effect until you reinstall the package.
$ python -m pip install -e .[dev]
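To quickly confirm the editable install worked, you can try importing the package (a hypothetical sanity check, not part of the generated tests):
$ python -c "import dffml_model_myslr"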
Testing¶
Packages should have tests, to make sure that your changes don’t break anything before you release your package to users.
Imports¶
Let’s import everything we’ll need. Since DFFML heavily leverages async code, we’ll have our test cases derive from AsyncTestCase, rather than unittest.TestCase.
tests/test_model.py
import tempfile
from dffml import train, score, predict, Feature, Features, AsyncTestCase
from dffml.accuracy import MeanSquaredErrorAccuracy
from dffml_model_myslr.myslr import MySLRModel
Test data¶
We usually try to randomly generate training and test data, but for this example we’re just going to hard code some data (see the sketch after these lists for one way to generate it randomly).
tests/test_model.py
TRAIN_DATA = [
    [12.4, 11.2],
    [14.3, 12.5],
    [14.5, 12.7],
    [14.9, 13.1],
    [16.1, 14.1],
    [16.9, 14.8],
    [16.5, 14.4],
    [15.4, 13.4],
    [17.0, 14.9],
    [17.9, 15.6],
    [18.8, 16.4],
    [20.3, 17.7],
    [22.4, 19.6],
    [19.4, 16.9],
    [15.5, 14.0],
    [16.7, 14.6],
]
TEST_DATA = [
    [17.3, 15.1],
    [18.4, 16.1],
    [19.2, 16.8],
    [17.4, 15.2],
    [19.5, 17.0],
    [19.7, 17.2],
    [21.2, 18.6],
]
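If you’d rather generate data than hard code it, one approach is to sample X values and derive Y from a known line plus noise. A sketch (the line parameters and noise level here are arbitrary choices, not from the tutorial):
import random

def generate_data(n, m=0.9, b=0.1, noise=0.2):
    # Produce n [x, y] pairs along y = m * x + b, with uniform noise on y
    return [
        [x, m * x + b + random.uniform(-noise, noise)]
        for x in (random.uniform(10.0, 25.0) for _ in range(n))
    ]

TRAIN_DATA = generate_data(16)
TEST_DATA = generate_data(7)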
TestCase Class¶
We create a temporary directory for our tests to use, and clean it up when they’re done.
tests/test_model.py
class TestMySLRModel(AsyncTestCase):
    @classmethod
    def setUpClass(cls):
        # Create a temporary directory to store the trained model
        cls.model_dir = tempfile.TemporaryDirectory()
        # Create an instance of the model
        cls.model = MySLRModel(
            features=Features(Feature("X", float, 1)),
            predict=Feature("Y", float, 1),
            location=cls.model_dir.name,
        )
        # Create an instance of the scorer used in the accuracy test
        cls.scorer = MeanSquaredErrorAccuracy()

    @classmethod
    def tearDownClass(cls):
        # Remove the temporary directory where the model was stored to cleanup
        cls.model_dir.cleanup()
Testing Train¶
Similarly to the quickstart, all we need to do is pass the model and training data to the train function.
The tests are prefixed with numbers to indicate what order they should be run in, ensuring that the accuracy and predict tests always have a trained model to work with.
We’re using the * operator here to expand the list of X, Y pair dicts into positional arguments. See the official Python documentation on Unpacking Argument Lists for more information on how the * operator works; a small example follows.
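For instance (a standalone illustration, not part of the test file):
def show(*args):
    # args receives all positional arguments as a tuple
    print(args)

records = [{"X": 1, "Y": 2}, {"X": 3, "Y": 4}]
# Expanding the list with * is equivalent to calling
# show({"X": 1, "Y": 2}, {"X": 3, "Y": 4})
show(*records)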
tests/test_model.py
    async def test_00_train(self):
        # Train the model on the training data
        await train(self.model, *[{"X": x, "Y": y} for x, y in TRAIN_DATA])
Testing Accuracy¶
Once again, all we need to do is pass the model, scorer, and test data to the score function. Then we check that the result is within an acceptable range. This test is helpful for making sure you never make any horribly wrong changes to your model, since it will fail if the error grows beyond that range.
tests/test_model.py
    async def test_01_accuracy(self):
        # Use the test data to assess the model's accuracy
        res = await score(
            self.model,
            self.scorer,
            Feature("Y", float, 1),
            *[{"X": x, "Y": y} for x, y in TEST_DATA],
        )
        # Ensure the mean squared error is within an acceptable range
        self.assertTrue(0.0 <= res < 0.1)
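For reference, mean squared error is the average of the squared differences between the actual and predicted values. Conceptually it’s something like the following (an illustrative sketch, not DFFML’s actual implementation):
def mean_squared_error(actual, predicted):
    # Average of squared differences between paired values
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)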
Testing Prediction¶
Finally, we use the test data and model with the predict function. Then we check if each predicted Y value is within 10% of what it should be.
tests/test_model.py
    async def test_02_predict(self):
        # Get the prediction for each piece of test data
        async for i, features, prediction in predict(
            self.model, *[{"X": x, "Y": y} for x, y in TEST_DATA]
        ):
            # Grab the correct value
            correct = features["Y"]
            # Grab the predicted value
            prediction = prediction["Y"]["value"]
            # Check that the prediction is within 10% error of the actual value
            acceptable = 0.1
            self.assertLess(prediction, correct * (1.0 + acceptable))
            self.assertGreater(prediction, correct * (1.0 - acceptable))
Run the tests¶
We can run the tests using the unittest module. The create command gave us both unit tests and integration tests. We want to only run the unit tests right now (tests.test_model).
$ python -m unittest -v tests.test_model
test_00_train (tests.test_model.TestMySLRModel) ... ok
test_01_accuracy (tests.test_model.TestMySLRModel) ... ok
test_02_predict (tests.test_model.TestMySLRModel) ... ok
----------------------------------------------------------------------
Ran 3 tests in 0.003s
OK
If you want to see the output of the call to self.logger.debug, just set the LOGGING environment variable to debug.
$ LOGGING=debug python -m unittest -v tests.test_model
Entrypoint Registration¶
In the Writing a Model tutorial we referenced the new model on the command line via its entrypoint style path. This is in the format of file:ClassWithinFile, so for that tutorial it was myslr:MySLRModel.
That requires that the file be in the current working directory, or in a directory listed in the PYTHONPATH environment variable.
We can instead reference it by a shorter name, but we have to declare that name within the dffml.model entrypoint in entry_points.txt. This tells the Python packaging system that our package offers a plugin of the type dffml.model; we give the short name on the left side of the equals sign, and the entrypoint path on the right side.
entry_points.txt
[dffml.model]
myslr = dffml_model_myslr.myslr:MySLRModel
And remember that any time we modify the package’s metadata (setup.py, setup.cfg, or entry_points.txt), we have to run the setuptools egg_info hook to re-register the model with the entry_points system.
$ python setup.py egg_info
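Once registered, the model can also be loaded from Python by its short name via DFFML’s plugin loading mechanism. A sketch (assuming the Model.load helper, which DFFML provides for loading plugins by their entrypoint name):
from dffml import Model

# Look up the class registered under the "myslr" short name
MySLRModel = Model.load("myslr")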
Command Line Usage¶
Let’s add some training data to a CSV file.
train.csv
Years,Salary
1,40
2,50
3,60
4,70
5,80
Since we’ve registered our model as a dffml.model plugin, we can now reference it by its short name.
$ dffml train \
    -log debug \
    -model myslr \
    -model-features Years:int:1 \
    -model-predict Salary:float:1 \
    -model-location modeldir \
    -sources f=csv \
    -source-filename train.csv
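After training, you can make predictions from the command line as well. A sketch (the predict.csv file and its single row are hypothetical example data):
predict.csv
Years
6
$ dffml predict all \
    -model myslr \
    -model-features Years:int:1 \
    -model-predict Salary:float:1 \
    -model-location modeldir \
    -sources f=csv \
    -source-filename predict.csv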
Uploading to PyPI¶
You can now upload your package to PyPI, so that others can install it using pip, by running the following commands.
Warning
Don’t do this if you intend to contribute your model to the DFFML repo! Instead, place the top level directory, dffml-model-myslr, into model/slr within the DFFML source tree, and submit a pull request.
$ python3 setup.py sdist && python3 -m twine upload dist/*