Packaging a Model

In the previous tutorial we created a DFFML style model, which we could use from the command line, HTTP service, etc. We’re now going to take that model and package it so that it can be published to PyPI for others to download via pip and use just as they use the rest of the model plugins.

Create the Package

To create a new model we first create a new Python package. DFFML has a helper to create it for you.

The helper creates a model which does Simple Linear Regression (SLR), meaning it finds the best fit line for a dataset. If you’ve done the Writing a Model tutorial, the same code from that tutorial’s myslr.py will be present in dffml_model_myslr/myslr.py (though without any modifications you may have made).

You may know the best fit line as y = m * x + b
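For reference, the slope m and intercept b of that line can be computed directly from the data. Below is a minimal pure Python sketch of the least squares calculation; it is illustrative only and is not necessarily the exact code the helper generates.

def best_fit_line(x, y):
    # Means of the inputs and outputs
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Slope: covariance of x and y divided by the variance of x
    m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    # Intercept: chosen so the line passes through the mean point
    b = mean_y - m * mean_x
    return m, b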

$ dffml service dev create model dffml-model-myslr

Then enter the directory of the package we just created

$ cd dffml-model-myslr
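The helper lays out a standard Python package. The exact files can vary between DFFML versions, but the ones this tutorial refers to should be present, roughly like so:

dffml-model-myslr/
    setup.py
    setup.cfg
    entry_points.txt
    dffml_model_myslr/
        __init__.py
        myslr.py
    tests/
        __init__.py
        test_model.py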

Install the Package

If you’re planning on importing any third party packages, anything on PyPI, you’ll want to add them to the setup.cfg file first, under the install_requires section.

setup.cfg

    scikit-learn>=0.21.2
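For reference, in a standard setup.cfg the dependency goes under the install_requires key of the [options] section; if the generated file already has an install_requires list, just append the line to it. A minimal sketch:

[options]
install_requires =
    scikit-learn>=0.21.2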

Any time you modify the dependencies of a package you should re-install it so they get installed as well.

pip’s -e flag tells it we’re installing our package in development mode, which means that any time Python imports our package, it’s going to use the version that we’re working on here. If you don’t pass -e, then changes you make in this directory won’t take effect until you reinstall the package.

$ python -m pip install -e .[dev]

Testing

Packages should have tests, so that when you change things you find out whether they broke before you release your package to users.

Imports

Let’s import everything we’ll need. Since DFFML heavily leverages async code, we’ll have our test cases derive from AsyncTestCase, rather than unittest.TestCase.

tests/test_model.py

import tempfile

from dffml import train, score, predict, Feature, Features, AsyncTestCase
from dffml.accuracy import MeanSquaredErrorAccuracy

from dffml_model_myslr.myslr import MySLRModel

Test data

We usually try to randomly generate training and test data, but for this example we’re just going to hard code in some data (see the sketch after the data below if you’d rather generate it).

tests/test_model.py

TRAIN_DATA = [
    [12.4, 11.2],
    [14.3, 12.5],
    [14.5, 12.7],
    [14.9, 13.1],
    [16.1, 14.1],
    [16.9, 14.8],
    [16.5, 14.4],
    [15.4, 13.4],
    [17.0, 14.9],
    [17.9, 15.6],
    [18.8, 16.4],
    [20.3, 17.7],
    [22.4, 19.6],
    [19.4, 16.9],
    [15.5, 14.0],
    [16.7, 14.6],
]

TEST_DATA = [
    [17.3, 15.1],
    [18.4, 16.1],
    [19.2, 16.8],
    [17.4, 15.2],
    [19.5, 17.0],
    [19.7, 17.2],
    [21.2, 18.6],
]
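If you’d rather generate data than hard code it, a minimal sketch of one way to do that (assuming numpy is installed; the slope, intercept, and noise values here are only illustrative) could look like:

import numpy as np

def make_linear_data(n, m=0.88, b=0.3, noise=0.1, seed=42):
    # Draw X values, then compute Y = m * X + b plus a little noise
    rng = np.random.default_rng(seed)
    x = rng.uniform(12.0, 22.0, size=n)
    y = m * x + b + rng.normal(0.0, noise, size=n)
    return [[float(xi), float(yi)] for xi, yi in zip(x, y)]

TRAIN_DATA = make_linear_data(16)
TEST_DATA = make_linear_data(7, seed=7)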

TestCase Class

We create a temporary directory for our tests to use, and clean it up when they’re done.

tests/test_model.py

class TestMySLRModel(AsyncTestCase):
    @classmethod
    def setUpClass(cls):
        # Create a temporary directory to store the trained model
        cls.model_dir = tempfile.TemporaryDirectory()
        # Create an instance of the model
        cls.model = MySLRModel(
            features=Features(Feature("X", float, 1)),
            predict=Feature("Y", float, 1),
            location=cls.model_dir.name,
        )
        # Create the scorer that will be used to assess the model's accuracy
        cls.scorer = MeanSquaredErrorAccuracy()

    @classmethod
    def tearDownClass(cls):
        # Remove the temporary directory where the model was stored to cleanup
        cls.model_dir.cleanup()

Testing Train

Similarly to the quickstart, all we need to do is pass the model and training data to the train function.

The tests are prefixed with numbers to indicate what order they should be run in, ensuring that accuracy and predict tests always have a trained model to work with.

We’re using the * operator here to expand the list of X, Y pair dicts. See the official Python documentation on Unpacking Argument Lists for more information on how the * operator works.
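To make the expansion concrete, here is a tiny illustration (the records variable is just for this example):

records = [{"X": 12.4, "Y": 11.2}, {"X": 14.3, "Y": 12.5}]
# train(self.model, *records) is equivalent to
# train(self.model, records[0], records[1])
# i.e. each dict becomes its own positional argument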

tests/test_model.py

    async def test_00_train(self):
        # Train the model on the training data
        await train(self.model, *[{"X": x, "Y": y} for x, y in TRAIN_DATA])

Testing Accuracy

Once again, all we need to do is pass the model and test data to the score function, then check that the result is in an acceptable range. This test is helpful for making sure you never make any horribly wrong changes to your model, since it will check that the score stays within an acceptable range.

tests/test_model.py

    async def test_01_accuracy(self):
        # Use the test data to assess the model's accuracy
        res = await score(
            self.model,
            self.scorer,
            Feature("Y", float, 1),
            *[{"X": x, "Y": y} for x, y in TEST_DATA],
        )
        # Ensure the mean squared error is within an acceptable range (below 0.1)
        self.assertTrue(0.0 <= res < 0.1)

Testing Prediction

Finally, we use the test data and model with the predict function. Then we check if each predicted Y value is within 10% of what it should be.

tests/test_model.py

    async def test_02_predict(self):
        # Get the prediction for each piece of test data
        async for i, features, prediction in predict(
            self.model, *[{"X": x, "Y": y} for x, y in TEST_DATA]
        ):
            # Grab the correct value
            correct = features["Y"]
            # Grab the predicted value
            prediction = prediction["Y"]["value"]
            # Check that the prediction is within 10% error of the actual value
            acceptable = 0.1
            self.assertLess(prediction, correct * (1.0 + acceptable))
            self.assertGreater(prediction, correct * (1.0 - acceptable))

Run the tests

We can run the tests using the unittest module. The create command gave us both unit tests and integration tests. We want to only run the unit tests right now (tests.test_model).

$ python -m unittest -v tests.test_model
test_00_train (tests.test_model.TestMySLRModel) ... ok
test_01_accuracy (tests.test_model.TestMySLRModel) ... ok
test_02_predict (tests.test_model.TestMySLRModel) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.003s

OK

If you want to see the output of the call to self.logger.debug, just set the LOGGING environment variable to debug.

$ LOGGING=debug python -m unittest -v tests.test_model

Entrypoint Registration

In the Writing a Model tutorial we referenced the new model on the command line via its entrypoint style path. This is in the format file:ClassWithinFile, so for that tutorial it was myslr:MySLRModel.

That requires that the file be in the current working directory, or in a directory listed in the PYTHONPATH environment variable.

We can instead reference it by a shorter name, but we have to declare that name within the dffml.model entrypoint in entry_points.txt. This tells the Python packaging system that our package offers a plugin of type dffml.model; the short name goes on the left side of the equals sign, and the entrypoint path on the right.

entry_points.txt

[dffml.model]
myslr = dffml_model_myslr.myslr:MySLRModel

And remember that any time we modify entry_points.txt or other packaging metadata, we have to run the setuptools egg_info hook to re-register the model with the entry_points system.

$ python setup.py egg_info
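One way to confirm that the entry point is now visible is to list the dffml.model entry points with the standard pkg_resources API; this is plain setuptools functionality, not a DFFML specific command.

import pkg_resources

for entry_point in pkg_resources.iter_entry_points("dffml.model"):
    print(entry_point)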

Command Line Usage

Let’s add some training data to a CSV file.

train.csv

Years,Salary
1,40
2,50
3,60
4,70
5,80

Since we’ve registered our model as a dffml.model plugin, we can now reference it by its short name.

$ dffml train \
    -log debug \
    -model myslr \
    -model-features Years:int:1 \
    -model-predict Salary:float:1 \
    -model-location modeldir \
    -sources f=csv \
    -source-filename train.csv
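Once trained, the model can also be used to make predictions from the command line. A sketch of what that looks like (assuming a predict.csv file with a Years column, which is not created in this tutorial) would be:

$ dffml predict all \
    -model myslr \
    -model-features Years:int:1 \
    -model-predict Salary:float:1 \
    -model-location modeldir \
    -sources f=csv \
    -source-filename predict.csv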

Uploading to PyPi

You can now upload your package to PyPI, so that others can install it using pip, by running the following commands.

Warning

Don’t do this if you intend to contribute your model to the DFFML repo! Instead, place the top level directory, dffml-model-myslr, into model/slr within the DFFML source tree, and submit a pull request.
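If you don’t already have twine installed, it’s available from PyPI:

$ python3 -m pip install twine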

$ python3 setup.py sdist && python3 -m twine upload dist/*