Writing a Model

For this tutorial we’ll implement our own machine learning algorithm from scratch. That means we won’t use a machine learning library to handle the calculations for us; we’ll do them all ourselves.

Our model will perform Simple Linear Regression (SLR), which means it finds the best fit line for a dataset.

You may know the best fit line as y = m * x + b

We’ll be working in a new file named myslr.py. Open or create it now.

Imports

We’re going to need a few modules from the standard library. Let’s import them.

  • pathlib will be used to define the directory where saved model state should be stored.

  • statistics will be used to help us calculate the average (mean) of our data.

  • typing is for Python’s static type hinting. It lets us give hints to our editor or IDE so they can help us check our code before we run it.

import pathlib
import statistics
from typing import AsyncIterator, Type

We’ll also need a few things from DFFML.

from dffml import (
    config,
    field,
    entrypoint,
    SimpleModel,
    ModelNotTrained,
    Feature,
    Features,
    SourcesContext,
    Record,
)

Math

The first thing you’ll want to do is add some functions which calculate the best fit line and its accuracy. These aren’t important for understanding how DFFML works, so we’re going to skip over their logic in this tutorial. You can write your own versions to find the best fit line for lists of X and Y data if you want, or you can copy these.

def matrix_subtract(one, two):
    return [
        one_element - two_element for one_element, two_element in zip(one, two)
    ]


def matrix_multiply(one, two):
    return [
        one_element * two_element for one_element, two_element in zip(one, two)
    ]


def squared_error(y, line):
    return sum(map(lambda element: element ** 2, matrix_subtract(y, line)))


def coeff_of_deter(y, regression_line):
    y_mean_line = [statistics.mean(y)] * len(y)
    squared_error_mean = squared_error(y, y_mean_line)
    squared_error_regression = squared_error(y, regression_line)
    # Handle 1.0 accuracy case
    if squared_error_mean == 0:
        return 1.0
    return 1.0 - (squared_error_regression / squared_error_mean)


def best_fit_line(x, y):
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    m = (mean_x * mean_y - statistics.mean(matrix_multiply(x, y))) / (
        (mean_x ** 2) - statistics.mean(matrix_multiply(x, x))
    )
    b = mean_y - (m * mean_x)
    regression_line = [m * x_element + b for x_element in x]
    accuracy = coeff_of_deter(y, regression_line)
    return m, b, accuracy
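
As a quick sanity check, the helpers can be exercised on the same numbers that appear later in train.csv. The sketch below restates compact versions of the functions so that it runs on its own. The years and salary lists are the Years and Salary columns from this tutorial, and the values in the comments follow from working the arithmetic by hand.

```python
import statistics


def squared_error(y, line):
    # Sum of squared differences between the observed and the fitted values
    return sum((y_i - line_i) ** 2 for y_i, line_i in zip(y, line))


def coeff_of_deter(y, regression_line):
    # R^2 = 1 - (error of the regression line / error of the mean line)
    y_mean_line = [statistics.mean(y)] * len(y)
    squared_error_mean = squared_error(y, y_mean_line)
    if squared_error_mean == 0:
        return 1.0
    return 1.0 - (squared_error(y, regression_line) / squared_error_mean)


def best_fit_line(x, y):
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    mean_xy = statistics.mean([a * b for a, b in zip(x, y)])
    mean_xx = statistics.mean([a * a for a in x])
    # Least squares slope and intercept
    m = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_xx)
    b = mean_y - (m * mean_x)
    regression_line = [m * x_i + b for x_i in x]
    return m, b, coeff_of_deter(y, regression_line)


years = [1, 2, 3, 4, 5]
salary = [40, 50, 60, 70, 80]

m, b, accuracy = best_fit_line(years, salary)
print(m, b, accuracy)  # 10.0 30.0 1.0
print(m * 8 + b)  # 110.0, the salary predicted for 8 years
```

Because the sample data lies exactly on a line, the coefficient of determination comes out as 1.0. Noisier data would give a lower value.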

Config

DFFML makes it so that we can use our models from the command line, HTTP API, and of course, Python. This all works because we define a config class for each model.

Anything that a user might want to tweak about a model’s behavior should go in the Config class for the model. The naming convention is TheName + Model + Config.

Hyperparameters for a model should live inside the model’s config, ideally at the top level and not nested within another structure.

Our model has three configurable properties.

  • features. A list of Feature objects whose names will be present within each Record. Our model only supports a single feature. It will use the first feature in the features list as the X value for the regression line.

  • predict. The name of the feature we want to predict. Within each Record this will be the data that our model should use as the Y value for the regression line.

  • location. The location on disk where we’ll save our model to and load it from.

@config
class MySLRModelConfig:
    features: Features = field(
        "Features to train on (myslr only supports one)"
    )
    predict: Feature = field("Label or the value to be predicted")
    location: pathlib.Path = field("Location where state should be saved")
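
To illustrate why a single config class can serve several interfaces, here is a rough sketch that derives command line flag names from the properties of a config class. It uses the standard library’s dataclasses rather than DFFML’s config decorator, so ExampleSLRConfig and flag_names are hypothetical names, and this is not how DFFML itself is implemented. The -model- prefix is borrowed from the flags shown in the Command Line Usage section.

```python
import dataclasses
import pathlib


@dataclasses.dataclass
class ExampleSLRConfig:
    # Hypothetical stand-in for MySLRModelConfig, using plain types
    features: list
    predict: str
    location: pathlib.Path


def flag_names(config_cls, prefix="-model-"):
    # One flag per config property, named after the property
    return [prefix + f.name for f in dataclasses.fields(config_cls)]


print(flag_names(ExampleSLRConfig))
# ['-model-features', '-model-predict', '-model-location']
```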

Class

The naming convention for classes in DFFML is TheName + PluginType. We’re making a Model plugin, so the name is MySLRModel.

The plugin system needs us to do two things to ensure we can access our new model from the DFFML command line and other interfaces.

We must use the entrypoint decorator passing it the name we want to reference the model by. For anything that allows for specifying multiple models, you can configure all models of the same type by using the string provided here.

@entrypoint("myslr")
class MySLRModel(SimpleModel):

We must set the CONFIG attribute to the respective Config class.

    # The configuration class needs to be set as the CONFIG property
    CONFIG: Type = MySLRModelConfig

We can override the __init__() method to do validation on the features config property. Simple linear regression only supports one input feature, so we will raise a ValueError if the user supplies more than one feature.

    def __init__(self, config):
        super().__init__(config)
        # Simple linear regression only supports a single input feature
        if len(self.config.features) != 1:
            raise ValueError("Model only supports a single feature")

Train

The train method should train the model. First we ask the data sources for any records containing the features our model was told to use. with_features takes a list of feature names each record should contain. This lets us avoid records that are not applicable to our model.

Models should save their state to disk after training. Classes derived from SimpleModel can put anything they want saved into self.storage, which is saved and loaded from a JSON file on disk.

    async def train(self, sources: SourcesContext) -> None:
        # X and Y data
        x = []
        y = []
        # Go through all records that have the feature we're training on and the
        # feature we want to predict.
        async for record in sources.with_features(
            [self.config.features[0].name, self.config.predict.name]
        ):
            x.append(record.feature(self.config.features[0].name))
            y.append(record.feature(self.config.predict.name))
        # Use self.logger to report how many records are being used for training
        self.logger.debug("Number of training records: %d", len(x))
        # Save m, b, and accuracy
        self.storage["regression_line"] = best_fit_line(x, y)
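
To get a feel for what saving state to a JSON file involves, here is a minimal sketch using only the standard library. The save_storage and load_storage helpers and the storage.json file name are assumptions made for illustration; they are not DFFML’s actual implementation.

```python
import json
import pathlib
import tempfile


def save_storage(location, storage):
    # Hypothetical helper: write the storage dict to <location>/storage.json
    location = pathlib.Path(location)
    location.mkdir(parents=True, exist_ok=True)
    (location / "storage.json").write_text(json.dumps(storage))


def load_storage(location):
    # Hypothetical helper: read the dict back, or start empty if nothing
    # has been saved yet
    path = pathlib.Path(location) / "storage.json"
    if not path.is_file():
        return {}
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tempdir:
    save_storage(tempdir, {"regression_line": (10.0, 30.0, 1.0)})
    loaded = load_storage(tempdir)

# JSON has no tuple type, so the tuple comes back as a list
print(loaded)  # {'regression_line': [10.0, 30.0, 1.0]}
m, b, accuracy = loaded["regression_line"]
```

One thing worth noticing is that a tuple stored this way is read back as a list. Unpacking into m, b, and accuracy works the same for both.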

Predict

To make a prediction, we load the regression_line from self.storage and unpack it into m, b, and accuracy. We use m and b to calculate y = m * x + b, and we use the accuracy as our estimated confidence in the prediction.

We call record.predicted passing it the name of the feature we predicted, the predicted value, and the confidence in our prediction.

    async def predict(self, sources: SourcesContext) -> AsyncIterator[Record]:
        # Load saved regression line
        regression_line = self.storage.get("regression_line", None)
        # Ensure the model has been trained before we try to make a prediction
        if not self.is_trained:
            raise ModelNotTrained("Train model before prediction")
        # Expand the regression_line into named variables
        m, b, accuracy = regression_line
        # Iterate through each record that needs a prediction
        async for record in sources.with_features(
            [self.config.features[0].name]
        ):
            # Grab the x data from the record
            x = record.feature(self.config.features[0].name)
            # Calculate y
            y = m * x + b
            # Set the calculated value with the estimated accuracy
            record.predicted(self.config.predict.name, y, accuracy)
            # Yield the record to the caller
            yield record
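
The same flow can be tried without DFFML installed. The sketch below imitates it with plain dictionaries and an async generator. The with_features function here is a hypothetical stand-in for the one provided by the sources, and the records are made up for illustration; the shape of the stored prediction mirrors the JSON output shown later in this tutorial.

```python
import asyncio


async def with_features(records, names):
    # Hypothetical stand-in for sources.with_features(): only yield records
    # that contain every requested feature
    for record in records:
        if all(name in record["features"] for name in names):
            yield record


async def predict(records, regression_line):
    # Expand the regression line into named variables
    m, b, accuracy = regression_line
    async for record in with_features(records, ["Years"]):
        x = record["features"]["Years"]
        # Store the predicted value along with the confidence
        record["prediction"] = {
            "Salary": {"value": m * x + b, "confidence": accuracy}
        }
        yield record


async def main():
    records = [
        {"features": {"Years": 8}},
        {"features": {"Age": 30}},  # Skipped: has no Years feature
    ]
    return [record async for record in predict(records, (10.0, 30.0, 1.0))]


results = asyncio.run(main())
print(results)
```

Because predict is an async generator, the caller receives each record as soon as its prediction has been made rather than waiting for the whole dataset.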

Python Usage

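We can use our new model from Python code as follows. This example calls the asynchronous train, score, and predict functions from dffml inside an async main() function, which we run with asyncio.run(). If you would rather not write an async function, dffml.noasync contains versions of the high level functions which we don’t have to be in an async function to call.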

Let’s first create our training, test, and prediction data CSV files.

train.csv

Years,Salary
1,40
2,50
3,60
4,70
5,80

test.csv

Years,Salary
6,90
7,100

predict.csv

Years
8
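
If you would rather generate the three files from Python than type them by hand, something like the following should work. The write_csv_files helper is a hypothetical name, and the rows are copied from the listings above. The sketch writes to a temporary directory so that it has no side effects; pass "." instead to create the files next to run.py.

```python
import csv
import pathlib
import tempfile

# File contents copied from the listings above
CSV_FILES = {
    "train.csv": [["Years", "Salary"], [1, 40], [2, 50], [3, 60], [4, 70], [5, 80]],
    "test.csv": [["Years", "Salary"], [6, 90], [7, 100]],
    "predict.csv": [["Years"], [8]],
}


def write_csv_files(directory):
    # Write each file into the given directory, header row first
    directory = pathlib.Path(directory)
    for filename, rows in CSV_FILES.items():
        with open(directory / filename, "w", newline="") as fileobj:
            csv.writer(fileobj).writerows(rows)


target = tempfile.mkdtemp()
write_csv_files(target)
print(sorted(path.name for path in pathlib.Path(target).iterdir()))
# ['predict.csv', 'test.csv', 'train.csv']
```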

Then we can write our Python file, run.py.

run.py

import asyncio

from dffml import (
    MeanSquaredErrorAccuracy,
    Features,
    Feature,
    train,
    score,
    predict,
)

from myslr import MySLRModel


async def main():
    # Configure the model
    model = MySLRModel(
        features=Features(Feature("Years", int, 1)),
        predict=Feature("Salary", int, 1),
        location="model",
    )

    # Train the model
    await train(model, "train.csv")

    # Assess accuracy
    scorer = MeanSquaredErrorAccuracy()
    print(
        "Accuracy:",
        await score(model, scorer, Feature("Salary", int, 1), "test.csv"),
    )

    # Make predictions
    async for i, features, prediction in predict(model, "predict.csv"):
        features["Salary"] = prediction["Salary"]["value"]
        print(features)


if __name__ == "__main__":
    asyncio.run(main())

We run it as we would any other Python file.

$ python3 run.py
Accuracy: 1.0
{'Years': 8, 'Salary': 110.0}

Command Line Usage

To use your new model on the command line we’ll reference it by its entrypoint style path. This is in the format of file:ClassWithinFile, so for this it’ll be myslr:MySLRModel.
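
To make the file:ClassWithinFile format concrete, here is one way such a path could be resolved using importlib. The load_entrypoint function is a hypothetical helper written for illustration and is not DFFML’s own loader. A standard library class is used so the example runs without myslr.py being present.

```python
import importlib


def load_entrypoint(path):
    # Split "module:ClassName" into its two halves
    module_name, _, class_name = path.partition(":")
    # Import the module, then look the class up by name within it
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# "myslr:MySLRModel" would be resolved the same way, provided myslr.py is
# importable from the current directory
cls = load_entrypoint("collections:OrderedDict")
print(cls)  # <class 'collections.OrderedDict'>
```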

We do the same steps we did with Python, only using the command line interface.

$ dffml train \
    -log debug \
    -model myslr:MySLRModel \
    -model-features Years:int:1 \
    -model-predict Salary:float:1 \
    -model-location modeldir \
    -sources f=csv \
    -source-filename train.csv

There’s no output from the training command if everything went well.

Now let’s make predictions.

$ dffml predict all \
    -model myslr:MySLRModel \
    -model-features Years:int:1 \
    -model-predict Salary:float:1 \
    -model-location modeldir \
    -sources f=csv \
    -source-filename predict.csv
[
    {
        "extra": {},
        "features": {
            "Years": 8
        },
        "key": "0",
        "last_updated": "2020-05-24T22:48:11Z",
        "prediction": {
            "Salary": {
                "confidence": 1.0,
                "value": 110.0
            }
        }
    }
]

HTTP Server Usage

Your model will also be accessible through the HTTP API, using a similar syntax.

First we need to install the HTTP service, which is the server that will serve our model. See the HTTP API docs for more information on the HTTP service.

$ python -m pip install -U dffml-service-http

We start the HTTP service and tell it that we want to make our model accessible via the HTTP Model API.

Warning

You should be sure to read the Security docs! This example of running the HTTP API is insecure and is only used to help you get up and running.

$ dffml service http server -insecure -cors '*' -addr 0.0.0.0 -port 8080 \
    -models mymodel=myslr:MySLRModel \
    -model-features Years:int:1 \
    -model-predict Salary:float:1 \
    -model-location modeldir

We can then ask the HTTP service to make predictions, or do training or accuracy assessment.

$ curl -f http://localhost:8080/model/mymodel/predict/0 \
    --header "Content-Type: application/json" \
    --data '{"0": {"features": {"Years": 8}}}'
{
    "iterkey": null,
    "records": {
        "0": {
            "key": "0",
            "features": {
                "Years": 8
            },
            "prediction": {
                "Salary": {
                    "confidence": 1.0,
                    "value": 110.0
                }
            },
            "last_updated": "2020-04-14T20:07:11Z",
            "extra": {}
        }
    }
}
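
The same request can be built from Python with the standard library. This is only a sketch: build_predict_request and predict_via_http are hypothetical helper names, while the URL path, header, and JSON body are taken from the curl command above. Sending the request only succeeds while the HTTP service is running, so the example below just builds the request and prints where it would go.

```python
import json
import urllib.request


def build_predict_request(base_url, model_label, records):
    # Same URL path, header, and JSON body as the curl command above
    return urllib.request.Request(
        f"{base_url}/model/{model_label}/predict/0",
        data=json.dumps(records).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_via_http(request):
    # Only works while the HTTP service from this section is running
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_predict_request(
    "http://localhost:8080", "mymodel", {"0": {"features": {"Years": 8}}}
)
print(request.get_method(), request.full_url)
# POST http://localhost:8080/model/mymodel/predict/0
```

With the service running, calling predict_via_http(request) should return the parsed form of the JSON response shown above.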