Writing a Model¶
For this tutorial we’ll implement our own machine learning algorithm from scratch. That means we won’t use a machine learning focused library to handle the calculations for us; we’ll do them all ourselves.
Our model will perform Simple Linear Regression (SLR), which means it finds the best fit line for a dataset.
You may know the best fit line as y = m * x + b
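For reference, the slope and intercept of that line can be written in closed form. The following is the standard least-squares solution, arranged to match what the best_fit_line and coeff_of_deter functions in the Math section compute. The overline notation for a mean is introduced here for illustration and does not appear elsewhere in this tutorial.

```latex
\begin{aligned}
m &= \frac{\bar{x}\,\bar{y} - \overline{xy}}{\bar{x}^{2} - \overline{x^{2}}}, \\
b &= \bar{y} - m\,\bar{x}, \\
R^{2} &= 1 - \frac{\sum_{i} \bigl(y_i - (m x_i + b)\bigr)^{2}}{\sum_{i} \bigl(y_i - \bar{y}\bigr)^{2}}
\end{aligned}
```

Here R² is the coefficient of determination, which this tutorial reports as the accuracy. The code treats the case where the denominator of R² is zero as an accuracy of 1.0.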
We’ll be working in a new file named myslr.py. Open or create it now.
Imports¶
We’re going to need a few modules from the standard library. Let’s import them.
pathlib will be used to define the directory where saved model state should be stored.
statistics will be used to help us calculate the average (mean) of our data.
typing is for Python’s static type hinting. It lets us give hints to our editor or IDE so they can help us check our code before we run it.
import pathlib
import statistics
from typing import AsyncIterator, Type
We’ll also need a few things from DFFML.
from dffml import (
    config,
    field,
    entrypoint,
    SimpleModel,
    ModelNotTrained,
    Feature,
    Features,
    SourcesContext,
    Record,
)
Math¶
The first thing you’ll want to do is add some functions which calculate the best fit line and its accuracy. These aren’t important for understanding how DFFML works, so we’re going to skip over their logic in this tutorial. You can write your own versions to find the best fit line for lists of X and Y data if you want, or you can copy these.
def matrix_subtract(one, two):
    return [
        one_element - two_element for one_element, two_element in zip(one, two)
    ]


def matrix_multiply(one, two):
    return [
        one_element * two_element for one_element, two_element in zip(one, two)
    ]


def squared_error(y, line):
    return sum(map(lambda element: element ** 2, matrix_subtract(y, line)))


def coeff_of_deter(y, regression_line):
    y_mean_line = [statistics.mean(y)] * len(y)
    squared_error_mean = squared_error(y, y_mean_line)
    squared_error_regression = squared_error(y, regression_line)
    # Handle 1.0 accuracy case
    if squared_error_mean == 0:
        return 1.0
    return 1.0 - (squared_error_regression / squared_error_mean)


def best_fit_line(x, y):
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    m = (mean_x * mean_y - statistics.mean(matrix_multiply(x, y))) / (
        (mean_x ** 2) - statistics.mean(matrix_multiply(x, x))
    )
    b = mean_y - (m * mean_x)
    regression_line = [m * x_element + b for x_element in x]
    accuracy = coeff_of_deter(y, regression_line)
    return m, b, accuracy
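To see these formulas at work, here is a minimal standalone sketch that fits a line to the training data used later in the Python Usage section. The function name fit_line is made up for this example; it repeats only the slope and intercept arithmetic of best_fit_line so that it can run on its own.

```python
import statistics


def fit_line(x, y):
    """Return (m, b) for the least-squares line through the points (x, y)."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    mean_xy = statistics.mean([xi * yi for xi, yi in zip(x, y)])
    mean_xx = statistics.mean([xi * xi for xi in x])
    m = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_xx)
    b = mean_y - m * mean_x
    return m, b


# Training data from train.csv in the Python Usage section
years = [1, 2, 3, 4, 5]
salary = [40, 50, 60, 70, 80]

m, b = fit_line(years, salary)
# Predict the salary for 8 years using y = m * x + b
prediction = m * 8 + b
print(m, b, prediction)  # prints: 10.0 30.0 110.0
```

The data lies exactly on the line y = 10 * x + 30, which is why the prediction for 8 years later in this tutorial is 110.0.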
Config¶
DFFML makes it so that we can use our models from the command line, HTTP API, and of course, Python. This all works because we define a config class for each model.
Anything that a user might want to tweak about a model’s behavior should go in the Config class for the model. The naming convention is TheName + Model + Config.
Hyperparameters for a model should live inside the model’s config. Ideally at the top level and not nested within another structure.
Our model has three configurable properties.
features. A list of Feature objects whose names will be present within each Record. Our model only supports a single feature. It will use the first feature in the features list as the X value for the regression line.
predict. The name of the feature we want to predict. Within each Record this will be the data that our model should use as the Y value for the regression line.
location. The location on disk where we’ll save our model to and load it from.
@config
class MySLRModelConfig:
    features: Features = field(
        "Features to train on (myslr only supports one)"
    )
    predict: Feature = field("Label or the value to be predicted")
    location: pathlib.Path = field("Location where state should be saved")
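One way to picture why a single config class can serve the command line, the HTTP API, and Python alike: each property is declared once with a name, a type, and a description, so any interface can enumerate them. The sketch below uses the standard library’s dataclasses module purely as an analogy. It is not DFFML’s implementation, and ExampleConfig and describe are hypothetical names invented for this illustration.

```python
import dataclasses
import pathlib


@dataclasses.dataclass
class ExampleConfig:
    # Hypothetical stand-in for MySLRModelConfig, using plain Python types
    # instead of DFFML's Features and Feature objects
    features: list = dataclasses.field(
        metadata={"description": "Features to train on"}
    )
    predict: str = dataclasses.field(
        metadata={"description": "Label or the value to be predicted"}
    )
    location: pathlib.Path = dataclasses.field(
        metadata={"description": "Location where state should be saved"}
    )


def describe(config_cls):
    """Map each declared property name to its description."""
    return {
        prop.name: prop.metadata["description"]
        for prop in dataclasses.fields(config_cls)
    }


descriptions = describe(ExampleConfig)
for name, description in descriptions.items():
    print(f"{name}: {description}")
```

An interface could walk a listing like this to build help text or option names without knowing anything else about the model.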
Class¶
The naming convention for classes in DFFML is TheName + PluginType.
We’re making a Model plugin, so the name is MySLRModel.
The plugin system needs us to do two things to ensure we can access our new model from the DFFML command line and other interfaces.
We must use the entrypoint decorator, passing it the name we want to reference the model by. For anything that allows for specifying multiple models, you can configure all models of the same type by using the string provided here.
@entrypoint("myslr")
class MySLRModel(SimpleModel):
We must set the CONFIG attribute to the respective Config class.
    # The configuration class needs to be set as the CONFIG property
    CONFIG: Type = MySLRModelConfig
We can override the __init__() method to do validation on the features config property. Simple linear regression only supports one input feature, so we will raise a ValueError if the user does not supply exactly one feature.
    def __init__(self, config):
        super().__init__(config)
        # Simple linear regression only supports a single input feature
        if len(self.config.features) != 1:
            raise ValueError("Model only supports a single feature")
Train¶
The train method should train the model. First we ask the data sources for any records containing the features our model was told to use.
with_features takes a list of feature names each record should contain. This lets us avoid records that are not applicable to our model.
Models should save their state to disk after training. Classes derived from SimpleModel can put anything they want saved into self.storage, which is saved to and loaded from a JSON file on disk.
    async def train(self, sources: SourcesContext) -> None:
        # X and Y data
        x = []
        y = []
        # Go through all records that have the feature we're training on and the
        # feature we want to predict.
        async for record in sources.with_features(
            [self.config.features[0].name, self.config.predict.name]
        ):
            x.append(record.feature(self.config.features[0].name))
            y.append(record.feature(self.config.predict.name))
        # Use self.logger to report how many records are being used for training
        self.logger.debug("Number of training records: %d", len(x))
        # Save m, b, and accuracy
        self.storage["regression_line"] = best_fit_line(x, y)
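The idea that self.storage is saved to and loaded from a JSON file can be pictured with the standard library alone. This is a rough sketch of the concept, not how SimpleModel does it internally: the helper names save_storage and load_storage and the file name storage.json are all assumptions made for this example.

```python
import json
import pathlib
import tempfile


def save_storage(location, storage):
    """Write the storage dict as JSON inside the location directory."""
    location = pathlib.Path(location)
    location.mkdir(parents=True, exist_ok=True)
    # "storage.json" is a hypothetical file name chosen for this sketch
    (location / "storage.json").write_text(json.dumps(storage))


def load_storage(location):
    """Read the storage dict back, or return an empty dict if none was saved."""
    path = pathlib.Path(location) / "storage.json"
    if not path.is_file():
        return {}
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tempdir:
    # m, b, and accuracy for a perfect fit of y = 10 * x + 30
    save_storage(tempdir, {"regression_line": (10.0, 30.0, 1.0)})
    restored = load_storage(tempdir)

print(restored["regression_line"])  # prints: [10.0, 30.0, 1.0]
```

Note that JSON has no tuple type, so a tuple stored this way comes back as a list. It still unpacks into three names with m, b, accuracy = restored["regression_line"].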
Predict¶
To make a prediction, we access the regression_line within self.storage and unpack it into m, b, and accuracy. We use m and b to calculate y = m * x + b, and we use the accuracy as our estimated confidence in the prediction.
We call record.predicted, passing it the name of the feature we predicted, the predicted value, and the confidence in our prediction.
    async def predict(self, sources: SourcesContext) -> AsyncIterator[Record]:
        # Load saved regression line
        regression_line = self.storage.get("regression_line", None)
        # Ensure the model has been trained before we try to make a prediction
        if not self.is_trained:
            raise ModelNotTrained("Train model before prediction")
        # Expand the regression_line into named variables
        m, b, accuracy = regression_line
        # Iterate through each record that needs a prediction
        async for record in sources.with_features(
            [self.config.features[0].name]
        ):
            # Grab the x data from the record
            x = record.feature(self.config.features[0].name)
            # Calculate y
            y = m * x + b
            # Set the calculated value with the estimated accuracy
            record.predicted(self.config.predict.name, y, accuracy)
            # Yield the record to the caller
            yield record
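Because predict contains a yield inside an async def, it is an asynchronous generator: callers receive records one at a time with async for. The sketch below shows that pattern using only the standard library. The fake_records function is a hypothetical stand-in for sources.with_features, and plain dicts stand in for DFFML Record objects.

```python
import asyncio


async def fake_records():
    """Hypothetical stand-in for sources.with_features(): yields dict records."""
    for years in (6, 7, 8):
        yield {"Years": years}


async def predict(records, m, b):
    """Yield each record with a predicted Salary added, one at a time."""
    async for record in records:
        record["Salary"] = m * record["Years"] + b
        yield record


async def main():
    # Consume the async generator with "async for", as a caller would
    return [record async for record in predict(fake_records(), 10.0, 30.0)]


results = asyncio.run(main())
for record in results:
    print(record)
```

Yielding records as they are processed means the caller can start using predictions before every record has been read.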
Python Usage¶
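We can use our new model from Python code as follows. The example below calls the asynchronous train, score, and predict functions from dffml, so they are awaited inside an async function which we run with asyncio.run(). The dffml.noasync module contains versions of the high-level functions which we don’t have to be in an async function to call.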
Let’s first create our training, test, and prediction data CSV files.
train.csv
Years,Salary
1,40
2,50
3,60
4,70
5,80
test.csv
Years,Salary
6,90
7,100
predict.csv
Years
8
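If you would rather generate these files than type them by hand, a short script along the following lines should work. The rows are copied from the listings above; the function name write_datasets is made up for this example.

```python
import csv
import pathlib
import tempfile

# File contents copied from the train.csv, test.csv, and predict.csv listings
DATASETS = {
    "train.csv": [("Years", "Salary"), (1, 40), (2, 50), (3, 60), (4, 70), (5, 80)],
    "test.csv": [("Years", "Salary"), (6, 90), (7, 100)],
    "predict.csv": [("Years",), (8,)],
}


def write_datasets(directory):
    """Write each dataset into directory and return the created paths."""
    directory = pathlib.Path(directory)
    paths = []
    for filename, rows in DATASETS.items():
        path = directory / filename
        # newline="" lets the csv module control the line endings itself
        with path.open("w", newline="") as handle:
            csv.writer(handle, lineterminator="\n").writerows(rows)
        paths.append(path)
    return paths


# Demonstrate in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tempdir:
    created = write_datasets(tempdir)
    train_text = created[0].read_text()

print(train_text)
```

Calling write_datasets(".") instead would create the three files in the current directory, next to myslr.py.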
Then we can write our Python file, run.py.
run.py
import asyncio

from dffml import (
    MeanSquaredErrorAccuracy,
    Features,
    Feature,
    train,
    score,
    predict,
)

from myslr import MySLRModel


async def main():
    # Configure the model
    model = MySLRModel(
        features=Features(Feature("Years", int, 1)),
        predict=Feature("Salary", int, 1),
        location="model",
    )
    # Train the model
    await train(model, "train.csv")
    # Assess accuracy
    scorer = MeanSquaredErrorAccuracy()
    print(
        "Accuracy:",
        await score(model, scorer, Feature("Salary", int, 1), "test.csv"),
    )
    # Make predictions
    async for i, features, prediction in predict(model, "predict.csv"):
        features["Salary"] = prediction["Salary"]["value"]
        print(features)


if __name__ == "__main__":
    asyncio.run(main())
We run it as we would any other Python file.
$ python3 run.py
Accuracy: 1.0
{'Years': 8, 'Salary': 110.0}
Command Line Usage¶
To use your new model on the command line, we’ll reference it by its entrypoint style path. This is in the format of file:ClassWithinFile, so for this model it’ll be myslr:MySLRModel.
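Paths of the form file:ClassWithinFile are a common Python convention, and resolving one typically means importing the module and then looking up the attribute. The sketch below shows that general technique with importlib. DFFML’s own loader may work differently, and load_entrypoint is a hypothetical helper written for this example.

```python
import importlib


def load_entrypoint(path):
    """Resolve a "module:ClassName" string to the object it names."""
    module_name, _, attribute = path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Demonstrated with a standard library class. With myslr.py in the current
# directory, "myslr:MySLRModel" would be resolved the same way by this sketch.
loaded = load_entrypoint("collections:OrderedDict")
print(loaded)  # prints: <class 'collections.OrderedDict'>
```

This is also why myslr.py needs to be importable, for instance by running the command from the directory that contains it.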
We do the same steps we did with Python, only using the command line interface.
$ dffml train \
    -log debug \
    -model myslr:MySLRModel \
    -model-features Years:int:1 \
    -model-predict Salary:float:1 \
    -model-location modeldir \
    -sources f=csv \
    -source-filename train.csv
There’s no output from the training command if everything went well.
Now let’s make predictions.
$ dffml predict all \
    -model myslr:MySLRModel \
    -model-features Years:int:1 \
    -model-predict Salary:float:1 \
    -model-location modeldir \
    -sources f=csv \
    -source-filename predict.csv
[
    {
        "extra": {},
        "features": {
            "Years": 8
        },
        "key": "0",
        "last_updated": "2020-05-24T22:48:11Z",
        "prediction": {
            "Salary": {
                "confidence": 1.0,
                "value": 110.0
            }
        }
    }
]
HTTP Server Usage¶
Your model will also be accessible through the HTTP API using a similar syntax.
First we need to install the HTTP service, which is the HTTP server that will serve our model. See the HTTP API docs for more information on the HTTP service.
$ python -m pip install -U dffml-service-http
We start the HTTP service and tell it that we want to make our model accessible via the HTTP Model API.
Warning
You should be sure to read the Security docs! This example of running the HTTP API is insecure and is only used to help you get up and running.
$ dffml service http server -insecure -cors '*' -addr 0.0.0.0 -port 8080 \
    -models mymodel=myslr:MySLRModel \
    -model-features Years:int:1 \
    -model-predict Salary:float:1 \
    -model-location modeldir
We can then ask the HTTP service to make predictions, or do training or accuracy assessment.
$ curl -f http://localhost:8080/model/mymodel/predict/0 \
    --header "Content-Type: application/json" \
    --data '{"0": {"features": {"Years": 8}}}'
{
    "iterkey": null,
    "records": {
        "0": {
            "key": "0",
            "features": {
                "Years": 8
            },
            "prediction": {
                "Salary": {
                    "confidence": 1.0,
                    "value": 110.0
                }
            },
            "last_updated": "2020-04-14T20:07:11Z",
            "extra": {}
        }
    }
}
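The same request can be made from Python with the standard library. The sketch below builds the equivalent of the curl command above, using the URL, header, and payload shown there. The helper names build_predict_request and send are made up for this example, and send is defined but not called because it needs the HTTP service to be running.

```python
import json
import urllib.request


def build_predict_request(base_url, model_label, records):
    """Build (but do not send) a POST request for the model's predict endpoint."""
    return urllib.request.Request(
        f"{base_url}/model/{model_label}/predict/0",
        data=json.dumps(records).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response (server must be running)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_predict_request(
    "http://localhost:8080", "mymodel", {"0": {"features": {"Years": 8}}}
)
print(request.get_method(), request.full_url)
```

With the service from this section running, send(request) should return a dictionary shaped like the JSON response shown above, so the predicted value would sit under ["records"]["0"]["prediction"]["Salary"]["value"].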