Quickstart
In this example we have employee data telling us the employee’s years of experience, our level of trust in them, their level of expertise, and their salary. Our goal will be to predict what the salary of a new hire should be, given their years of experience, our level of trust in them, and their level of expertise.
The model we’ll be using is part of dffml-model-scikit, a separate Python package from DFFML that we can install via pip.
$ pip install -U dffml-model-scikit
We will be using scikit’s linear regression model. Other available models can be found on the Models plugin page.
Example Dataset
We’ll be using the following simple dataset for this example.
| Years of Experience | Expertise | Trust Factor | Salary |
|---|---|---|---|
| 0 | 1 | 0.1 | 10 |
| 1 | 3 | 0.2 | 20 |
| 2 | 5 | 0.3 | 30 |
| 3 | 7 | 0.4 | 40 |
| 4 | 9 | 0.5 | 50 |
| 5 | 11 | 0.6 | 60 |
Rows 0-3 will be used as the training data, and 4-5 will be used as the test data. We’ll be asking for a prediction of the salary for the following.
| Years of Experience | Expertise | Trust Factor |
|---|---|---|
| 6 | 13 | 0.7 |
| 7 | 15 | 0.8 |
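The training rows lie on an exact straight line, which is why a linear regression fits them so well. The sketch below is plain Python and not part of DFFML. As an illustrative simplification it fits Salary against Years alone by ordinary least squares (Expertise and Trust rise in lockstep with Years in this dataset), then extrapolates to the two rows we want predictions for. The function name fit_line is made up for this example.

```python
# Illustrative only: a hand-rolled least-squares fit on a single feature.
# DFFML and scikit-learn are not used here.

years = [0, 1, 2, 3]
salary = [10, 20, 30, 40]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years, salary)

# Extrapolate to the two new hires from the table above
for new_years in (6, 7):
    print(new_years, slope * new_years + intercept)
```

The fitted line is Salary = 10 × Years + 10, so we should expect predictions of roughly 70 and 80 for the two new rows.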
Let’s create the training.csv, test.csv, and predict.csv files.
cat > training.csv << EOF
Years,Expertise,Trust,Salary
0,1,0.1,10
1,3,0.2,20
2,5,0.3,30
3,7,0.4,40
EOF
cat > test.csv << EOF
Years,Expertise,Trust,Salary
4,9,0.5,50
5,11,0.6,60
EOF
cat > predict.csv << EOF
Years,Expertise,Trust
6,13,0.7
7,15,0.8
EOF
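If a shell with heredoc support is not available, the same three files can be written from Python. This is a minimal sketch using only the standard library csv module; the function names write_csv and write_quickstart_csvs are made up for this example.

```python
import csv
from pathlib import Path

HEADER = ["Years", "Expertise", "Trust", "Salary"]
ROWS = [
    [0, 1, 0.1, 10],
    [1, 3, 0.2, 20],
    [2, 5, 0.3, 30],
    [3, 7, 0.4, 40],
    [4, 9, 0.5, 50],
    [5, 11, 0.6, 60],
]
PREDICT_ROWS = [[6, 13, 0.7], [7, 15, 0.8]]


def write_csv(path, header, rows):
    """Write a header line followed by the given rows."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def write_quickstart_csvs(directory="."):
    """Create training.csv, test.csv, and predict.csv in directory."""
    directory = Path(directory)
    # Rows 0-3 are training data, rows 4-5 are test data
    write_csv(directory / "training.csv", HEADER, ROWS[:4])
    write_csv(directory / "test.csv", HEADER, ROWS[4:])
    # The prediction file has no Salary column
    write_csv(directory / "predict.csv", HEADER[:3], PREDICT_ROWS)


if __name__ == "__main__":
    write_quickstart_csvs()
```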
Command Line
For each command, we specify which model we want to use, and what our data sources are. A detailed explanation of all the command line flags follows.
First we train the model. Our data source is the training.csv file.
dffml train \
    -model scikitlr \
    -model-features Years:int:1 Expertise:int:1 Trust:float:1 \
    -model-predict Salary:float:1 \
    -model-location tempdir \
    -sources f=csv \
    -source-filename training.csv
We then assess the model’s accuracy using the test data from test.csv.
dffml accuracy \
    -model scikitlr \
    -model-features Years:int:1 Expertise:int:1 Trust:float:1 \
    -model-predict Salary:float:1 \
    -model-location tempdir \
    -features Salary:float:1 \
    -sources f=csv \
    -source-filename test.csv \
    -scorer mse
The test and training data are very simple, so the model should report 100% accuracy.
1.0
Finally, we ask the model to make a prediction for each row in the predict.csv file.
dffml predict all \
    -model scikitlr \
    -model-features Years:int:1 Expertise:int:1 Trust:float:1 \
    -model-predict Salary:float:1 \
    -model-location tempdir \
    -sources f=csv \
    -source-filename predict.csv
DFFML outputs in JSON format since it’s very common and makes it easy to use the DFFML command line from other scripts or languages.
[
    {
        "extra": {},
        "features": {
            "Expertise": 13,
            "Trust": 0.7,
            "Years": 6
        },
        "key": "0",
        "last_updated": "2020-02-07T14:17:08Z",
        "prediction": {
            "Salary": {
                "confidence": 1.0,
                "value": 70.13972055888223
            }
        }
    },
    {
        "extra": {},
        "features": {
            "Expertise": 15,
            "Trust": 0.8,
            "Years": 7
        },
        "key": "1",
        "last_updated": "2020-02-07T14:17:08Z",
        "prediction": {
            "Salary": {
                "confidence": 1.0,
                "value": 80.15968063872255
            }
        }
    }
]
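Because the output is JSON, another program can consume it directly. The sketch below parses a shortened copy of the output above with Python’s json module and pulls out the predicted salary for each record. The helper name salaries_by_key is made up for this example.

```python
import json

# A shortened copy of the JSON shown above (only the fields we read).
SAMPLE_OUTPUT = """
[
    {
        "features": {"Expertise": 13, "Trust": 0.7, "Years": 6},
        "key": "0",
        "prediction": {"Salary": {"confidence": 1.0, "value": 70.13972055888223}}
    },
    {
        "features": {"Expertise": 15, "Trust": 0.8, "Years": 7},
        "key": "1",
        "prediction": {"Salary": {"confidence": 1.0, "value": 80.15968063872255}}
    }
]
"""


def salaries_by_key(raw_json):
    """Map each record's key to its predicted Salary value."""
    return {
        record["key"]: record["prediction"]["Salary"]["value"]
        for record in json.loads(raw_json)
    }


for key, value in salaries_by_key(SAMPLE_OUTPUT).items():
    print(key, round(value, 2))
```

To work with live output instead of the sample string, the text captured from the dffml predict all command (for example via subprocess.run with capture_output=True) could be passed to the same function.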
The "confidence" value is determined by the underlying model implementation. The scikit linear regression model just reports whatever the accuracy was on the test dataset as the confidence.
Command Line Flags Explained
-model scikitlr
Use the linear regression model from the dffml-model-scikit package. More options for the model to use can be found on the Models plugin page.
-model-features Years:int:1 Expertise:int:1 Trust:float:1
The features the model should learn from. We specify the following attributes for each feature.
Name: as it appears in our data source (the column header in the training.csv file).
Data type: Years and Expertise are whole numbers (integer values), so we say int. Trust is a percentage, measured from 0.0 being 0% to 1.0 being 100%. For numbers with decimal places we say float.
Dimensions of data: all our features are a single value, so we say 1. If we had a feature which was ten values, maybe some kind of time series data, we’d say 10. If we had a feature which was a flattened 28 by 28 image, we’d say 784.
-model-predict Salary:float:1
The feature the model is trying to learn how to predict. We specify the same details as we did with the features to learn from.
-scorer mse
Report the model’s accuracy using the Mean Squared Error accuracy scorer. See the Scorers plugin page for all accuracy scorers.
-sources f=csv
The data sources to use as training data for the model.
We can specify multiple sources here in the following fashion.
-sources one=csv two=csv three=json
Sources are tagged, which just means that since there can be multiple, you need to specify a tag to reference each source by when configuring it. On the left side of the = we put the tag; on the right we put the plugin name of the source. We’re using the source that reads .csv files, so we specify csv. The list of sources can be found on the Sources plugin page.
We specify f as the tag. We could’ve used anything, but f will do fine (f was chosen because it’s a file).
-source-filename $stage.csv
The filename to be used for the source. On the Sources plugin page you can see all of the source plugins and their possible arguments.
With regard to tagged sources, we could have also specified the filename using -source-f-filename $stage.csv, since we tagged it as f. If you have multiple sources, you need to specify the arguments to each this way.
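To make the Name:type:dimension and tag=plugin notations concrete, here is a small sketch that splits such arguments the way the descriptions above read. It is not DFFML’s own parser: parse_feature, parse_source, FeatureSpec, and the DTYPES mapping are hypothetical, written only for this example, and the Image feature name is invented.

```python
from typing import NamedTuple, Tuple

# Type names used in this quickstart mapped to Python types.
# Assumed mapping for illustration; DFFML may accept more.
DTYPES = {"int": int, "float": float}


class FeatureSpec(NamedTuple):
    name: str
    dtype: type
    length: int


def parse_feature(spec: str) -> FeatureSpec:
    """Split 'Name:type:dimension' into its three parts."""
    name, dtype_name, length = spec.split(":")
    return FeatureSpec(name, DTYPES[dtype_name], int(length))


def parse_source(arg: str) -> Tuple[str, str]:
    """Split 'tag=plugin' into (tag, plugin)."""
    tag, plugin = arg.split("=", 1)
    return tag, plugin


for spec in ["Years:int:1", "Expertise:int:1", "Trust:float:1"]:
    print(parse_feature(spec))

# Several tagged sources, as in "-sources one=csv two=csv three=json"
print(dict(parse_source(arg) for arg in ["one=csv", "two=csv", "three=json"]))

# A flattened 28 by 28 image has 28 * 28 values
print(parse_feature(f"Image:int:{28 * 28}").length)
```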
Python
If we wanted to do everything within Python, our file might look like this.
from dffml import Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_scikit import LinearRegressionModel
from dffml.accuracy import MeanSquaredErrorAccuracy

model = LinearRegressionModel(
    features=Features(
        Feature("Years", int, 1),
        Feature("Expertise", int, 1),
        Feature("Trust", float, 1),
    ),
    predict=Feature("Salary", int, 1),
    location="tempdir",
)

# Train the model
train(
    model,
    {"Years": 0, "Expertise": 1, "Trust": 0.1, "Salary": 10},
    {"Years": 1, "Expertise": 3, "Trust": 0.2, "Salary": 20},
    {"Years": 2, "Expertise": 5, "Trust": 0.3, "Salary": 30},
    {"Years": 3, "Expertise": 7, "Trust": 0.4, "Salary": 40},
)

# Assess accuracy
scorer = MeanSquaredErrorAccuracy()
print(
    "Accuracy:",
    score(
        model,
        scorer,
        Feature("Salary", int, 1),
        {"Years": 4, "Expertise": 9, "Trust": 0.5, "Salary": 50},
        {"Years": 5, "Expertise": 11, "Trust": 0.6, "Salary": 60},
    ),
)

# Make prediction
for i, features, prediction in predict(
    model,
    {"Years": 6, "Expertise": 13, "Trust": 0.7},
    {"Years": 7, "Expertise": 15, "Trust": 0.8},
):
    features["Salary"] = prediction["Salary"]["value"]
    print(features)
The output should be as follows
Accuracy: 1.0
{'Years': 6, 'Expertise': 13, 'Trust': 0.7, 'Salary': 70.0}
{'Years': 7, 'Expertise': 15, 'Trust': 0.8, 'Salary': 80.0}
Check out the plugin docs for Models for usage of other models. The API docs may also be useful.
Data Sources
DFFML makes it easy to pull data from various sources. All we have to do is supply the filenames in place of the data.
from dffml import CSVSource, Features, Feature
from dffml.noasync import train, score, predict
from dffml_model_scikit import LinearRegressionModel
from dffml.accuracy import MeanSquaredErrorAccuracy

model = LinearRegressionModel(
    features=Features(
        Feature("Years", int, 1),
        Feature("Expertise", int, 1),
        Feature("Trust", float, 1),
    ),
    predict=Feature("Salary", int, 1),
    location="tempdir",
)

# Train the model
train(model, "training.csv")

# Assess accuracy (alternate way of specifying data source)
scorer = MeanSquaredErrorAccuracy()
print(
    "Accuracy:",
    score(
        model,
        scorer,
        Feature("Salary", int, 1),
        CSVSource(filename="test.csv"),
    ),
)

# Make prediction
for i, features, prediction in predict(
    model,
    {"Years": 6, "Expertise": 13, "Trust": 0.7},
    {"Years": 7, "Expertise": 15, "Trust": 0.8},
):
    features["Salary"] = prediction["Salary"]["value"]
    print(features)
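If you want to inspect or adjust the records in Python before handing them to predict, one way to get the rows of predict.csv into the same dictionary shape is csv.DictReader from the standard library. This is a sketch, not a DFFML feature: load_records and COLUMN_TYPES are hypothetical names, and the column types are converted by hand because the csv module reads every value as a string.

```python
import csv
import tempfile
from pathlib import Path

# Column name -> converter, matching the features declared on the model
COLUMN_TYPES = {"Years": int, "Expertise": int, "Trust": float}


def load_records(path):
    """Read a CSV file into a list of dicts with typed values."""
    with open(path, newline="") as handle:
        return [
            {name: COLUMN_TYPES[name](value) for name, value in row.items()}
            for row in csv.DictReader(handle)
        ]


# Demonstrate on a temporary copy of predict.csv so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tempdir:
    path = Path(tempdir) / "predict.csv"
    path.write_text("Years,Expertise,Trust\n6,13,0.7\n7,15,0.8\n")
    records = load_records(path)

print(records)
```

The resulting dictionaries have the same shape as the ones passed to predict above, so they could be unpacked into that call as positional arguments.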
Async
You may have noticed we’re importing from dffml.noasync. If you’re using asyncio then you can just import from dffml.
import asyncio

from dffml import train, score, predict, Features, Feature
from dffml.accuracy import MeanSquaredErrorAccuracy
from dffml_model_scikit import LinearRegressionModel


async def main():
    model = LinearRegressionModel(
        features=Features(
            Feature("Years", int, 1),
            Feature("Expertise", int, 1),
            Feature("Trust", float, 1),
        ),
        predict=Feature("Salary", int, 1),
        location="tempdir",
    )

    # Train the model
    await train(
        model,
        {"Years": 0, "Expertise": 1, "Trust": 0.1, "Salary": 10},
        {"Years": 1, "Expertise": 3, "Trust": 0.2, "Salary": 20},
        {"Years": 2, "Expertise": 5, "Trust": 0.3, "Salary": 30},
        {"Years": 3, "Expertise": 7, "Trust": 0.4, "Salary": 40},
    )

    # Assess accuracy
    scorer = MeanSquaredErrorAccuracy()
    print(
        "Accuracy:",
        await score(
            model,
            scorer,
            Feature("Salary", int, 1),
            {"Years": 4, "Expertise": 9, "Trust": 0.5, "Salary": 50},
            {"Years": 5, "Expertise": 11, "Trust": 0.6, "Salary": 60},
        ),
    )

    # Make prediction
    async for i, features, prediction in predict(
        model,
        {"Years": 6, "Expertise": 13, "Trust": 0.7},
        {"Years": 7, "Expertise": 15, "Trust": 0.8},
    ):
        features["Salary"] = prediction["Salary"]["value"]
        print(features)


asyncio.run(main())
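The prediction loop uses async for because, in the async API, results are yielded one at a time. The sketch below shows the same consumption pattern with a stand-in async generator. fake_predict is hypothetical: it uses a hard-coded rule instead of a trained model and only mimics the (key, features, prediction) tuples seen above.

```python
import asyncio


async def fake_predict(*records):
    """Stand-in for an async prediction API: yields one result per record."""
    for key, features in enumerate(records):
        await asyncio.sleep(0)  # Give other tasks a chance to run
        # Hard-coded rule for illustration, not a trained model
        value = 10 * features["Years"] + 10
        yield key, features, {"Salary": {"confidence": 1.0, "value": value}}


async def main():
    results = []
    # An async generator must be consumed with "async for" inside a coroutine
    async for key, features, prediction in fake_predict(
        {"Years": 6, "Expertise": 13, "Trust": 0.7},
        {"Years": 7, "Expertise": 15, "Trust": 0.8},
    ):
        features["Salary"] = prediction["Salary"]["value"]
        results.append(features)
    return results


print(asyncio.run(main()))
```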
HTTP
We can also deploy our trained model behind an HTTP server.
First we need to install the HTTP service, the HTTP server that will serve our model. See the HTTP API docs for more information on the HTTP service.
$ pip install -U dffml-service-http
We start the HTTP service and tell it that we want to make our model accessible via the HTTP Model API.
Warning
You should be sure to read the Security docs! This example of running the HTTP API is insecure and is only used to help you get up and running.
dffml service http server -insecure -cors '*' -addr 0.0.0.0 -port 8080 \
    -models mymodel=scikitlr \
    -model-features Years:int:1 Expertise:int:1 Trust:float:1 \
    -model-predict Salary:float:1 \
    -model-location tempdir
We can then ask the HTTP service to make predictions, or do training or accuracy assessment.
curl http://localhost:8080/model/mymodel/predict/0 \
    --header "Content-Type: application/json" \
    --data '{"0": {"features": {"Expertise": 17, "Trust": 0.9, "Years": 8}}}' \
    | python -m json.tool
You should see the following prediction
{
    "iterkey": null,
    "records": {
        "0": {
            "key": "0",
            "features": {
                "Expertise": 17,
                "Trust": 0.9,
                "Years": 8
            },
            "prediction": {
                "Salary": {
                    "confidence": 1.0,
                    "value": 90.00000000000001
                }
            },
            "last_updated": "2020-04-14T20:07:11Z",
            "extra": {}
        }
    }
}
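The same request can be sent from Python. This sketch uses only urllib from the standard library and assumes the server from the command above is listening on localhost:8080 with the model tagged mymodel. The names build_predict_request and predict_over_http are invented for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # Assumed address of the server started above


def build_predict_request(model_tag, records):
    """Build the POST request that the curl command above sends."""
    return urllib.request.Request(
        f"{BASE_URL}/model/{model_tag}/predict/0",
        data=json.dumps(records).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_over_http(model_tag, records):
    """Send the request and return the decoded JSON response.

    Requires the HTTP service to be running; not called in this sketch.
    """
    request = build_predict_request(model_tag, records)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


records = {"0": {"features": {"Expertise": 17, "Trust": 0.9, "Years": 8}}}
request = build_predict_request("mymodel", records)
print(request.get_method(), request.full_url)
```

With the service running, calling predict_over_http("mymodel", records) should return a dictionary shaped like the JSON response above.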