Automating Classification¶
This example will show you how to automate a manual classification process, determining if a Git repo is maintained or abandoned. We’ll be integrating Machine Learning into an existing application.
For this example assume you are a very curious open source software person. You love looking at git repos in your free time. Every time you come across a Git repo, you dig around, look at the number of committers, number of commits, etc. and come up with a maintenance status for the repo (maintained or unmaintained).
To help you track your assigned statuses, you built a Python 3 CGI web app.
When you assign a maintenance status to a repo you are classifying it. Instead of manually classifying each repo, we’re going to put some machine learning behind this little web app. The steps we’re going to follow here are generic; they don’t apply specifically to this Python CGI setup. It was chosen as the example because it should hopefully be an easy setup.
CGI Application¶
The CGI application we’ll be integrating with consists of four files. You should create these files and populate them with the code you’re about to see.
The webpage index.html
The webpage’s JavaScript main.js
The webpage’s CSS theme.css
The backend api cgi-bin/api.py
First create the webpage
index.html
<!DOCTYPE html>
<html>
<head>
<title>Git Repo Maintenance Tracker</title>
<script type="text/javascript" src="main.js"></script>
<link rel="stylesheet" type="text/css" href="theme.css">
</head>
<body>
<h1>Git Repo Maintenance Tracker</h1>
<hr />
<div>
<p>Submit or change the maintenance status of a repo.</p>
<input id="URL" placeholder="git://server.git/repo"></input>
<button id="maintained" class="good">Maintained</button>
<button id="unmaintained" class="dangerous">Unmaintained</button>
</div>
<br />
<div>
<table id="table" style="width:100%">
</table>
</div>
</body>
</html>
Then the JavaScript, which will interact with the CGI API.
main.js
function populate_table(tableDOM, URLs) {
var row;
var col;
// Clear the table
tableDOM.innerHTML = '';
// Headers
row = document.createElement('tr');
tableDOM.appendChild(row)
// URL
col = document.createElement('td');
col.innerText = 'URL';
row.appendChild(col)
// Status
col = document.createElement('td');
col.innerText = 'Maintained?';
row.appendChild(col)
// Create body
for (var URL in URLs) {
// Convert from int to string
if (URLs[URL]) {
URLs[URL] = 'Yes';
} else {
URLs[URL] = 'No';
}
row = document.createElement('tr');
tableDOM.appendChild(row);
// URL
col = document.createElement('td');
col.innerText = URL;
row.appendChild(col)
// Status
col = document.createElement('td');
col.innerText = URLs[URL];
row.appendChild(col)
}
}
function refreshTable(tableDOM) {
return fetch('cgi-bin/api.py?action=dump')
.then(function(response) {
return response.json()
})
.then(function(URLs) {
populate_table(tableDOM, URLs);
});
}
function setMaintenance(URL, maintained) {
return fetch('cgi-bin/api.py?action=set' +
'&maintained=' + Number(maintained) +
'&URL=' + URL)
.then(function(response) {
return response.json()
});
}
window.addEventListener('DOMContentLoaded', function(event) {
var tableDOM = document.getElementById('table');
var URLDOM = document.getElementById('URL');
var maintainedDOM = document.getElementById('maintained');
var unmaintainedDOM = document.getElementById('unmaintained');
maintainedDOM.addEventListener('click', function(event) {
setMaintenance(URLDOM.value, true)
.then(function() {
refreshTable(tableDOM);
});
});
unmaintainedDOM.addEventListener('click', function(event) {
setMaintenance(URLDOM.value, false)
.then(function() {
refreshTable(tableDOM);
});
});
refreshTable(tableDOM);
});
Then the style sheet, which gives the page its colors and styling.
theme.css
.good {
background-color: lightgreen;
}
.dangerous {
background-color: red;
}
table, th, td {
border: 1px solid black;
border-collapse: collapse;
}
Now create the cgi-bin directory, which is where our server-side Python script will run.
$ mkdir cgi-bin
Then create the backend API script. This script connects to the database and provides two actions.
Classify a repo by setting its status to maintained or unmaintained
Dump all classified repos along with their maintenance statuses
cgi-bin/api.py
#!/usr/bin/env python3
import os
import sys
import json
import urllib.parse
import mysql.connector
print("Content-Type: application/json")
print()
query = dict(urllib.parse.parse_qsl(os.getenv("QUERY_STRING", default="")))
action = query.get("action", None)
if action is None:
print(json.dumps({"error": "Missing 'action' query parameter"}))
sys.exit(1)
cnx = mysql.connector.connect(
user="user", passwd="pass", database="db", port=3306,
)
cursor = cnx.cursor()
if action == "dump":
cursor.execute("SELECT `key`, `maintained` FROM `status`")
print(json.dumps(dict(cursor)))
elif action == "set":
cursor.execute(
"REPLACE INTO status (`key`, `maintained`) VALUES(%s, %s)",
(query["URL"], query["maintained"],),
)
cnx.commit()
print(json.dumps({"success": True}))
else:
print(json.dumps({"error": "Unknown action"}))
sys.stdout.flush()
cursor.close()
cnx.close()
We have to make the cgi-bin directory and API file executable so that the CGI server can run it.
$ chmod 755 cgi-bin cgi-bin/api.py
Setup¶
We’ll be using Python, and docker or podman. If you don’t have docker installed, but you do have podman installed, you can replace “docker” with “podman” in all of the following commands.
Create a virtual environment where we’ll install all our Python packages. Make sure to update pip in case it’s old, and install setuptools and wheel so that we can install the MySQL package.
$ python -m venv .venv
$ . .venv/bin/activate
$ python -m pip install -U pip setuptools wheel
Download the Python client libraries for MySQL.
$ python -m pip install -U \
https://dev.mysql.com/get/Downloads/Connector-Python/mysql-connector-python-8.0.21.tar.gz
Start MariaDB (functionally very similar to MySQL, of which it is a fork).
$ docker run --rm -d --name maintained_db \
-e MYSQL_RANDOM_ROOT_PASSWORD=yes \
-e MYSQL_USER=user \
-e MYSQL_PASSWORD=pass \
-e MYSQL_DATABASE=db \
-p 3306:3306 \
mariadb:10
Wait for the database to start. Run the following command until you see ready for connections twice in the output.
$ docker logs maintained_db 2>&1 | grep 'ready for'
2020-01-13 21:31:09 0 [Note] mysqld: ready for connections.
2020-01-13 21:32:16 0 [Note] mysqld: ready for connections.
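If you want to confirm that the Python client can reach the database before moving on, the following is a quick sanity-check sketch (not part of the app) that reuses the same credentials api.py connects with:
import mysql.connector
# Connect with the same user/password/database the CGI script and the
# MariaDB container were configured with
cnx = mysql.connector.connect(
    user="user", passwd="pass", database="db", port=3306,
)
# is_connected() returns True while the connection to the server is alive
print("connected:", cnx.is_connected())
cnx.close()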
Instead of having you go classify a bunch of repos manually, we’re going to assign a bunch of repos a random maintenance status. This will of course produce a meaningless model. If you want a model that’s accurate you should go classify repos for real.
The randomly assigned maintenance status and URL for the repo will be stored in the database. We need to install the dffml-source-mysql plugin to use MariaDB/MySQL with DFFML.
$ python -m pip install -U dffml-source-mysql
To get our dummy data, we’ll be using the GitHub v4 API to search for “todo”. The search should return repos implementing a TODO app.
You’ll need a Personal Access Token to be able to make calls to GitHub’s API. You can create one by following their documentation.
When it presents you with a bunch of checkboxes for different “scopes”, you don’t have to check any of them.
$ export GITHUB_TOKEN=<paste your personal access token here>
You’ve just pasted your token into your terminal so it will likely show up in your shell’s history. You might want to either remove it from your history, or just delete the token on GitHub’s settings page after you’re done with this tutorial.
Now we’ll write a function to pull the first 10 repo URLs that result from our “todo” search. The type annotations are important here; make sure you copy them exactly. 0 means unmaintained, 1 means maintained.
github_search.py
import json
import random
import urllib.request
from typing import Dict, Any
GITHUB_API_URL: str = "https://api.github.com/graphql"
GITHUB_SEARCH: str = json.dumps(
{
"query": """
query {
search(query: "%s", type: REPOSITORY, first: 10) {
nodes {
... on Repository {
url
}
}
}
}
"""
}
)
def get_repos(query: str, github_token: str) -> dict:
# Repos organized by their URL with their value being their feature data
repos: Dict[str, Dict[str, Any]] = {}
# Make the request to the GitHub API
with urllib.request.urlopen(
urllib.request.Request(
GITHUB_API_URL, headers={"Authorization": f"Bearer {github_token}"}
),
data=(GITHUB_SEARCH % (query,)).encode(),
) as response:
# Loop through the 10 repos that were returned
for node in json.load(response)["data"]["search"]["nodes"]:
# Randomly assign a maintenance status for demo purposes
repos[node["url"]] = {
"features": {"maintained": random.choice([0, 1])}
}
return repos
We’ll use this function as a Source. In DFFML a source is somewhere a dataset is stored. In this case the source is dynamic, it’s pulling from an API and randomly assigning a classification. You could even modify it to ask the user for their manual classification instead of randomly assigning one.
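For instance, a hypothetical interactive variant could swap random.choice for a prompt. The prompt_maintained helper below is not part of the tutorial, just a sketch of how that modification might look:
def prompt_maintained(url: str) -> int:
    # Ask the user to classify the repo: 1 for maintained, 0 for unmaintained
    answer = ""
    while answer not in ("y", "n"):
        answer = input(f"Is {url} maintained? [y/n] ").strip().lower()
    return 1 if answer == "y" else 0

# Inside get_repos() you would then build each record with:
#     repos[node["url"]] = {"features": {"maintained": prompt_maintained(node["url"])}}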
We use the merge command to take data from one source and merge it with data in another source. In our case we’ll be taking records from the get_repos function within github_search.py and moving them into our database. The op source allows us to use any OperationImplementation (a Python function) as a source of records.
We’ll be using the dffml merge command to take the search results from the GitHub API and put them, along with their randomly assigned statuses, into our database.
$ dffml merge github=op db=mysql \
-source-github-opimp github_search:get_repos \
-source-github-args todo $GITHUB_TOKEN \
-source-db-insecure \
-source-db-user user \
-source-db-password pass \
-source-db-db db \
-source-db-key key \
-source-db-init \
'CREATE TABLE IF NOT EXISTS `status` (
`key` varchar(767) NOT NULL,
`maintained` TINYINT,
PRIMARY KEY (`key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;' \
-source-db-records \
'SELECT `key`, `maintained` FROM `status`' \
-source-db-record \
'SELECT `key`, `maintained` FROM `status` WHERE `key`=%s' \
-source-db-update \
'INSERT INTO `status` (`key`, `maintained`) VALUES(%s, %s) ON DUPLICATE KEY UPDATE `maintained`=%s' \
-source-db-features '{"maintained": "maintained"}' \
-source-db-predictions '{}'
Now that the database is running and our data is in it, we’re ready to start the CGI server. You’ll want to do this in another terminal. The last argument is the port to serve on; change that number if you already have something running on that port.
$ . .venv/bin/activate
$ python3 -m http.server --cgi 8000
You can see all the records and their statuses that we imported into the database by calling the API.
$ curl -v 'http://127.0.0.1:8000/cgi-bin/api.py?action=dump' | \
python3 -m json.tool
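The exact contents depend on which repos the GitHub search returned and on the random statuses, but the response is a JSON object mapping each repo URL to its 0 or 1 status, along the lines of this made-up example:
{
    "https://github.com/example/todo-app": 1,
    "https://github.com/example/another-todo": 0
}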
Gathering Data¶
We now have our dataset of git repos and their maintenance statuses in our web app. If you go to http://127.0.0.1:8000/ you should see it.
Before we can train a model on this dataset, we need to transform it into something that can be fed to a model. Right now all we have is URLs to git repos.
We’re going to run these repos through DFFML’s data flow orchestrator. We’ll tell the orchestrator to run certain Operations on each of the URLs. Those Operations will scrape data that is representative of the repo’s maintenance; we can then feed that data into a machine learning model.
Note
This example is centered around machine learning. We’ll be using DataFlows to collect a dataset for training, testing, and prediction. For a deeper understanding of Operations and DataFlows, see the Meta Static Analysis.
The operations we’re going to use are a part of dffml-feature-git, which is a separate Python package from DFFML that we can install via pip. We’ll also use the yaml configloader, since that creates more user-friendly configs than json.
$ python -m pip install -U dffml-feature-git dffml-config-yaml
The git operations / features rely on tokei. We need to download and install it first. Run the command for your platform; the first is for Linux, the second for macOS.
$ curl -sSL 'https://github.com/XAMPPRocky/tokei/releases/download/v10.1.1/tokei-v10.1.1-x86_64-unknown-linux-gnu.tar.gz' \
| tar -xvz -C .venv/bin/
$ curl -sSL 'https://github.com/XAMPPRocky/tokei/releases/download/v10.1.1/tokei-v10.1.1-x86_64-apple-darwin.tar.gz' \
| tar -xvz -C .venv/bin/
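The tarball drops a tokei binary into .venv/bin, which is already on your PATH while the virtual environment is active. A quick way to check the install worked is to ask tokei for its version; it should print something like tokei 10.1.1.
$ tokei --version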
Operations are just Python functions or classes. They define a routine which will be run concurrently with other operations. Here’s an example of the git_commits operation, which will find the number of commits within a date range. This function / operation was installed when you installed the dffml-feature-git package via pip.
@op(
    inputs={
        "repo": git_repository,
        "branch": git_branch,
        "start_end": date_pair,
    },
    outputs={"commits": commit_count},
)
async def git_commits(repo: Dict[str, str], branch: str, start_end: List[str]):
    start, end = start_end
    commit_count = 0
    proc = await create(
        "git",
        "log",
        "--oneline",
        "--before",
        start,
        "--after",
        end,
        branch,
        cwd=repo.directory,
    )
    while not proc.stdout.at_eof():
        line = await proc.stdout.readline()
        if line != b"":
            commit_count += 1
    await stop(proc)
    return {"commits": commit_count}
We’re going to use the operations provided in dffml-feature-git to gather our dataset. The following command creates a DataFlow description of how all the operations within dffml-feature-git link together. The DataFlow is stored in the YAML file dataflow.yaml.
$ dffml dataflow create \
-configloader yaml \
-inputs \
10=quarters \
true=no_git_branch_given \
'{"authors": {"group": "author_count", "by": "quarter"},
"commits": {"group": "commit_count", "by": "quarter"},
"work": {"group": "work_spread", "by": "quarter"}}'=group_by_spec \
-- \
group_by \
make_quarters \
quarters_back_to_date \
check_if_valid_git_repository_URL \
clone_git_repo \
git_repo_default_branch \
git_repo_commit_from_date \
git_repo_author_lines_for_dates \
work \
git_commits \
count_authors \
cleanup_git_repo \
| tee dataflow.yaml \
| tail -n 15
seed:
- definition: quarters
value: 10
- definition: group_by_spec
value:
authors:
by: quarter
group: author_count
commits:
by: quarter
group: commit_count
work:
by: quarter
group: work_spread
Since operations are run concurrently with each other, DFFML manages locking of input data, such as git repositories. This is done via Definitions, which are the variables used as values in the input and output dictionaries. We link together all of the operations in a DataFlow. The pink boxes in the diagram below are the inputs to the network. The purple boxes are the different operations. Arrows show how data moves between operations.
We can also visualize how the individual inputs and outputs are linked together. Inputs and outputs of the same Definition can be linked together automatically.
The inputs and outputs of operations within a running DataFlow are organized by contexts. The context for our dataset generation will be the source URL to the Git repo.
We will be providing the source URL to the git repo on a per-repo basis.
We provide the start date of the zeroth quarter, and 10 instances of quarter, since for each operation every possible permutation of inputs will be run. quarters_back_to_date is going to take the start date and the quarter, and produce a date range for that quarter. We use make_quarters and pass 10=quarters to make ten instances of quarter.
We’ll also need to provide an input to the output operation group_by_spec. Output operations decide what generated data should be used as feature data, and present it in a usable format.
Here we’re telling the group_by operation to create feature data where the author_count, work_spread and commit_count are grouped by the quarter they were generated for.
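Once the DataFlow has run for a repo, that grouped output becomes the repo’s feature data: one value per quarter for each of authors, commits, and work. Conceptually, the record generated for a single repo looks something like the following sketch (the numbers here are taken from the prediction output for the DFFML repo shown later in this example):
{
    "features": {
        "authors": [9, 16, 20, 14, 10, 4, 5, 0, 0, 0],
        "commits": [110, 273, 252, 105, 65, 64, 51, 0, 0, 0],
        "work": [75, 82, 73, 34, 56, 3, 5, 0, 0, 0]
    }
}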
Training our Model¶
The model we’ll be using is a part of dffml-model-tensorflow, which is another separate Python package from DFFML that we can install via pip.
$ python -m pip install -U dffml-model-tensorflow
The model is a generic wrapper around Tensorflow’s DNN estimator. We can use it to train on our dataset.
We’re going to put the training command in its own file, since it’s very long. We use the DataFlowPreprocessSource to run the DataFlow we created with the above dffml dataflow create command on each repo. When we run the DataFlow we pass it the current date and tell it to use the record’s key as the repo URL (since that’s what the key is).
train.sh
#!/usr/bin/env bash
# echo command back and exit on error
set -xe
dffml train all \
-model tfdnnc \
-model-epochs 400 \
-model-steps 4000 \
-model-predict maintained:int:1 \
-model-classifications 0 1 \
-model-location modeldir \
-model-features \
authors:int:10 \
commits:int:10 \
work:int:10 \
-sources preprocess=dfpreprocess \
-source-preprocess-dataflow dataflow.yaml \
-source-preprocess-no_strict \
-source-preprocess-record_def URL \
-source-preprocess-inputs "$(date +'%Y-%m-%d %H:%M')=quarter_start_date" \
-source-preprocess-features maintained:int:1 \
-source-preprocess-source mysql \
-source-preprocess-source-insecure \
-source-preprocess-source-user user \
-source-preprocess-source-password pass \
-source-preprocess-source-db db \
-source-preprocess-source-key key \
-source-preprocess-source-records \
'SELECT `key`, `maintained` FROM `status`' \
-source-preprocess-source-record \
'SELECT `key`, `maintained` FROM `status` WHERE `key`=%s' \
-source-preprocess-source-update \
'INSERT INTO `status` (`key`, `maintained`) VALUES(%s, %s) ON DUPLICATE KEY UPDATE `maintained`=%s' \
-source-preprocess-source-features '{"maintained": "maintained"}' \
-source-preprocess-source-predictions '{}' \
-log debug
Run train.sh to train the model. The speed of the following command depends on your internet connection; it may take 2 minutes, it may take more. All the git repos in the database will be downloaded, which will also take up space in /tmp. They will be cleaned up automatically.
$ bash train.sh
Making a Prediction¶
Run the operations on the new repo, https://github.com/intel/dffml, and have the model make a prediction.
We’re going to put the prediction command in its own file, since it’s very long.
predict.sh
#!/usr/bin/env bash
# echo command back and exit on error
set -xe
# Usage: ./predict.sh https://github.com/org/repo
URL="${1}"
dffml predict record \
-update \
-log debug \
-keys "${URL}" \
-model tfdnnc \
-model-predict maintained:int:1 \
-model-classifications 0 1 \
-model-location modeldir \
-model-features \
authors:int:10 \
commits:int:10 \
work:int:10 \
-sources preprocess=dfpreprocess \
-source-preprocess-dataflow dataflow.yaml \
-source-preprocess-record_def URL \
-source-preprocess-inputs "$(date +'%Y-%m-%d %H:%M')=quarter_start_date" \
-source-preprocess-source mysql \
-source-preprocess-source-insecure \
-source-preprocess-source-user user \
-source-preprocess-source-password pass \
-source-preprocess-source-db db \
-source-preprocess-source-key key \
-source-preprocess-source-records \
'SELECT `key`, `maintained` FROM `status`' \
-source-preprocess-source-record \
'SELECT `key`, `maintained` FROM `status` WHERE `key`=%s' \
-source-preprocess-source-update \
'INSERT INTO `status` (`key`, `maintained`) VALUES(%s, %s) ON DUPLICATE KEY UPDATE `maintained`=%s' \
-source-preprocess-source-features '{}' \
-source-preprocess-source-predictions '{"maintained": ("maintained", None)}'
Run predict.sh to make a prediction using the model. We’re asking the model to make a prediction on the DFFML repo.
$ bash predict.sh https://github.com/intel/dffml
[
{
"extra": {},
"features": {
"authors": [
9,
16,
20,
14,
10,
4,
5,
0,
0,
0
],
"commits": [
110,
273,
252,
105,
65,
64,
51,
0,
0,
0
],
"work": [
75,
82,
73,
34,
56,
3,
5,
0,
0,
0
]
},
"key": "https://github.com/intel/dffml",
"last_updated": "2020-10-12T21:15:13Z",
"prediction": {
"maintained": {
"confidence": 0.9999271631240845,
"value": "1"
}
}
}
]
Modifying the Legacy App¶
Now let’s add a ‘Predict’ button to the app. The button will trigger the operations to be run on the new URL, and then the prediction. The demo app will then take the predicted classification and record that as the classification in the database.
The predict button will trigger an HTTP call to the CGI API. The CGI script will run the same commands we just ran on the command line, and parse the output, which is in JSON format, and set the resulting prediction classification in the database.
We modify the backend CGI Python API to have it call our predict.sh script.
cgi-bin/api.py
--- /home/runner/work/dffml/dffml/examples/maintained/cgi-bin/api.py
+++ /home/runner/work/dffml/dffml/examples/maintained/cgi-bin/api-ml.py
@@ -2,6 +2,7 @@
import os
import sys
import json
+import subprocess
import urllib.parse
import mysql.connector
@@ -31,6 +32,14 @@
)
cnx.commit()
print(json.dumps({"success": True}))
+elif action == "predict":
+ # Set current working directory (cwd) to the parent directory of cgi-bin
+ print(
+ subprocess.check_output(
+ ["bash", "predict.sh", query["URL"]],
+ cwd=os.path.join(os.path.dirname(__file__), ".."),
+ ).decode()
+ )
else:
print(json.dumps({"error": "Unknown action"}))
Test the new prediction capabilities from the command line with curl
$ curl -v 'http://127.0.0.1:8000/cgi-bin/api.py?action=predict&URL=https://github.com/intel/dffml' | \
python -m json.tool
Hook up the predict button to call our new API by adding an event listener when the Predict button is clicked.
ml.js
function predict(URL) {
return fetch('cgi-bin/api.py?action=predict&URL=' + URL)
.then(function(response) {
return response.json()
});
}
window.addEventListener('DOMContentLoaded', function(event) {
var tableDOM = document.getElementById('table');
var URLDOM = document.getElementById('URL');
var predictDOM = document.getElementById('predict');
predictDOM.addEventListener('click', function(event) {
predict(URLDOM.value)
.then(function() {
refreshTable(tableDOM);
});
});
});
We need to import the new script into the main page and add the HTML for the predict button.
index.html
--- /home/runner/work/dffml/dffml/examples/maintained/index.html
+++ /home/runner/work/dffml/dffml/examples/maintained/ml.html
@@ -3,6 +3,7 @@
<head>
<title>Git Repo Maintenance Tracker</title>
<script type="text/javascript" src="main.js"></script>
+ <script type="text/javascript" src="ml.js"></script>
<link rel="stylesheet" type="text/css" href="theme.css">
</head>
<body>
@@ -15,6 +16,7 @@
<input id="URL" placeholder="git://server.git/repo"></input>
<button id="maintained" class="good">Maintained</button>
<button id="unmaintained" class="dangerous">Unmaintained</button>
+ <button id="predict">Predict</button>
</div>
<br />
Visit the web app and enter a Git repo URL to evaluate into the input field, then click Predict. Look through the table and you’ll find that a prediction has been run on the repo. Congratulations! You’ve automated a manual classification process, and integrated machine learning into a legacy application.