InnerSource Microservice¶
We created a crawler in the previous tutorial, InnerSource Portal. Now we are going to deploy a microservice which evaluates a single repo at a time using the same DataFlow we used for the crawler.
Our end result will be a container which serves a JSON API endpoint. We can send a request to the endpoint to evaluate metrics for an InnerSource repo.
Config Files¶
As we’ve seen before, DataFlows can be serialized to config files. JSON representations of DataFlows are not fun to hand edit. YAML looks a lot cleaner.
We’re going to install the dffml-config-yaml package so that we don’t have to look at JSON.
$ python -m pip install dffml-config-yaml
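To see the difference, here’s a quick illustration (not part of the tutorial) that dumps the same small mapping first as JSON and then as YAML. It assumes PyYAML is importable; install it with pip if it isn’t already.

import json
import yaml

# A small config-style mapping, dumped both ways for comparison
config = {"path": "/metrics", "output_mode": "json", "asynchronous": False}

# JSON: quotes and braces everywhere
print(json.dumps(config, indent=4, sort_keys=True))
# YAML: just keys and values
print(yaml.safe_dump(config, default_flow_style=False))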
HTTP Service¶
We’re going to install the dffml-service-http package which will be the server process of our microservice.
$ python -m pip install dffml-service-http
To deploy our previous InnerSource crawler dataflow via the HTTP API, we need to register a communication channel, which is the association of a URL path to the dataflow.
We create a config file for the MultiComm we’ll be using. MultiComm config files go under the mc directory of the directory being used to deploy. The config file itself then goes under the name of the MultiComm it’s associated with, http in this instance.
The HTTP service provides multiple channels of communication which we can attach DataFlows to. These end up being URL paths in the case of the HTTP server.
We need to create a directory for the URL to DataFlow configuration mappings to live in.
$ mkdir -p mc/http
The file is populated with the URL path that should trigger the dataflow, how to present the output data, and whether the dataflow should return when all outputs exist or continue waiting for more inputs (asynchronous, used for websockets / http2).
mc/http/metrics.yaml
path: /metrics
output_mode: json
asynchronous: false
We also need to create a directory for the DataFlow to live in.
$ mkdir df/
Querying GitHub¶
Create a directory where we’ll store all of the operations (Python functions) we’ll use to gather project data / metrics.
$ mkdir operations/
Make it a Python module by creating a blank __init__.py file in it.
$ touch operations/__init__.py
Install the PyGithub library, which we’ll use to access the GitHub API.
$ python -m pip install PyGithub
You’ll need a Personal Access Token to be able to make calls to GitHub’s API. You can create one by following their documentation.
When it presents you with a bunch of checkboxes for different “scopes” you don’t have to check any of them, unless you want to access your own private repos, in which case check the repo box.
$ export GITHUB_TOKEN=<paste your personal access token here>
You’ve just pasted your token into your terminal so it will likely show up in your shell’s history. You might want to either remove it from your history, or just delete the token on GitHub’s settings page after you’re done with this tutorial.
Write a Python function which returns an object representing a GitHub repo. For simplicity of this tutorial, the function will take the token from the environment variable we just set.
operations/gh.py
import os
import dffml
import github

@dffml.op(
    inputs={
        "url": dffml.Definition(name="github.repo.url", primitive="string"),
    },
    outputs={
        "owner": dffml.Definition(
            name="github.org.owner_name", primitive="string"
        ),
        "project": dffml.Definition(
            name="github.repo.project_name", primitive="string"
        ),
    },
)
def github_split_owner_project(url):
    """
    Parses the owner and project name out of a GitHub URL

    Examples
    --------

    >>> github_split_owner_project("https://github.com/intel/dffml")
    {'owner': 'intel', 'project': 'dffml'}
    """
    return dict(
        zip(
            ("owner", "project"),
            tuple("/".join(url.split("/")[-2:]).split("/")),
        )
    )


@dffml.op(
    inputs={
        "org": github_split_owner_project.op.outputs["owner"],
        "project": github_split_owner_project.op.outputs["project"],
    },
    outputs={
        "repo": dffml.Definition(
            name="PyGithub.Repository", primitive="object",
        ),
    },
)
def github_get_repo(org, project):
    # Instantiate a GitHub API object
    g = github.Github(os.environ["GITHUB_TOKEN"])
    # Make the request for the repo
    return {"repo": g.get_repo(f"{org}/{project}")}


@dffml.op(
    inputs={"repo": github_get_repo.op.outputs["repo"],},
    outputs={
        "raw_repo": dffml.Definition(
            name="PyGithub.Repository.Raw", primitive="object"
        ),
    },
)
def github_repo_raw(repo):
    # _rawData is the parsed JSON the GitHub API returned for this repo
    return {"raw_repo": repo._rawData}


# If this script is run via `python gh.py intel dffml`, it will print out the
# repo data using the pprint module.
if __name__ == "__main__":
    import sys
    import pprint

    pprint.pprint(
        github_repo_raw(github_get_repo(sys.argv[-2], sys.argv[-1])["repo"])
    )
You’ll notice that we wrote a function, and then put an if statement. The if block lets us only run the code within the block when the script is run directly (rather than when included via import).
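For example, if you import the module instead of running it as a script, nothing is printed and you can call the operations yourself. A quick interactive session might look like this (illustrative only, started from the directory containing operations/):

>>> from operations.gh import github_split_owner_project
>>> github_split_owner_project("https://github.com/intel/dffml")
{'owner': 'intel', 'project': 'dffml'}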
If we run Python on the script directly and pass an org name followed by a repo name, our if block will run the functions and print the raw data of the response received from GitHub, containing a bunch of information about the repo.
You’ll notice that the data being output here is a superset of the data we’d see for the repo in the repos.json file. Meaning we have all the required data and more.
$ python operations/gh.py intel dffml
{'allow_auto_merge': False,
<... output clipped ...>
'full_name': 'intel/dffml',
<... output clipped ...>
'html_url': 'https://github.com/intel/dffml',
<... output clipped ...>
'watchers_count': 135}
DataFlow¶
We’re going to create a Python script which will use all the operations we’ve written.
We need to download the repos.json file from the previous example so that we know what fields our DataFlow should output.
$ curl -fLo repos.json.bak https://github.com/SAP/project-portal-for-innersource/raw/main/repos.json
First we declare imports of other packages.
dataflow.py
import sys
import json
import asyncio
import pathlib
import dffml
Then we import our operations.
dataflow.py
# Import all the operations we implemented into this file's global namespace
from operations.gh import *
Finally we define our dataflow.
dataflow.py
# Read in the repos.json backup to learn its format. It's in the same directory
# as this file (dataflow.py) so we can reference it by looking in the parent
# directory of this file and then down (via .joinpath()) into repos.json.bak
repos_json_bak_path = pathlib.Path(__file__).parent.joinpath("repos.json.bak")
# Read in the contents
repos_json_bak_text = repos_json_bak_path.read_text()
# Parse the contents
repos_json_bak = json.loads(repos_json_bak_text)
# We'll inspect the first element in the list to find out what keys must be
# present in the object
required_data_top_level = list(repos_json_bak[0].keys())
# It should look like this:
# required_data_top_level = [
# 'id', 'name', 'full_name', 'html_url', 'description', 'created_at',
# 'updated_at', 'pushed_at', 'stargazers_count', 'watchers_count',
# 'language', 'forks_count', 'open_issues_count', 'license',
# 'default_branch', 'owner', '_InnerSourceMetadata'
# ]
# We're first going to create output operations to grab each of the keys
# We know that _InnerSourceMetadata is not a single value, so we'll handle that
# separately and remove it from our list
required_data_top_level.remove("_InnerSourceMetadata")
# Make a list of any imported OperationImplementations (functions decorated with
# @op()) from any that are in the global namespace of this file
operation_implementations = dffml.opimp_in(sys.modules[__name__])
# Create a DataFlow using every operation in all the modules we imported. Also
# use the remap operation
dataflow = dffml.DataFlow(
    dffml.remap,
    *operation_implementations,
    # The remap operation allows us to specify which keys should appear in the
    # outputs of each dataflow run. We do that by configuring it to use a
    # subflow, which is a dataflow run within a dataflow.
    # TODO(pdxjohnny) Remove .export() after unifying config code.
    configs={
        dffml.remap.op.name: {
            # Our subflow will run the get_single operation, which grabs one
            # Input object matching the given definition name. The one Input we
            # grab at first is the raw output of the PyGithub Repository object.
            "dataflow": dffml.DataFlow(
                dffml.GetSingle,
                seed=[
                    dffml.Input(
                        value=[github_repo_raw.op.outputs["raw_repo"].name],
                        definition=dffml.GetSingle.op.inputs["spec"],
                    )
                ],
            ).export()
        }
    },
    seed=[
        dffml.Input(
            # The output of the top level dataflow will be a dict where the keys
            # are what we give here, and the values are the output of a call to
            # traverse_get(), where the keys to traverse are the values we give
            # here, and the dict being traversed the results from the subflow.
            # {key: traverse_get(subflow_results, *value)}
            value={
                key: [github_repo_raw.op.outputs["raw_repo"].name, key]
                for key in required_data_top_level
            },
            definition=dffml.remap.op.inputs["spec"],
        )
    ],
)
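If you’d like to sanity check the dataflow from Python before deploying it, you could append a small block along these lines to dataflow.py. This is a sketch, not part of the original tutorial: it assumes dffml.run() is available as the high-level dataflow execution helper and that GITHUB_TOKEN is still exported.
dataflow.py
# Sketch: run the dataflow on a single repo URL and print the remapped
# results. Assumes dffml.run() is the high-level execution helper and that
# GITHUB_TOKEN is exported in the environment.
async def main():
    async for ctx, results in dffml.run(
        dataflow,
        [
            dffml.Input(
                value="https://github.com/intel/dffml",
                definition=github_split_owner_project.op.inputs["url"],
            )
        ],
    ):
        print(json.dumps(results, indent=4, sort_keys=True))


if __name__ == "__main__":
    asyncio.run(main())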
We export the dataflow for use with the CLI, HTTP service, etc.
TODO Add link to webui when complete. It will be used for editing dataflows. ETA Oct 2021.
$ dffml service dev export dataflow:dataflow | tee df/metrics.json
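As an optional sanity check, you can load the exported file back and list the operation instance names. The snippet below is only a sketch and assumes the exported dataflow stores its operations under a top-level "operations" key.

# Optional sanity check of the export (assumes a top-level "operations" key)
import json

with open("df/metrics.json") as metrics_json:
    exported = json.load(metrics_json)

print(sorted(exported["operations"]))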
We can run the dataflow using the DFFML command line interface rather than running the Python file.
If you want to run the dataflow on a single repo, you can do it as follows.
$ dffml dataflow run records set \
-dataflow df/metrics.json \
-record-def "github.repo.url" \
-keys \
https://github.com/intel/dffml
Serving the DataFlow¶
Warning
The -insecure flag is only being used here to speed up this tutorial. See documentation on HTTP API Security for more information.
We now start the HTTP server and tell it that the MultiComm configuration directory (mc/) can be found in the current directory (.).
$ dffml service http server -port 8080 -insecure -mc-config .
In another terminal, you can send a POST request containing the Input items that you want evaluated.
$ curl -sf \
--header "Content-Type: application/json" \
--request POST \
--data '{"https://github.com/intel/dffml": [{"value":"https://github.com/intel/dffml","definition":"github.repo.url"}]}' \
http://localhost:8080/metrics | python -m json.tool
[
{
"extra": {},
"features": {
"created_at": "2018-09-19T21:06:34Z",
"default_branch": "main",
"description": "The easiest way to use Machine Learning. Mix and match underlying ML libraries and data set sources. Generate new datasets or modify existing ones with ease.",
"forks_count": 118,
"full_name": "intel/dffml",
"html_url": "https://github.com/intel/dffml",
"id": 149512216,
"language": "Python",
"license": {
"key": "mit",
"name": "MIT License",
"node_id": "MDc6TGljZW5zZTEz",
"spdx_id": "MIT",
"url": "https://api.github.com/licenses/mit"
},
"name": "dffml",
"open_issues_count": 296,
"owner": {
"avatar_url": "https://avatars.githubusercontent.com/u/17888862?v=4",
"events_url": "https://api.github.com/users/intel/events{/privacy}",
"followers_url": "https://api.github.com/users/intel/followers",
"following_url": "https://api.github.com/users/intel/following{/other_user}",
"gists_url": "https://api.github.com/users/intel/gists{/gist_id}",
"gravatar_id": "",
"html_url": "https://github.com/intel",
"id": 17888862,
"login": "intel",
"node_id": "MDEyOk9yZ2FuaXphdGlvbjE3ODg4ODYy",
"organizations_url": "https://api.github.com/users/intel/orgs",
"received_events_url": "https://api.github.com/users/intel/received_events",
"repos_url": "https://api.github.com/users/intel/repos",
"site_admin": false,
"starred_url": "https://api.github.com/users/intel/starred{/owner}{/repo}",
"subscriptions_url": "https://api.github.com/users/intel/subscriptions",
"type": "Organization",
"url": "https://api.github.com/users/intel"
},
"pushed_at": "2021-09-17T03:31:18Z",
"stargazers_count": 143,
"updated_at": "2021-08-31T16:20:16Z",
"watchers_count": 143
},
"key": "https://github.com/intel/dffml",
"last_updated": "2021-09-17T09:39:30Z"
}
]
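If you’d rather call the endpoint from Python than curl, a minimal client using only the standard library might look like the sketch below. It assumes the server started above is still listening on localhost:8080.

# Minimal client sketch (not part of the original tutorial). Sends the same
# request as the curl command above and pretty prints the response.
import json
import urllib.request

url = "https://github.com/intel/dffml"
request = urllib.request.Request(
    "http://localhost:8080/metrics",
    method="POST",
    headers={"Content-Type": "application/json"},
    data=json.dumps(
        {url: [{"value": url, "definition": "github.repo.url"}]}
    ).encode(),
)

with urllib.request.urlopen(request) as response:
    print(json.dumps(json.load(response), indent=4, sort_keys=True))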