InnerSource Microservice

We created a crawler in the previous tutorial, InnerSource Portal. Now we are going to deploy a microservice which evaluates a single repo at a time using the same DataFlow we used for the crawler.

Our end result will be a container which serves a JSON API endpoint. We can send a request to the endpoint to evaluate metrics for an InnerSource repo.

Config Files

As we’ve seen before, DataFlows can be serialized to config files. JSON representations of DataFlows are not fun to hand edit. YAML looks a lot cleaner.

We’re going to install the dffml-config-yaml package so that we don’t have to look at JSON.

$ python -m pip install dffml-config-yaml

HTTP Service

We’re going to install the dffml-service-http package which will be the server process of our microservice.

$ python -m pip install dffml-service-http

To deploy our previous InnerSource crawler dataflow via the HTTP API, we need to register a communication channel, which is the association of a URL path to the dataflow.

We create a config file for the MultiComm we’ll be using. MultiComm config files go under the mc directory of the directory being used to deploy. The config file itself then goes under the name of the MultiComm it’s associated with, http in this instance.

The HTTP service provides multiple channels of communication which we can attach DataFlows to. These end up being URL paths in the case of the HTTP server.

We need to create a directory for the URL to DataFlow configuration mappings to live in.

$ mkdir -p mc/http

The file is populated with the URL path that should trigger the dataflow, how to present the output data, and whether the dataflow should return once all outputs exist, or continue waiting for more inputs (asynchronous, used for WebSockets / HTTP/2).

mc/http/metrics.yaml

path: /metrics
output_mode: json
asynchronous: false

We also need to create a directory for the DataFlow to live in.

$ mkdir df/

Querying GitHub

Create a directory where we’ll store all of the operations (Python functions) we’ll use to gather project data / metrics.

$ mkdir operations/

Make it a Python module by creating a blank __init__.py file in it.

$ touch operations/__init__.py

Install the PyGithub library, which we’ll use to access the GitHub API.

$ python -m pip install PyGithub

You’ll need a Personal Access Token to be able to make calls to GitHub’s API. You can create one by following their documentation.

When it presents you with a bunch of checkboxes for different “scopes”, you don’t have to check any of them, unless you want to access your own private repos, in which case check the repo box.

$ export GITHUB_TOKEN=<paste your personal access token here>

You’ve just pasted your token into your terminal, so it will likely show up in your shell’s history. You might want to either remove it from your history, or just delete the token on GitHub’s settings page after you’re done with this tutorial.

Write a Python function which returns an object representing a GitHub repo. To keep this tutorial simple, the function will take the token from the environment variable we just set.

operations/gh.py

import os

import dffml
import github


@dffml.op(
    inputs={
        "url": dffml.Definition(name="github.repo.url", primitive="string"),
    },
    outputs={
        "owner": dffml.Definition(
            name="github.org.owner_name", primitive="string"
        ),
        "project": dffml.Definition(
            name="github.repo.project_name", primitive="string"
        ),
    },
)
def github_split_owner_project(url):
    """
    Parses the owner and project name out of a GitHub URL

    Examples
    --------

    >>> github_split_owner_project("https://github.com/intel/dffml")
    {'owner': 'intel', 'project': 'dffml'}
    """
    return dict(
        zip(
            ("owner", "project"),
            tuple("/".join(url.split("/")[-2:]).split("/")),
        )
    )


@dffml.op(
    inputs={
        "org": github_split_owner_project.op.outputs["owner"],
        "project": github_split_owner_project.op.outputs["project"],
    },
    outputs={
        "repo": dffml.Definition(
            name="PyGithub.Repository", primitive="object",
        ),
    },
)
def github_get_repo(org, project):
    # Instantiate a GitHub API object
    g = github.Github(os.environ["GITHUB_TOKEN"])
    # Make the request for the repo
    return {"repo": g.get_repo(f"{org}/{project}")}


@dffml.op(
    inputs={"repo": github_get_repo.op.outputs["repo"],},
    outputs={
        "raw_repo": dffml.Definition(
            name="PyGithub.Repository.Raw", primitive="object"
        ),
    },
)
def github_repo_raw(repo):
    return {"raw_repo": repo._rawData}


# If this script is run via `python gh.py intel dffml`, it will print out the
# repo data using the pprint module.
if __name__ == "__main__":
    import sys
    import pprint

    pprint.pprint(
        github_repo_raw(
            github_get_repo(sys.argv[-2], sys.argv[-1])["repo"]
        )["raw_repo"]
    )

You’ll notice that we wrote a function, and then put an if statement. The if block lets us run the code within the block only when the script is run directly (rather than when included via import).

If we run Python on the script and pass an org name followed by a repo name, our if block will run the functions and print the raw data of the response received from GitHub, which contains a bunch of information about the repo.

You’ll notice that the data output here is a superset of the data we’d see for the repo in the repos.json file, meaning we have all the required data and more.

$ python operations/gh.py intel dffml
{'allow_auto_merge': False,
 <... output clipped ...>
 'full_name': 'intel/dffml',
 <... output clipped ...>
 'html_url': 'https://github.com/intel/dffml',
 <... output clipped ...>
 'watchers_count': 135}

DataFlow

We’re going to create a Python script which will use all the operations we’ve written.

We need to download the repos.json file from the previous example so that we know what fields our DataFlow should output.

$ curl -fLo repos.json.bak https://github.com/SAP/project-portal-for-innersource/raw/main/repos.json
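
Before writing the dataflow, you can optionally sanity check that the raw GitHub data really is a superset of the repos.json fields. The following is a rough sketch, not one of the tutorial's files; it assumes you've downloaded repos.json.bak as above, exported GITHUB_TOKEN, and are running from the directory containing operations/.

import json
import pathlib

from operations.gh import github_get_repo, github_repo_raw

# Grab the top level keys from the first entry in the backup
repos_json_bak = json.loads(pathlib.Path("repos.json.bak").read_text())
required_keys = set(repos_json_bak[0].keys())
# _InnerSourceMetadata was added by the crawler, GitHub won't return it
required_keys.discard("_InnerSourceMetadata")

# Ask GitHub for the raw repo data and compare keys
raw_repo = github_repo_raw(github_get_repo("intel", "dffml")["repo"])["raw_repo"]
print("Missing keys:", required_keys - set(raw_repo.keys()))

If the printed set is empty, every field we need is available from the API response.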

First we declare imports of other packages.

dataflow.py

import sys
import json
import asyncio
import pathlib

import dffml

Then we import our operations.

dataflow.py

# Import all the operations we implemented into this file's global namespace
from operations.gh import *

Finally we define our dataflow.

dataflow.py

# Read in the repos.json backup to learn its format. It's in the same directory
# as this file (dataflow.py) so we can reference it by looking in the parent
# directory of this file and then down (via .joinpath()) into repos.json.bak
repos_json_bak_path = pathlib.Path(__file__).parent.joinpath("repos.json.bak")
# Read in the contents
repos_json_bak_text = repos_json_bak_path.read_text()
# Parse the contents
repos_json_bak = json.loads(repos_json_bak_text)
# We'll inspect the first element in the list to find out what keys must be
# present in the object
required_data_top_level = list(repos_json_bak[0].keys())
# It should look like this:
#   required_data_top_level = [
#       'id', 'name', 'full_name', 'html_url', 'description', 'created_at',
#       'updated_at', 'pushed_at', 'stargazers_count', 'watchers_count',
#       'language', 'forks_count', 'open_issues_count', 'license',
#       'default_branch', 'owner', '_InnerSourceMetadata'
#   ]
# We're first going to create output operations to grab each of the keys
# We know that _InnerSourceMetadata is not a single value, so we'll handle that
# separately and remove it from our list
required_data_top_level.remove("_InnerSourceMetadata")

# Make a list of any imported OperationImplementations (functions decorated with
# @op()) from any that are in the global namespace of this file
operation_implementations = dffml.opimp_in(sys.modules[__name__])

# Create a DataFlow using every operation in all the modules we imported. Also
# use the remap operation
dataflow = dffml.DataFlow(
    dffml.remap,
    *operation_implementations,
    # The remap operation allows us to specify which keys should appear in the
    # outputs of each dataflow run. We do that by configuring it to use a
    # subflow, which is a dataflow run within a dataflow.
    # TODO(pdxjohnny) Remove .export() after unifying config code.
    configs={
        dffml.remap.op.name: {
            # Our subflow will run the get_single operation, which grabs one
            # Input object matching the given definition name. The one Input we
            # grab at first is the raw output of the PyGithub Repository object.
            "dataflow": dffml.DataFlow(
                dffml.GetSingle,
                seed=[
                    dffml.Input(
                        value=[github_repo_raw.op.outputs["raw_repo"].name],
                        definition=dffml.GetSingle.op.inputs["spec"],
                    )
                ],
            ).export()
        }
    },
    seed=[
        dffml.Input(
            # The output of the top level dataflow will be a dict where the keys
            # are what we give here, and the values are the output of a call to
            # traverse_get(), where the keys to traverse are the values we give
            # here, and the dict being traversed is the results from the subflow.
            # {key: traverse_get(subflow_results, *value)}
            value={
                key: [github_repo_raw.op.outputs["raw_repo"].name, key]
                for key in required_data_top_level
            },
            definition=dffml.remap.op.inputs["spec"],
        )
    ],
)
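
To make the remap configuration more concrete, here is a plain Python sketch of the key traversal it describes. This is an illustration only, not DFFML's implementation; the subflow results and spec are trimmed down versions of what the dataflow above produces.

# Results of the subflow, keyed by definition name (trimmed for illustration)
subflow_results = {
    "PyGithub.Repository.Raw": {"full_name": "intel/dffml", "language": "Python"}
}
# A trimmed down version of the spec we passed to remap above
spec = {
    "full_name": ["PyGithub.Repository.Raw", "full_name"],
    "language": ["PyGithub.Repository.Raw", "language"],
}

def traverse_get(mapping, *keys):
    # Walk down into the nested dict one key at a time
    for key in keys:
        mapping = mapping[key]
    return mapping

# {key: traverse_get(subflow_results, *value)}
print({key: traverse_get(subflow_results, *value) for key, value in spec.items()})
# {'full_name': 'intel/dffml', 'language': 'Python'}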

We export the dataflow for use with the CLI, HTTP service, etc.

TODO Add link to webui when complete. It will be used for editing dataflows. ETA Oct 2021.

$ dffml service dev export dataflow:dataflow | tee df/metrics.json

We can run the dataflow using the DFFML command line interface rather than running the Python file.

If you want to run the dataflow on a single repo, you can do it as follows.

$ dffml dataflow run records set \
    -dataflow df/metrics.json \
    -record-def "github.repo.url" \
    -keys \
      https://github.com/intel/dffml
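
If you would rather run the dataflow from Python than from the CLI, a minimal sketch using DFFML's high level run() API might look like the following. The exact signature of dffml.run can vary between releases, so treat this as an outline rather than a drop-in script.

import asyncio

import dffml

# dataflow.py defines the DataFlow, operations.gh defines the url definition
from dataflow import dataflow
from operations.gh import github_split_owner_project


async def main():
    # Feed the repo URL in as an Input matching the github.repo.url definition
    async for ctx, results in dffml.run(
        dataflow,
        [
            dffml.Input(
                value="https://github.com/intel/dffml",
                definition=github_split_owner_project.op.inputs["url"],
            )
        ],
    ):
        print(results)


asyncio.run(main())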

Serving the DataFlow

Warning

The -insecure flag is only being used here to speed up this tutorial. See documentation on HTTP API Security for more information.

We now start the HTTP server and tell it that the MultiComm configuration directory (mc/) can be found in the current directory (.).

$ dffml service http server -port 8080 -insecure -mc-config .

In another terminal, you can send a POST request containing the Input items that you want evaluated.

$ curl -sf \
  --header "Content-Type: application/json" \
  --request POST \
  --data '{"https://github.com/intel/dffml": [{"value":"https://github.com/intel/dffml","definition":"github.repo.url"}]}' \
  http://localhost:8080/metrics | python -m json.tool
[
    {
        "extra": {},
        "features": {
            "created_at": "2018-09-19T21:06:34Z",
            "default_branch": "main",
            "description": "The easiest way to use Machine Learning. Mix and match underlying ML libraries and data set sources. Generate new datasets or modify existing ones with ease.",
            "forks_count": 118,
            "full_name": "intel/dffml",
            "html_url": "https://github.com/intel/dffml",
            "id": 149512216,
            "language": "Python",
            "license": {
                "key": "mit",
                "name": "MIT License",
                "node_id": "MDc6TGljZW5zZTEz",
                "spdx_id": "MIT",
                "url": "https://api.github.com/licenses/mit"
            },
            "name": "dffml",
            "open_issues_count": 296,
            "owner": {
                "avatar_url": "https://avatars.githubusercontent.com/u/17888862?v=4",
                "events_url": "https://api.github.com/users/intel/events{/privacy}",
                "followers_url": "https://api.github.com/users/intel/followers",
                "following_url": "https://api.github.com/users/intel/following{/other_user}",
                "gists_url": "https://api.github.com/users/intel/gists{/gist_id}",
                "gravatar_id": "",
                "html_url": "https://github.com/intel",
                "id": 17888862,
                "login": "intel",
                "node_id": "MDEyOk9yZ2FuaXphdGlvbjE3ODg4ODYy",
                "organizations_url": "https://api.github.com/users/intel/orgs",
                "received_events_url": "https://api.github.com/users/intel/received_events",
                "repos_url": "https://api.github.com/users/intel/repos",
                "site_admin": false,
                "starred_url": "https://api.github.com/users/intel/starred{/owner}{/repo}",
                "subscriptions_url": "https://api.github.com/users/intel/subscriptions",
                "type": "Organization",
                "url": "https://api.github.com/users/intel"
            },
            "pushed_at": "2021-09-17T03:31:18Z",
            "stargazers_count": 143,
            "updated_at": "2021-08-31T16:20:16Z",
            "watchers_count": 143
        },
        "key": "https://github.com/intel/dffml",
        "last_updated": "2021-09-17T09:39:30Z"
    }
]
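
If you prefer making the request from Python instead of curl, a minimal sketch using only the standard library (assuming the server from the previous step is still running on port 8080):

import json
import urllib.request

# Same body as the curl command above: one Input per repo URL to evaluate
body = {
    "https://github.com/intel/dffml": [
        {"value": "https://github.com/intel/dffml", "definition": "github.repo.url"}
    ]
}

request = urllib.request.Request(
    "http://localhost:8080/metrics",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(request) as response:
    print(json.dumps(json.load(response), indent=4))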