About

Data Flow Facilitator for Machine Learning (DFFML) makes it easy to generate datasets, train and use machine learning models, and integrate machine learning into new or existing applications. It provides APIs for dataset generation, storage, and model definition.

  • Models handle implementations of machine learning algorithms, typically by wrapping code from a popular machine learning framework.

  • Sources handle the storage of datasets, saving and loading them from files, databases, remote APIs, etc.

  • DataFlows are directed graphs used to generate a dataset or to modify an existing one. They can also be used for tasks unrelated to machine learning; for instance, you could use them to build a web app.

You’ll find the existing implementations of all of these on their respective Plugins pages. DFFML has a plugin-based architecture, which allows us to include some sources, models, and operations as part of the main package, dffml, and to ship other functionality in more specific packages.

Mission Statement

DFFML aims to be the easiest and most convenient way to use Machine Learning.

  • It’s an AutoML Python library, command line application, and HTTP service.

  • You give it your data and tell it what kind of model you want to train. It creates a model for you.

  • If you want finer-grained control over the model, you can get it by implementing your own model plugin.

  • We make it easy to use and deploy your models.

  • We provide a concurrent, directed-graph execution environment with managed locking, which we call DataFlows.

  • DataFlows make it easy to generate datasets or modify existing datasets for rapid iteration on feature engineering.

Why DFFML

  • You want a “just bring your data” approach to machine learning.

    • No need to write code if you don’t want to: use popular machine learning libraries via the command line, a high-level Python abstraction, or the HTTP API.

  • You want to do machine learning on a new problem that you don’t have a dataset for, so you need to generate it.

    • Directed Graph Execution lets you write code that runs concurrently with managed locking, which makes iterating on feature engineering fast.

Architecture

This is a high-level overview of how DFFML works.

Architecture Diagram

Machine Learning

Python was chosen because of the machine learning community’s preference for it. In addition to the data flow side of DFFML, there is a machine learning focused side, which provides a standardized way of defining, training, and using models. It also allows existing models to be wrapped and exposed via the standardized API. Models can then be integrated into data flows as operations, which makes it straightforward to layer models to create complex features. See Models for existing models and usage.
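
To make the idea of wrapping an existing model behind a standardized API more concrete, here is a minimal sketch. It does not use DFFML's actual classes: Model, ThirdPartyRegressor, and WrappedRegressor are hypothetical names made up for this illustration, and the train/predict method names are an assumption about what such an interface might look like.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

Record = Dict[str, float]


class Model(ABC):
    """Hypothetical standardized interface (an assumption, not DFFML's Model class)."""

    @abstractmethod
    def train(self, records: List[Record]) -> None:
        """Fit the model to a list of records."""

    @abstractmethod
    def predict(self, record: Record) -> float:
        """Return a prediction for one record."""


class ThirdPartyRegressor:
    """Stand-in for an estimator from an existing framework, with its own method names."""

    def __init__(self) -> None:
        self.slope = 0.0

    def fit(self, xs: List[float], ys: List[float]) -> None:
        # Least-squares line through the origin: slope = sum(x*y) / sum(x*x).
        self.slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

    def estimate(self, x: float) -> float:
        return self.slope * x


class WrappedRegressor(Model):
    """Adapter exposing ThirdPartyRegressor through the standardized interface."""

    def __init__(self, feature: str, target: str) -> None:
        self.feature = feature
        self.target = target
        self._inner = ThirdPartyRegressor()

    def train(self, records: List[Record]) -> None:
        xs = [record[self.feature] for record in records]
        ys = [record[self.target] for record in records]
        self._inner.fit(xs, ys)

    def predict(self, record: Record) -> float:
        return self._inner.estimate(record[self.feature])


model = WrappedRegressor(feature="x", target="y")
model.train([{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 4.0}, {"x": 3.0, "y": 6.0}])
prediction = model.predict({"x": 4.0})
print(prediction)
```

Because callers only see train and predict, the wrapped estimator could be swapped for another one without changing the code that uses the model, which is the property the standardized API is after.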

Data Flows - Directed Graph Execution

The idea behind this project is to provide a way to link together various new or existing pieces of code and run them via an orchestration engine that forwards the data between them all. This is similar to a microservice architecture, but with the orchestration performed according to a directed graph. This offers greater flexibility, because the interaction between services can be modified by changing only the graph (known as the dataflow), without changing code.

This is an example of the dataflow for shouldi, a meta static analysis tool for Python. We take the package name (package) and feed it through operations, which are just functions (but could be anything, such as a SaaS web API endpoint). All the data generated by running these operations is queryable, allowing us to structure the output in whatever way best fits our application.

DataFlow for shouldi tool
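
The following is a minimal sketch of directed graph execution in general, not DFFML's orchestrator or the real shouldi dataflow. Everything in it is an assumption made for illustration: run_flow, FLOW, and the three operations (name_length, has_digit, report) are hypothetical. Operations are plain async functions, the graph is a separate mapping from each output name to the operation that produces it and the inputs it consumes, and operations whose inputs are available run concurrently.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Output name -> (operation, names of the inputs it consumes).
Flow = Dict[str, Tuple[Callable[..., Awaitable[object]], List[str]]]


async def run_flow(flow: Flow, seed: Dict[str, object]) -> Dict[str, object]:
    """Run each operation once its inputs exist; ready operations run concurrently."""
    results = dict(seed)
    pending = dict(flow)
    while pending:
        # Select every operation whose inputs have all been produced already.
        ready = {
            out: (func, deps)
            for out, (func, deps) in pending.items()
            if all(dep in results for dep in deps)
        }
        if not ready:
            raise ValueError(f"unsatisfiable inputs for: {sorted(pending)}")
        values = await asyncio.gather(
            *(func(*(results[dep] for dep in deps)) for func, deps in ready.values())
        )
        results.update(zip(ready, values))
        for out in ready:
            del pending[out]
    return results


# Hypothetical operations; they know nothing about each other or the graph.
async def name_length(package: str) -> int:
    return len(package)


async def has_digit(package: str) -> bool:
    return any(ch.isdigit() for ch in package)


async def report(length: int, digit: bool) -> str:
    return f"length={length} digit={digit}"


# The graph: editing this mapping rewires the flow without touching the functions.
FLOW: Flow = {
    "length": (name_length, ["package"]),
    "digit": (has_digit, ["package"]),
    "report": (report, ["length", "digit"]),
}

if __name__ == "__main__":
    print(asyncio.run(run_flow(FLOW, {"package": "dffml"})))
```

Every intermediate value ends up in the returned dictionary, so the caller can pick out whichever results it needs. This sketch leaves out the managed locking and input validation that a real orchestrator would need.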

Consistent API

DFFML decouples the interface through which the flow is accessed from the flow itself. For instance, data flows can be run via the library, HTTP API, CLI, or any other communication channel (the next targets are Slack and IRC). Data flows are also asynchronous in nature, allowing them to be used to build any event-driven application (chat, IoT data, etc.). The way in which operations are defined and executed by the orchestrator will let us take existing API endpoints and code in other languages and combine them into one cohesive workflow. The architecture itself is programming language agnostic; the first implementation has been written in Python.

Plugins

We take a community-driven approach to content. The architecture is plugin-based, which means anyone can swap out any piece by writing their own plugin and publishing it to the Python Package Index. Developers can therefore publish operations and machine learning models that work out of the box with everything maintained as part of the core repository, and with other developers’ models and operations. Tutorials show how to create your own plugins.

Team

We have an awesome team working on the project. We hold weekly meetings and have a mailing list and chat! If you want to get involved, ask questions, or get help getting started, see Contact Us.

We participated in Google Summer of Code 2019 under the Python Software Foundation. A big thanks to our students, Yash and Sudharsana!

Users

The following is a list of organizations and projects using DFFML. Please let us know if you are using DFFML and we’ll add you to the list. If you want help using DFFML, see the Contact Us page.

  • Intel

    • Security analysis of Open Source Software dependencies.