Concepts
Here we explain the main concepts of DFFML: how things work, and the philosophies behind why they work the way they do. If anything here is unclear, or you think there’s a more user-friendly way to do something, please let us know. See the Contact Us page for how to reach us.
DFFML aims to streamline the training and use of machine learning models, as well as dataset collection and storage. As such, there are three sides to the DFFML triangle.
Sources abstract the storage of datasets.
Models abstract implementations of machine learning algorithms.
DataFlows are used to generate datasets, as well as modify existing ones.
Every part of DFFML is a plugin, which means you can replace it with your own implementation without needing to modify DFFML itself. It also means everyone can use each other’s plugins by publishing them to PyPI, or just by creating a public Git repo.
If you don’t see the plugin you need already implemented under Plugins, you can Create it yourself.
Sources
A Source of data might be something like a CSV file, SQL database, or HDFS.
Because Sources provide a generic abstraction around where data is saved and stored, model implementations can access the data via the same API no matter where it lives.
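For example, using the Python API, records could be saved to and loaded from a CSV file through the CSVSource plugin. The following is only a minimal sketch: the file name is made up, and config fields such as allowempty and readwrite may vary between versions.
import asyncio
from dffml import CSVSource, Record, load, save

# Hypothetical CSV file holding our city records.
source = CSVSource(filename="cities.csv", allowempty=True, readwrite=True)

async def main():
    # Save a record keyed by city, then read every record back out.
    await save(
        source,
        Record(
            "Portland, OR, USA",
            data={"features": {"climate": "rainy", "traffic": "heavy"}},
        ),
    )
    async for record in load(source):
        print(record.key, record.features())

asyncio.run(main())
Swapping the CSV file for a SQL database would only change the source object; the save and load calls stay the same.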
Records
A common construct within DFFML is the Record. A Record object is a repository of information associated with a unique key. The Record holds all the data associated with that key.
Say, for instance, you generated a dataset that had to do with cities. Your unique key might be the name of the city, the state or province it’s in, and the country it’s in. For example: Portland, OR, USA.
The data associated with a Record is called the feature data. It’s stored within a key-value mapping within the Record, accessible via the features() method. Our city example might have the following feature data.
{
"climate": "rainy",
"population": "too many",
"traffic": "heavy"
}
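In the Python API that might look roughly like the following sketch (constructor arguments may differ slightly between versions):
from dffml import Record

# A record keyed by city; the feature data lives under "features".
city = Record(
    "Portland, OR, USA",
    data={
        "features": {
            "climate": "rainy",
            "population": "too many",
            "traffic": "heavy",
        }
    },
)

print(city.key)         # Portland, OR, USA
print(city.features())  # {'climate': 'rainy', 'population': 'too many', 'traffic': 'heavy'}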
Models
A Model could be any implementation of a machine learning algorithm.
By wrapping various implementations of machine learning algorithms in DFFML’s API, applications using DFFML via its Python library interface, command line interface, or HTTP interface all interact with models through the same design pattern. This means that switching from a model implemented with one major framework to another is painless.
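For instance, training and using the built-in simple linear regression model from Python might look roughly like the sketch below, loosely based on the quickstart. The feature names and training data here are made up, and config fields such as location differ between versions and model plugins.
from dffml import Feature, Features
from dffml.noasync import train, predict
from dffml.model.slr import SLRModel

# Predict Salary from Years of experience with simple linear regression.
model = SLRModel(
    features=Features(Feature("Years", int, 1)),
    predict=Feature("Salary", int, 1),
    location="slr-model",
)

# Hypothetical training data; a CSV file or any other Source works the same way.
train(
    model,
    {"Years": 0, "Salary": 10},
    {"Years": 1, "Salary": 20},
    {"Years": 2, "Salary": 30},
)

for key, features, prediction in predict(model, {"Years": 6}):
    print(features, prediction["Salary"]["value"])
Because every model is wrapped in the same API, swapping SLRModel for a model plugin backed by another framework only changes the import and the config, not the train and predict calls.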
DataFlows
One can think of the data flow side of DFFML as an event loop running at a high level of abstraction. An event loop usually refers to waiting for read, write, and error events on network connections or files, or, if you’re in JavaScript, it might be a click.
The idea behind event loops is that when a new event comes in, it triggers some processing of the data associated with that event.
For DFFML we define data types that we care about, and then define operations (essentially functions) that get run when new data of our defined data types shows up.
One benefit of using the data flow programming abstraction provided by DFFML is that it runs everything concurrently and manages locking of data used by the concurrently running routines. In addition to concurrency, asyncio, which DFFML makes heavy use of, makes it easy to run things in parallel, so as to fully utilize a CPU’s cores and threads.
The following are the key concepts relating to DataFlows.
Definition
The name of the data type, and what its primitive is, primitive meaning whether it’s a string, integer, float, etc. If a piece of data of this data type needs to be locked when it is created or used, the definition will also specify that (lock=True).
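For example (a sketch; these particular definitions are made up):
from dffml.df.types import Definition

# A plain string data type.
URL = Definition(name="URL", primitive="string")

# A data type whose values must be locked while in use, for instance a
# directory on disk that only one operation should touch at a time.
WORK_DIR = Definition(name="work_dir", primitive="string", lock=True)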
Operation
The definition of some routine or function that will process some input data and produce some output data. It contains the names of the inputs and outputs, what stage the operation runs in, and the name of the operation.
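In Python, operations are typically declared by decorating an async function with the op decorator, which records the input and output definitions alongside the implementation. A sketch using made-up definitions:
from dffml.df.base import op
from dffml.df.types import Definition

URL = Definition(name="URL", primitive="string")
URL_SCHEME = Definition(name="url_scheme", primitive="string")

# The decorator ties the function to an Operation describing its inputs and
# outputs. The function returns a dict keyed by output name.
@op(inputs={"url": URL}, outputs={"scheme": URL_SCHEME})
async def extract_scheme(url: str) -> dict:
    return {"scheme": url.split("://", 1)[0]}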
Stage
Operations can be run at various stages.
Processing
Operations with this stage will be run until no new permutations of their input parameters exist.
Cleanup
After there are no operations to be run in the processing stage, cleanup operations are run to free any resources created during processing.
Output
Used to get data out of the network. Operations running in the output stage produce the data used as the result of running all the operations.
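The stage an operation runs in can be set through the op decorator. A sketch of a hypothetical cleanup operation that removes a temporary directory created during processing:
import shutil

from dffml.df.base import op
from dffml.df.types import Definition, Stage

WORK_DIR = Definition(name="work_dir", primitive="string", lock=True)

# Runs only once there are no more processing stage operations left to run.
@op(inputs={"work_dir": WORK_DIR}, outputs={}, stage=Stage.CLEANUP)
async def remove_work_dir(work_dir: str) -> None:
    shutil.rmtree(work_dir, ignore_errors=True)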
Operation Implementation
The routine or function responsible for performing an Operation.
We separate the concept of an Operation from its implementation because the goal is to allow for transparent execution of operations written in other languages, deployed as microservices, or exposed as parts of SaaS APIs. Transparent, that is, from the point of view of the DataFlow, which defines the interconnections between operations.
Input Network
All data, both inputs and outputs, lives within the Input Network. Since the outputs of one operation are usually the inputs to another, we refer to them all as inputs; therefore, they all reside within the Input Network.
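Individual pieces of data are wrapped in Input objects, which pair a value with its Definition; these are what get added to the Input Network. A sketch using the URL definition from above:
from dffml.df.types import Definition, Input

URL = Definition(name="URL", primitive="string")

# One piece of data of type URL; Inputs like this are what flow through
# the Input Network.
url_input = Input(value="https://example.com", definition=URL)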
Operation Network
All the definitions of Operations reside in the Operation Network.
Operation Implementation Network
All the references to implementations of Operations reside in the Operation Implementation Network.
This network is responsible for the execution of any given Operation within it.
Redundancy Checker
Checks whether an operation has already been called with a given set of input parameters. This is needed because a DataFlow runs by executing all possible permutations of inputs for any given operation, and completes when no new permutations of inputs exist for any operation.
Lock Network
Manages locking of input data so that operations can run concurrently without managing their own resource locking.
Orchestrator
The orchestrator uses the various networks to execute dataflows.
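Running a DataFlow from Python through the default in-memory orchestrator might look roughly like the sketch below, which repeats the extract_scheme operation from the Operation section so the example stands alone. The high-level run helper and the exact spelling of GetSingle’s seed input are assumptions; entry points differ between versions.
import asyncio

from dffml import DataFlow, Input, run
from dffml.df.base import op
from dffml.df.types import Definition
from dffml.operation.output import GetSingle

URL = Definition(name="URL", primitive="string")
URL_SCHEME = Definition(name="url_scheme", primitive="string")

@op(inputs={"url": URL}, outputs={"scheme": URL_SCHEME})
async def extract_scheme(url: str) -> dict:
    return {"scheme": url.split("://", 1)[0]}

# Combine our operation with the built-in GetSingle output operation, then
# seed the flow with an input telling GetSingle which definition to return.
dataflow = DataFlow.auto(extract_scheme, GetSingle)
dataflow.seed.append(
    Input(
        value=[URL_SCHEME.name],
        definition=GetSingle.op.inputs["spec"],
    )
)

async def main():
    async for ctx, results in run(
        dataflow,
        [Input(value="https://example.com", definition=URL)],
    ):
        print(results)  # e.g. {'url_scheme': 'https'}

asyncio.run(main())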