DataFlows¶
A running DataFlow is an event loop. First we’ll look at terminology associated with DataFlows. Then we’ll go through the sequence of events that constitute the running of a DataFlow. Lastly we’ll go over the benefits of using DataFlows.
Terminology¶
Operation
Things that will happen when the DataFlow is running. They define inputs and outputs. Inputs are the data they require to run, and outputs are the data they produce as a result. Similar to a function prototype in C, an Operation only contains metadata.
OperationImplementation
The implementation of an Operation. This is the code that gets run when we talk about "running an operation". A Python function can be an OperationImplementation (see the sketch after this terminology list).
Input
Data that will be given to an Operation when it runs.
DataFlow
Description of how Operations are connected. Defines where Operations should get their inputs from. Inputs can be received from the outputs of other operations, predefined seed values, or anywhere else.
Orchestrator
The runner of the DataFlow. Facilitates running of operations and manages input data.

The Orchestrator makes use of four different "Networks" and a RedundancyChecker.

- The InputNetwork stores all the (Input) data. It accepts incoming data and notifies the Orchestrator when there is new data.
- The OperationNetwork stores all Operations the Orchestrator knows about.
- The OperationImplementationNetwork is responsible for running an Operation with a set of Inputs. A unique set of Inputs for an Operation is known as a ParameterSet.
- The LockNetwork manages locking of Inputs. This is used when the Definition of the data type of an Input declares that it may only be used when locked.
- The RedundancyChecker ensures that Operations don't get run with the same ParameterSet more than once.

Operations get their inputs from the outputs of other Operations within the same InputSetContext. InputSetContexts create barriers which prevent Inputs within one context from being combined with Inputs within another context.
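To make the Operation / OperationImplementation split concrete, here is a minimal sketch using DFFML's @op decorator. The URL and DOMAIN definitions and the get_domain function are illustrative names, not part of DFFML itself.

```python
# A minimal sketch, assuming DFFML's @op decorator and Definition type.
# The URL/DOMAIN definitions and get_domain() are illustrative, not
# part of DFFML itself.
from urllib.parse import urlparse

from dffml.df.base import op
from dffml.df.types import Definition

URL = Definition(name="url", primitive="string")
DOMAIN = Definition(name="domain", primitive="string")

# @op records the Operation: the metadata (name, inputs, outputs),
# much like a C function prototype. The decorated function body is the
# OperationImplementation: the code that actually runs.
@op(inputs={"url": URL}, outputs={"domain": DOMAIN})
async def get_domain(url: str) -> dict:
    # Return a dict keyed by output name
    return {"domain": urlparse(url).netloc}
```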
What Happens When A DataFlow Runs¶
When the Orchestrator starts running a DataFlow, the following sequence of events takes place (a runnable sketch follows the list).
- The OperationImplementationNetwork instantiates all of the OperationImplementations that are needed by the DataFlow.
- Our first stage is the Processing Stage, where data will be generated.
- The Orchestrator kicks off any contexts that were given to the run method, along with the inputs for each context.
- All seed Inputs are added to each context.
- All inputs for each context are added to the InputNetwork. This is the New Inputs step in the flow chart below.
- The OperationNetwork looks at what inputs just arrived. It determines which Operations may have new parameter sets. If an Operation has inputs whose possible origins include the origin of one of the inputs which just arrived, then it may have a new ParameterSet.
- We generate Operation parameter set pairs by checking if there are any new permutations of Inputs for an Operation. If the RedundancyChecker has no record of that permutation being run, we create a new ParameterSet composed of those Inputs.
- We dispatch operations for running which have new ParameterSets.
- The LockNetwork locks any Inputs which can't have multiple operations use them at the same time.
- The OperationImplementationNetwork runs each operation, using the given parameter set as inputs.
- The outputs of the Operation are added to the InputNetwork and the loop repeats.
- Once there are no more Operation / ParameterSet pairs which the RedundancyChecker knows to be unique, the Cleanup Stage begins.
- The Cleanup Stage contains operations which will release any underlying resources allocated for Inputs generated during the Processing Stage.
- Finally, the Output Stage runs. Operations running in this stage query the InputNetwork to organize the data within it into the user's desired output format.
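As a concrete illustration of the sequence above, here is a sketch that runs a two-operation DataFlow with DFFML's in-memory Orchestrator. It reuses the illustrative get_domain operation and URL/DOMAIN definitions from the terminology section; GetSingle is an Output Stage operation that queries the InputNetwork for the requested definitions.

```python
# A sketch of running a DataFlow, assuming DFFML's MemoryOrchestrator
# and GetSingle output operation. get_domain, URL, and DOMAIN are the
# illustrative names from the earlier sketch.
import asyncio

from dffml.df.memory import MemoryOrchestrator
from dffml.df.types import DataFlow, Input
from dffml.operation.output import GetSingle

dataflow = DataFlow.auto(get_domain, GetSingle)
# A seed Input, added to every context: it tells GetSingle (an Output
# Stage operation) which definition's value to collect.
dataflow.seed.append(
    Input(value=[DOMAIN.name], definition=GetSingle.op.inputs["spec"])
)

async def main():
    async with MemoryOrchestrator.withconfig({}) as orchestrator:
        async with orchestrator(dataflow) as octx:
            # Each dictionary key names an InputSetContext; Inputs in
            # one context are never combined with Inputs from another.
            async for ctx, results in octx.run(
                {
                    "site0": [Input(value="https://example.com/a", definition=URL)],
                    "site1": [Input(value="https://example.org/b", definition=URL)],
                }
            ):
                print(ctx, results)

asyncio.run(main())
```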
Benefits of DataFlows¶
Modularity
Adding a layer of abstraction that separates operations from their implementations lets us focus on the logic of the application rather than how it's implemented.
Implementations are easily unit testable. They can be swapped out for another implementation with similar functionality. For example, if you had a "send email" operation, you could swap the implementation from sending via your email server to sending via a third party service.
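Because an OperationImplementation is just code (here, a plain async Python function), it can also be unit tested directly, with no DataFlow or Orchestrator involved. A sketch, reusing the illustrative get_domain operation from above:

```python
# Unit testing an OperationImplementation directly. This assumes the
# function decorated with @op remains callable as a normal coroutine.
import asyncio

def test_get_domain():
    result = asyncio.run(get_domain("https://example.com/path"))
    assert result == {"domain": "example.com"}
```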
Visibility
Inputs are tracked to understand where they came from and/or what sequence of operations generated them.
DataFlows can be visualized to understand where inputs can come from. Because diagrams are generated from the DataFlow itself, what you see is what you get: diagrams showing how your application works in your documentation will never get out of sync.
Ease of use
Execute code concurrently with managed locking of Inputs which require locks to be used safely in a concurrent environment. If a resource can only be used by one operation at a time, the writer of the operation doesn't need to concern themselves with how to guard against unknown user-defined operations clobbering it; the Orchestrator manages locking (see the sketch below). As DFFML is plugin based, this enables developers to easily write and publish operations without users having to worry about how various operations will interact with each other.
DataFlows can be used in many environments. They are a generic way to describe application logic and are not tied to any particular programming language (currently we only have an implementation for Python, but we provide multiple deployment options).
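For the managed locking mentioned above, a Definition can declare that its Inputs must be locked before use. A minimal sketch, assuming Definition's lock field; the definition name here is illustrative:

```python
# A minimal sketch, assuming Definition's lock field. Inputs of this
# definition may only be used by one running operation at a time; the
# Orchestrator's LockNetwork acquires the lock before dispatch.
from dffml.df.types import Definition

DATABASE_HANDLE = Definition(
    name="database_handle",
    primitive="object",
    lock=True,  # operations using this Input run under a lock
)
```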
Security
Clear trust boundaries via Input origins and built-in input validation enable developers to ensure that untrusted inputs are properly validated.
DataFlows are a serializable, programming-language-agnostic concept which can be validated according to any set of custom rules (see the sketch below).
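Since a DataFlow serializes to plain data, it can be inspected and validated before it is ever run. A sketch, assuming DataFlow.export() and the dataflow object from the earlier running sketch; the allowlist rule is an illustrative custom check, not a DFFML API:

```python
# A sketch of validating a serialized DataFlow, assuming
# DataFlow.export() produces a plain dict with an "operations" mapping.
# The allowlist below is an illustrative custom rule, not a DFFML API.
import json

exported = dataflow.export()
print(json.dumps(exported, indent=2, default=str))

ALLOWED_OPERATIONS = {"get_domain", "get_single"}
unknown = set(exported.get("operations", {})) - ALLOWED_OPERATIONS
if unknown:
    raise ValueError(f"DataFlow uses operations outside the allowlist: {unknown}")
```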