DataFlows¶
A running DataFlow is an event loop. First we’ll look at terminology associated with DataFlows. Then we’ll go through the sequence of events that constitute the running of a DataFlow. Lastly we’ll go over the benefits of using DataFlows.
Terminology¶
- Operation
  Things that will happen when the DataFlow is running. They define inputs and outputs. Inputs are the data they require to run, and outputs are the data they produce as a result. Similar to a function prototype in C, an Operation only contains metadata.
- Operation Implementation
  The implementation of an Operation. This is the code that gets run when we talk about “running an operation”. A Python function can be an OperationImplementation (see the sketch after this list).
- Input
  Data that will be given to an Operation when it runs.
- DataFlow
  Description of how Operations are connected. Defines where Operations should get their inputs from. Inputs can be received from the outputs of other operations, predefined seed values, or anywhere else.
- Orchestrator
  The runner of the DataFlow. Facilitates running of operations and manages input data. The Orchestrator makes use of four different “Networks” and a RedundancyChecker.
  - The InputNetwork stores all the (Input) data. It accepts incoming data and notifies the Orchestrator when there is new data.
  - The OperationNetwork stores all Operations the Orchestrator knows about.
  - The OperationImplementationNetwork is responsible for running an Operation with a set of Inputs. A unique set of Inputs for an Operation is known as a ParameterSet.
  - The LockNetwork manages locking of Inputs. This is used when the Definition of the data type of an Input declares that it may only be used when locked.
  - The RedundancyChecker ensures that Operations don’t get run with the same ParameterSet more than once.

Operations get their inputs from the outputs of other Operations within the same InputSetContext. InputSetContexts create barriers which prevent Inputs within one context from being combined with Inputs within another context.
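For instance, here is a minimal sketch of a Python function serving as an OperationImplementation, assuming DFFML’s op decorator and Definition class. The decorator records the Operation metadata (named inputs and outputs), while the function body is the implementation that gets run; the definition names and function here are invented for illustration.

```python
from dffml import Definition, op

# Hypothetical Definitions describing the data types of Inputs
MESSAGE = Definition(name="message", primitive="string")
SHOUTED = Definition(name="shouted_message", primitive="string")

# @op attaches the Operation metadata; the function body is the
# OperationImplementation that gets run.
@op(inputs={"message": MESSAGE}, outputs={"shouted": SHOUTED})
async def shout(message: str) -> dict:
    # Operations return a dict mapping output names to values
    return {"shouted": message.upper() + "!"}
```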
What Happens When A DataFlow Runs¶
When the Orchestrator starts running a DataFlow, the following sequence of events takes place.

- The OperationImplementationNetwork instantiates all of the OperationImplementations that are needed by the DataFlow.

Our first stage is the Processing Stage, where data will be generated.

- The Orchestrator kicks off any contexts that were given to the run method, along with the inputs for each context.
- All seed Inputs are added to each context.
- All inputs for each context are added to the InputNetwork. This is the New Inputs step in the flow chart below.
- The OperationNetwork looks at what inputs just arrived. It determines which Operations may have new parameter sets. If an Operation has inputs whose possible origins include the origin of one of the inputs which just arrived, then it may have a new ParameterSet.
- We generate Operation parameter set pairs by checking if there are any new permutations of Inputs for an Operation. If the RedundancyChecker has no record of that permutation being run, we create a new ParameterSet composed of those Inputs.
- We dispatch operations for running which have new ParameterSets.
- The LockNetwork locks any Inputs which can’t have multiple operations use them at the same time.
- The OperationImplementationNetwork runs each operation, using the given parameter set as inputs.
- The outputs of the Operation are added to the InputNetwork and the loop repeats.

Once there are no more Operation ParameterSet pairs which the RedundancyChecker knows to be unique, the Cleanup Stage begins. The Cleanup Stage contains operations which will release any underlying resources allocated for Inputs generated during the Processing Stage.

Finally, the Output Stage runs. Operations running in this stage query the InputNetwork to organize the data within it into the user’s desired output format. A runnable sketch of this whole sequence follows.
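To make the sequence concrete, here is a hedged, minimal sketch using DataFlow, Input, GetSingle, and run as exposed by the dffml package (the operation and definition names are invented for illustration). The seed Input configures the GetSingle output operation, which runs during the Output Stage to pull the requested Definition out of the InputNetwork.

```python
import asyncio

from dffml import DataFlow, Definition, GetSingle, Input, op, run

NUMBER = Definition(name="number", primitive="int")
DOUBLED = Definition(name="doubled_number", primitive="int")

@op(inputs={"number": NUMBER}, outputs={"doubled": DOUBLED})
async def double(number: int) -> dict:
    return {"doubled": number * 2}

# DataFlow.auto() connects operations by matching their Definitions
dataflow = DataFlow.auto(double, GetSingle)
# A seed Input telling GetSingle which Definition to report during
# the Output Stage
dataflow.seed.append(
    Input(value=[DOUBLED.name], definition=GetSingle.op.inputs["spec"])
)

async def main():
    # The list of Inputs becomes one context; the Orchestrator adds
    # them to the InputNetwork as the New Inputs step
    async for ctx, results in run(
        dataflow, [Input(value=21, definition=NUMBER)]
    ):
        print(results)  # expected: {'doubled_number': 42}

asyncio.run(main())
```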
Benefits of DataFlows¶
Modularity
Adding a layer of abstraction to separate the operations from their implementations means we focus on the logic of the application rather than how it’s implemented.
Implementations are easily unit testable. They can be swapped out for another implementation with similar functionality. For example, if you had a “send email” operation, you could swap the implementation from sending via your email server to sending via a third party service, as shown in the sketch below.
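As a sketch of that claim (using the same assumed @op API as above, with a stubbed, hypothetical send_email): the implementation is just an async Python function, so it can be unit tested by calling it directly, with no Orchestrator involved, and its body can be swapped without touching the Operation metadata other operations are wired against.

```python
import asyncio

from dffml import Definition, op

MESSAGE = Definition(name="email_message", primitive="string")
SENT = Definition(name="email_sent", primitive="bool")

@op(inputs={"message": MESSAGE}, outputs={"sent": SENT})
async def send_email(message: str) -> dict:
    # Stubbed implementation; swapping delivery from your own email
    # server to a third party service only changes this body, not
    # the Operation metadata.
    return {"sent": bool(message)}

async def test_send_email():
    # Unit test the implementation by calling the function directly
    assert await send_email("hello") == {"sent": True}

asyncio.run(test_send_email())
```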
Visibility
Inputs are tracked so that we understand where they came from and what sequence of operations generated them.
DataFlows can be visualized to understand where inputs can come from. What you see is what you get. Diagrams showing how your application works in your documentation will never get out of sync.
Ease of use
Execute code concurrently with managed locking of Inputs which require locks to be used safely in a concurrent environment. If a resource can only be used by one operation at a time, the writer of the operation doesn’t need to concern themselves with how to prevent unknown user-defined operations from clobbering it. The Orchestrator manages locking. As DFFML is plugin based, this enables developers to easily write and publish operations without users having to worry about how various operations will interact with each other.
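For example, a Definition can declare that its Inputs must be locked before use; a minimal sketch, assuming Definition’s lock flag (the name here is hypothetical):

```python
from dffml import Definition

# lock=True tells the LockNetwork that an Input of this type may
# only be used by one operation at a time; operations consuming it
# don't need their own locking logic.
GIT_CHECKOUT_DIR = Definition(
    name="git_checkout_directory", primitive="string", lock=True
)
```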
DataFlows can be used in many environments. They are a generic way to describe application logic and are not tied to any particular programming language (currently we only have an implementation for Python, though we provide multiple deployment options).
Security
Clear trust boundaries via Input origins and built-in input validation enable developers to ensure that untrusted inputs are properly validated. DataFlows are a serializable, programming-language-agnostic concept which can be validated according to any set of custom rules.
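A hedged sketch of the validation side, assuming Definition accepts a validate callable (the rule below is invented for illustration): the callable runs before untrusted Inputs of that type reach any operation, normalizing the value or raising to reject it.

```python
from dffml import Definition

def validate_port(value: int) -> int:
    # Hypothetical rule: reject values outside the valid TCP range
    if not 0 < value < 65536:
        raise ValueError(f"invalid port: {value!r}")
    return value

PORT = Definition(name="port", primitive="int", validate=validate_port)
```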