DataFlows

A running DataFlow is an event loop. First we’ll look at terminology associated with DataFlows. Then we’ll go through the sequence of events that constitute the running of a DataFlow. Lastly we’ll go over the benefits of using DataFlows.

Terminology

  • DataFlow: a programming language agnostic description of application logic as Operations connected by the Inputs they produce and consume.

  • Orchestrator: the component which runs a DataFlow, dispatching Operations and managing the locking of Inputs.

  • Operation: a unit of application logic, defined separately from the code which implements it.

  • OperationImplementation: the concrete code which carries out an Operation.

  • Input: a piece of data flowing through a DataFlow. Every Input has an origin: seed, or the Operation which generated it.

  • ParameterSet: a unique permutation of Inputs used as the parameters for one run of an Operation.

  • InputNetwork: stores the Inputs for each context.

  • OperationNetwork: tracks Operations and determines which of them may have new ParameterSets as Inputs arrive.

  • OperationImplementationNetwork: instantiates OperationImplementations and runs them with ParameterSets.

  • RedundancyChecker: keeps a record of which Operation and ParameterSet pairs have already been run.

  • LockNetwork: manages locks on Inputs which can’t have multiple operations use them at the same time.

  • Stage: DataFlows run in three stages: Processing, where data is generated; Cleanup, where resources are released; and Output, where data is organized into the user’s desired format.

What Happens When A DataFlow Runs

When the Orchestrator starts running a DataFlow, the following sequence of events takes place. A simplified sketch of the resulting loop follows the flow chart below.

  • The OperationImplementationNetwork instantiates all of the OperationImplementations needed by the DataFlow.

  • Our first stage is the Processing Stage, where data will be generated.

  • The Orchestrator kicks off any contexts that were given to the run method along with the inputs for each context.

    • All seed Inputs are added to each context.

    • All inputs for each context are added to the InputNetwork. This is the New Inputs step in the flow chart below.

  • The OperationNetwork looks at which inputs just arrived and determines which Operations may have new ParameterSets. If an Operation has inputs whose possible origins include the origin of one of the inputs which just arrived, it may have a new ParameterSet.

  • We generate Operation and ParameterSet pairs by checking whether there are any new permutations of Inputs for an Operation. If the RedundancyChecker has no record of a permutation being run, we create a new ParameterSet composed of those Inputs.

  • Operations which have new ParameterSets are dispatched for running.

  • The LockNetwork locks any Inputs which can’t have multiple operations use them at the same time.

  • The OperationImplementationNetwork runs each dispatched operation, using its ParameterSet as inputs.

  • The outputs of the Operation are added to the InputNetwork and the loop repeats.

  • Once there are no more Operation and ParameterSet pairs which the RedundancyChecker knows to be unique, the Cleanup Stage begins.

  • The Cleanup Stage contains operations which will release any underlying resources allocated for Inputs generated during the Processing Stage.

  • Finally, the Output Stage runs. Operations running in this stage query the InputNetwork to organize the data within it into the user’s desired output format.

[Flow chart showing how the DataFlow Orchestrator works]
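
The Processing Stage loop described above can be illustrated with a short, deliberately simplified sketch. The class and function names below are ours for illustration, not DFFML’s real classes, and the real Orchestrator runs asynchronously with per-context state; this toy only shows how arriving Inputs trigger ParameterSet generation, how the RedundancyChecker stops permutations from running twice, and how Operation outputs feed back into the loop as new Inputs.

    import itertools
    from typing import Any, Callable, Dict, List, Tuple

    class RedundancyChecker:
        """Records which Operation and ParameterSet pairs have already run."""

        def __init__(self) -> None:
            self.seen: set = set()

        def unique(self, operation: str, parameter_set: Tuple) -> bool:
            key = (operation, parameter_set)
            if key in self.seen:
                return False
            self.seen.add(key)
            return True

    def run_processing_stage(
        operations: Dict[str, Tuple[List[str], Callable]],
        seed: List[Tuple[str, Any]],
    ) -> List[Tuple[str, Any]]:
        """operations maps a name to (input definitions, implementation).
        Inputs are (definition, value) pairs; implementations yield them."""
        checker = RedundancyChecker()
        input_network: List[Tuple[str, Any]] = list(seed)
        new_inputs = list(seed)
        while new_inputs:  # the event loop: repeat while Inputs keep arriving
            arrived, new_inputs = new_inputs, []
            for name, (defs, impl) in operations.items():
                # Only operations that consume one of the definitions which
                # just arrived may have new ParameterSets.
                if not any(d in defs for d, _ in arrived):
                    continue
                # Candidate ParameterSets: one value per needed definition,
                # drawn from everything currently in the input network.
                pools = [[v for d, v in input_network if d == needed]
                         for needed in defs]
                for values in itertools.product(*pools):
                    if not checker.unique(name, values):
                        continue  # this permutation already ran
                    for output in impl(*values):  # run, collect outputs
                        input_network.append(output)
                        new_inputs.append(output)
        return input_network

For example, an operation which doubles numbers, fed two seed Inputs:

    def double(x):
        yield ("doubled", x * 2)  # outputs are (definition, value) pairs

    network = run_processing_stage({"double": (["number"], double)},
                                   [("number", 1), ("number", 2)])
    print(network)  # the seeds plus ("doubled", 2) and ("doubled", 4)

The loop ends once no Operation has a permutation the RedundancyChecker considers unique, which is exactly when the Cleanup Stage begins.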

Benefits of DataFlows

  • Modularity

    • Adding a layer of abstraction which separates operations from their implementations means we focus on the logic of the application rather than how it’s implemented.

    • Implementations are easily unit testable. They can be swapped out for another implementation with similar functionality. For example, if you had a “send email” operation you could swap the implementation from sending via your email server to sending via a third party service (see the first sketch at the end of this section).

  • Visibility

    • Inputs are tracked so you can understand where they came from and what sequence of operations generated them.

    • DataFlows can be visualized to understand where inputs can come from. What you see is what you get: diagrams in your documentation showing how your application works will never get out of sync.

  • Ease of use

    • Execute code concurrently, with managed locking of Inputs which require locks to be used safely in a concurrent environment.

      • If a resource can only be used by one operation at a time, the writer of the operation doesn’t need to concern themselves with how to prevent unknown user-defined operations from clobbering it. The Orchestrator manages locking (see the locking sketch at the end of this section).

      • As DFFML is plugin based, this enables developers to easily write and publish operations without users having to worry about how various operations will interact with each other.

    • DataFlows can be used in many environments. They are a generic way to describe application logic, not tied to any particular programming language (currently Python is the only implementation, though we provide multiple deployment options).

  • Security

    • Clear trust boundaries via Input origins and built-in input validation enable developers to ensure that untrusted inputs are properly validated (see the validation sketch at the end of this section).

    • DataFlows are a serializable, programming language agnostic concept which can be validated according to any set of custom rules (see the final sketch below).
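
To make the Modularity point concrete, here is a minimal, hypothetical sketch of the “send email” example: the operation is just a named contract, and either implementation satisfies it. The names are ours for illustration, not DFFML’s API.

    from typing import Callable, Dict

    def send_via_smtp(recipient: str, body: str) -> None:
        print(f"SMTP -> {recipient}: {body}")  # stand-in for a real SMTP call

    def send_via_third_party(recipient: str, body: str) -> None:
        print(f"API  -> {recipient}: {body}")  # stand-in for a vendor API call

    # The application logic refers to the operation by name only; which
    # implementation is instantiated is decided when the flow runs, and
    # each implementation can be unit tested on its own.
    IMPLEMENTATIONS: Dict[str, Callable[[str, str], None]] = {
        "send_email": send_via_smtp,  # swap in send_via_third_party freely
    }

    IMPLEMENTATIONS["send_email"]("user@example.com", "Hello from a DataFlow")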
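
The managed locking described under Ease of use can be sketched with plain asyncio. The LockNetwork below is a hypothetical stand-in for the real component; it shows only the idea that the Orchestrator acquires a per-Input lock before handing the Input to an operation, so operation authors never write locking code themselves.

    import asyncio
    from typing import Dict

    class LockNetwork:
        def __init__(self) -> None:
            self.locks: Dict[int, asyncio.Lock] = {}

        def lock_for(self, input_id: int) -> asyncio.Lock:
            # One lock per lockable Input, created on first use.
            return self.locks.setdefault(input_id, asyncio.Lock())

    async def dispatch(locks: LockNetwork, input_id: int, name: str) -> None:
        async with locks.lock_for(input_id):  # Orchestrator-managed lock
            print(f"{name} has exclusive use of Input {input_id}")
            await asyncio.sleep(0.01)  # the operation body runs here

    async def main() -> None:
        locks = LockNetwork()
        # Two operations that know nothing about each other race for the
        # same Input; the lock serializes their access.
        await asyncio.gather(
            dispatch(locks, 42, "operation_a"),
            dispatch(locks, 42, "operation_b"),
        )

    asyncio.run(main())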
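
The trust boundary idea under Security can also be sketched: every Input carries its origin, and definitions crossing a trust boundary must pass validation before being accepted into the network. The Input dataclass and validator table here are illustrations under our own naming, not DFFML’s built-in validation API.

    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class Input:
        definition: str
        value: str
        origin: str  # "seed", an operation name, or an untrusted source

    VALIDATORS: Dict[str, Callable[[str], bool]] = {
        # Definitions listed here must validate before crossing the boundary.
        "url": lambda value: value.startswith("https://"),
    }

    def accept(new_input: Input) -> bool:
        validator = VALIDATORS.get(new_input.definition)
        if new_input.origin == "seed" or validator is None:
            return True  # trusted origin, or no validation required
        return validator(new_input.value)

    print(accept(Input("url", "https://example.com", "web_form")))   # True
    print(accept(Input("url", "javascript:alert(1)", "web_form")))   # False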
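
Finally, a sketch of the serializability claim: because a DataFlow is plain data, it can be exported in a language-agnostic form and checked against custom rules before it is ever run. The dictionary layout below is an assumption for illustration, not DFFML’s actual serialization format.

    import json

    dataflow = {
        "operations": {"send_email": {"inputs": ["recipient", "body"]}},
        "seed": [{"definition": "recipient", "value": "user@example.com"}],
    }

    serialized = json.dumps(dataflow, indent=2)  # language-agnostic form

    # A custom rule: refuse to run flows containing unapproved operations.
    APPROVED = {"send_email"}
    assert set(json.loads(serialized)["operations"]) <= APPROVED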