DataFlows¶
A running DataFlow is an event loop. First we’ll look at terminology associated with DataFlows. Then we’ll go through the sequence of events that constitute the running of a DataFlow. Lastly we’ll go over the benefits of using DataFlows.
Terminology¶
- Operation
  Things that will happen when the DataFlow is running. They define inputs and outputs. Inputs are the data they require to run, and outputs are the data they produce as a result. Similar to a function prototype in C, an Operation only contains metadata.
- Operation Implementation
  The implementation of an Operation. This is the code that gets run when we talk about “running an operation”. A Python function can be an OperationImplementation (see the sketch after this list).
- Input
  Data that will be given to an Operation when it runs.
- DataFlow
  Description of how Operations are connected. Defines where Operations should get their inputs from. Inputs can be received from the outputs of other operations, predefined seed values, or anywhere else.
- Orchestrator
  The runner of the DataFlow. Facilitates running of operations and manages input data. The Orchestrator makes use of four different “Networks” and a RedundancyChecker.
  - The InputNetwork stores all the (Input) data. It accepts incoming data and notifies the Orchestrator when there is new data.
  - The OperationNetwork stores all Operations the Orchestrator knows about.
  - The OperationImplementationNetwork is responsible for running an Operation with a set of Inputs. A unique set of Inputs for an Operation is known as a ParameterSet.
  - The LockNetwork manages locking of Inputs. This is used when the Definition of the data type of an Input declares that it may only be used when locked.
  - The RedundancyChecker ensures that Operations don’t get run with the same ParameterSet more than once.

Operations get their inputs from the outputs of other Operations within the same InputSetContext. InputSetContexts create barriers which prevent Inputs within one context from being combined with Inputs within another context.
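For instance, here is a minimal sketch of a Python function serving as an OperationImplementation, assuming DFFML’s op decorator and Definition class. The decorator records the Operation metadata (named inputs and outputs), while the function body is the implementation that gets run; the definition names and function here are invented for illustration.

```python
from dffml import Definition, op

# Hypothetical Definitions describing the data types of Inputs
MESSAGE = Definition(name="message", primitive="string")
SHOUTED = Definition(name="shouted_message", primitive="string")

# @op attaches the Operation metadata; the function body is the
# OperationImplementation that gets run.
@op(inputs={"message": MESSAGE}, outputs={"shouted": SHOUTED})
async def shout(message: str) -> dict:
    # Operations return a dict mapping output names to values
    return {"shouted": message.upper() + "!"}
```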
What Happens When A DataFlow Runs¶
When the Orchestrator starts running a DataFlow, the following sequence of events takes place.

- The OperationImplementationNetwork instantiates all of the OperationImplementations that are needed by the DataFlow.

Our first stage is the Processing Stage, where data will be generated.

- The Orchestrator kicks off any contexts that were given to the run method, along with the inputs for each context.
- All seed Inputs are added to each context.
- All inputs for each context are added to the InputNetwork. This is the New Inputs step in the flow chart below.
- The OperationNetwork looks at what inputs just arrived. It determines which Operations may have new parameter sets. If an Operation has inputs whose possible origins include the origin of one of the inputs which just arrived, then it may have a new ParameterSet.
- We generate Operation parameter set pairs by checking if there are any new permutations of Inputs for an Operation. If the RedundancyChecker has no record of that permutation being run, we create a new ParameterSet composed of those Inputs.
- We dispatch operations for running which have new ParameterSets.
- The LockNetwork locks any Inputs which can’t have multiple operations use them at the same time.
- The OperationImplementationNetwork runs each operation, using the given parameter set as inputs.
- The outputs of the Operation are added to the InputNetwork and the loop repeats.

Once there are no more Operation ParameterSet pairs which the RedundancyChecker knows to be unique, the Cleanup Stage begins. The Cleanup Stage contains operations which will release any underlying resources allocated for Inputs generated during the Processing Stage.

Finally, the Output Stage runs. Operations running in this stage query the InputNetwork to organize the data within it into the user’s desired output format. A runnable sketch of this whole sequence follows.
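To make the sequence concrete, here is a hedged, minimal sketch using DataFlow, Input, GetSingle, and run as exposed by the dffml package (the operation and definition names are invented for illustration). The seed Input configures the GetSingle output operation, which runs during the Output Stage to pull the requested Definition out of the InputNetwork.

```python
import asyncio

from dffml import DataFlow, Definition, GetSingle, Input, op, run

NUMBER = Definition(name="number", primitive="int")
DOUBLED = Definition(name="doubled_number", primitive="int")

@op(inputs={"number": NUMBER}, outputs={"doubled": DOUBLED})
async def double(number: int) -> dict:
    return {"doubled": number * 2}

# DataFlow.auto() connects operations by matching their Definitions
dataflow = DataFlow.auto(double, GetSingle)
# A seed Input telling GetSingle which Definition to report during
# the Output Stage
dataflow.seed.append(
    Input(value=[DOUBLED.name], definition=GetSingle.op.inputs["spec"])
)

async def main():
    # The list of Inputs becomes one context; the Orchestrator adds
    # them to the InputNetwork as the New Inputs step
    async for ctx, results in run(
        dataflow, [Input(value=21, definition=NUMBER)]
    ):
        print(results)  # expected: {'doubled_number': 42}

asyncio.run(main())
```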
Benefits of DataFlows¶
Modularity
Adding a layer of abstraction to separate the operations from their implementations means we focus on the logic of the application rather than how it’s implemented.
Implementations are easily unit testable. They can be swapped out for another implementation with similar functionality. For example, if you had a “send email” operation, you could swap the implementation from sending via your email server to sending via a third party service, as shown in the sketch below.
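As a sketch of that claim (using the same assumed @op API as above, with a stubbed, hypothetical send_email): the implementation is just an async Python function, so it can be unit tested by calling it directly, with no Orchestrator involved, and its body can be swapped without touching the Operation metadata other operations are wired against.

```python
import asyncio

from dffml import Definition, op

MESSAGE = Definition(name="email_message", primitive="string")
SENT = Definition(name="email_sent", primitive="bool")

@op(inputs={"message": MESSAGE}, outputs={"sent": SENT})
async def send_email(message: str) -> dict:
    # Stubbed implementation; swapping delivery from your own email
    # server to a third party service only changes this body, not
    # the Operation metadata.
    return {"sent": bool(message)}

async def test_send_email():
    # Unit test the implementation by calling the function directly
    assert await send_email("hello") == {"sent": True}

asyncio.run(test_send_email())
```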
Visibility
Inputs are tracked so that we understand where they came from and what sequence of operations generated them.
DataFlows can be visualized to understand where inputs can come from. What you see is what you get. Diagrams showing how your application works in your documentation will never get out of sync.
Ease of use
Execute code concurrently with managed locking of Inputs which require locks to be used safely in a concurrent environment. If a resource can only be used by one operation at a time, the writer of the operation doesn’t need to concern themselves with how to prevent unknown user-defined operations from clobbering it. The Orchestrator manages locking. As DFFML is plugin based, this enables developers to easily write and publish operations without users having to worry about how various operations will interact with each other.
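For example, a Definition can declare that its Inputs must be locked before use; a minimal sketch, assuming Definition’s lock flag (the name here is hypothetical):

```python
from dffml import Definition

# lock=True tells the LockNetwork that an Input of this type may
# only be used by one operation at a time; operations consuming it
# don't need their own locking logic.
GIT_CHECKOUT_DIR = Definition(
    name="git_checkout_directory", primitive="string", lock=True
)
```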
DataFlows can be used in many environments. They are a generic way to describe application logic and are not tied to any particular programming language (currently we only have an implementation for Python, though we provide multiple deployment options).
Security
Clear trust boundaries via Input origins and built-in input validation enable developers to ensure that untrusted inputs are properly validated. DataFlows are a serializable, programming-language-agnostic concept which can be validated according to any set of custom rules.
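A hedged sketch of the validation side, assuming Definition accepts a validate callable (the rule below is invented for illustration): the callable runs before untrusted Inputs of that type reach any operation, normalizing the value or raising to reject it.

```python
from dffml import Definition

def validate_port(value: int) -> int:
    # Hypothetical rule: reject values outside the valid TCP range
    if not 0 < value < 65536:
        raise ValueError(f"invalid port: {value!r}")
    return value

PORT = Definition(name="port", primitive="int", validate=validate_port)
```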