metalus

This project aims to make writing Spark applications easier by abstracting the effort to assemble the driver into reusable steps and pipelines.

Pipeline Execution and Data Flow

The order in which pipeline steps execute is independent of the data flow. A step may or may not use the output of the step that ran immediately before it. The only hard requirement is that a step may not reference the output of a step that has not yet executed.
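
The ordering rule above can be checked mechanically. The following is a minimal sketch in Python, not part of Metalus itself: the function name `find_forward_references` and the dictionary layout are assumptions made for illustration only.

```python
def find_forward_references(execution_order, step_inputs):
    """Return (step, referenced_step) pairs that break the ordering rule.

    execution_order: list of step ids in the order they run.
    step_inputs: dict mapping a step id to the step ids whose output it reads.
    Both structures are hypothetical and only model the rule described above.
    """
    executed = set()
    violations = []
    for step in execution_order:
        for source in step_inputs.get(step, []):
            # A step may only read output from a step that has already run.
            if source not in executed:
                violations.append((step, source))
        executed.add(step)
    return violations
```

An empty result means every step only reads output that already exists at the time it runs.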

Basic Step Flow

Below is an example of a pipeline containing four steps, globals and runtime parameters. The main path arrows in gray illustrate the execution flow, while the purple arrows illustrate how data is mapped during execution.

[Figure: Basic Step Flow]

Execution Flow

  • Step 1
  • Step 2
  • Step 3
  • Step 4

Data Flow

  1. Step 1 has data mapped from globals.
  2. Step 3 has the output of Step 1 mapped as input.
  3. Step 2 has data mapped from globals.
  4. The output of Step 3 is mapped to the input of Step 4.
  5. One or more runtime parameters will be mapped to the input of Step 4.
  6. One or more globals will be mapped to the input of Step 4.
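
The separation between execution order and data mapping in this example can be modeled with a small sketch. This is an illustrative Python model under stated assumptions, not the Metalus API: `run_pipeline`, the `(kind, key)` mapping tuples, the step functions, and all global and parameter names are hypothetical.

```python
def run_pipeline(steps, global_values, runtime_params):
    """Run steps in list order and resolve each input from its mapped source.

    steps: list of (step_id, mappings, func), where mappings maps an argument
    name to a (kind, key) pair and kind is "global", "runtime" or "step".
    """
    outputs = {}
    for step_id, mappings, func in steps:
        args = {}
        for name, (kind, key) in mappings.items():
            if kind == "global":
                args[name] = global_values[key]
            elif kind == "runtime":
                args[name] = runtime_params[key]
            elif kind == "step":
                # Raises KeyError if the referenced step has not executed yet.
                args[name] = outputs[key]
            else:
                raise ValueError(f"unknown mapping kind: {kind}")
        outputs[step_id] = func(**args)
    return outputs


# Hypothetical steps wired the same way as the data flow list above.
BASIC_STEPS = [
    ("Step1", {"path": ("global", "input_path")},
     lambda path: f"data from {path}"),
    ("Step2", {"level": ("global", "log_level")},
     lambda level: f"logging at {level}"),
    ("Step3", {"data": ("step", "Step1")},
     lambda data: data.upper()),
    ("Step4", {"data": ("step", "Step3"),
               "target": ("runtime", "target"),
               "mode": ("global", "mode")},
     lambda data, target, mode: f"{mode} {data} to {target}"),
]
```

Step 2 executes between Step 1 and Step 3 but takes no part in the data passed from Step 1 to Step 3, which is the point the example makes.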

Branch Step Flow

The separation of execution and data flow is easier to illustrate with a more complex example. Below is a pipeline that contains a branch step. The main path arrows in gray illustrate the execution flow, while the purple arrows illustrate how data is mapped during execution.

[Figure: Branch Step Flow]

Execution Flow

  • Step 1
  • Step 2
  • The branch step will choose one of three paths.

Path A

  • Step 3

Path B

  • Step 7

Path C

  • Step 5
  • Step 6

Final Step

  • Step 9
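
As a rough illustration of the execution flow above, the sketch below lists which numbered steps run for each branch result. It is a conceptual Python model only: `BRANCH_PATHS`, `execution_order`, and the path labels used as keys are assumptions, and the branch step itself is left out of the list because its identifier is not given here.

```python
# Steps that run on each path, taken from the execution flow list above.
BRANCH_PATHS = {
    "A": ["Step3"],
    "B": ["Step7"],
    "C": ["Step5", "Step6"],
}


def execution_order(branch_result):
    """Return the numbered steps that execute for the chosen path.

    The branch step runs after Step2 and selects the path; an unknown
    branch_result raises KeyError.
    """
    return ["Step1", "Step2"] + BRANCH_PATHS[branch_result] + ["Step9"]
```

Whichever path is chosen, Step 9 always runs last, but only one of Step 3, Step 7, or Step 6 will have produced output by then.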

Data Flow

  1. Step 1 has data mapped from globals.
  2. Step 7 has the output of Step 1 mapped as input.
  3. Step 5 has the output of Step 1 mapped as input.
  4. Step 2 has data mapped from globals.
  5. Step 2 has data mapped from runtime parameters.
  6. Step 3 has the output of Step 2 mapped as input.
  7. Step 6 has the output of Step 5 mapped as input.
  8. Step 9 has the output of Step 3 mapped as input.
  9. Step 9 has the output of Step 7 mapped as input.
  10. Step 9 has the output of Step 6 mapped as input.
  11. Step 9 has data mapped from runtime parameters.

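Because only one path executes, only one of the outputs in items 8, 9 and 10 will exist when Step 9 runs. All three may be mapped to Step 9 as separate inputs, or Step 9 may use alternate value mapping to take the output of whichever step actually executed.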
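
The idea behind picking the output of the step that actually ran can be sketched as a first-available lookup. This is a conceptual Python model, not the syntax Metalus uses for alternate value mapping: the function `first_available` and the `outputs` dictionary are hypothetical.

```python
def first_available(outputs, candidates):
    """Return the output of the first candidate step that has executed.

    outputs: dict of step id to output, holding only steps that have run.
    candidates: step ids to try, in order of preference.
    Returns None when none of the candidates executed.
    """
    for step_id in candidates:
        if step_id in outputs:
            return outputs[step_id]
    return None
```

For example, if the branch chose Path B, `outputs` would hold entries for Step 1, Step 2 and Step 7, and calling `first_available(outputs, ["Step3", "Step7", "Step6"])` would yield the output of Step 7.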