metalus

This project aims to make writing Spark applications easier by abstracting the effort to assemble the driver into reusable steps and pipelines.

JSON Pipelines

A critical part of any application is the pipeline (or pipelines) that gets executed. Building pipelines in JSON gives developers a way to construct Spark applications from existing step libraries without the need to write and deploy code. Once a pipeline is designed, it may be delivered in several ways:

  • As part of an application JSON
  • An API (a custom PipelineManager or DriverSetup would be required)
  • Embedded in a jar
  • A disk location (local, HDFS, S3, etc.) (a custom PipelineManager or DriverSetup would be required)

The Metalus library will then convert the JSON into a Scala object to be executed.
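
As a minimal sketch of that conversion, assuming the metalus-core helper DriverUtils.parsePipelineJson (the exact method name and signature should be verified against the Metalus version in use), a pipeline JSON string loaded from disk could be parsed like this:

import scala.io.Source

import com.acxiom.pipeline.drivers.DriverUtils

// Sketch only: read the pipeline JSON from a local path and hand it to Metalus.
// parsePipelineJson is assumed to return the parsed pipeline(s) as Scala objects.
val pipelineJson = Source.fromFile("/tmp/my_pipeline.json").mkString
val pipelines = DriverUtils.parsePipelineJson(pipelineJson)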

There are three components needed to create a pipeline:

  • Step Template - This is the JSON that describes the step. Pipeline steps start with this template and modify it to satisfy pipeline requirements.
  • Pipeline Step - The step template is copied and modified with additional attributes and mappings before being added to the pipeline JSON.
  • Pipeline - Contains information about the pipeline including which steps to execute at runtime.

Pipeline Base

A pipeline has a base structure that is used to execute steps:

{
    "id": "b6eed286-9f7e-4ca2-a448-a5ccfbf34e6b",
    "name": "My Pipeline",
    "steps": []
}
  • id - A unique GUID that represents this pipeline
  • name - A name that may be used in logging during the execution of the pipeline
  • steps - The list of pipeline steps that will be executed

Optionally, a system may add additional information such as layout (for the steps) and metadata management (create/modified date and user).
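
As a purely illustrative example (the extra attribute names here are hypothetical and would be defined by the managing system, not by Metalus), a pipeline stored by such a system might look like:

{
    "id": "b6eed286-9f7e-4ca2-a448-a5ccfbf34e6b",
    "name": "My Pipeline",
    "creationDate": "2020-01-01T00:00:00Z",
    "modifiedDate": "2020-01-02T12:00:00Z",
    "modifiedUser": "some_user",
    "layout": {},
    "steps": []
}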

Pipeline Steps

A pipeline step begins as a copy of a step template, but several crucial attributes are changed or added. More information can be found here.
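
As a trimmed sketch based on the complete example in the next section, the step template's unique id moves to the stepId attribute, id is replaced with a pipeline-scoped identifier, nextStepId is added to chain the steps together, and parameter values or mappings are filled in:

{
    "id": "LOADFROMPATH",
    "stepId": "87db259d-606e-46eb-b723-82923349640f",
    "nextStepId": "WRITE",
    "displayName": "Load DataFrame from HDFS path",
    "params": [
        {
            "type": "text",
            "name": "path",
            "required": false,
            "value": "/tmp/input_file.csv"
        }
    ]
}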

Pipeline Final

Below is an example of how a basic two-step pipeline may look once complete:

{
    "id": "",
    "name": "",
    "category": "pipeline",
    "steps": [
        {
            "id": "LOADFROMPATH",
            "displayName": "Load DataFrame from HDFS path",
            "description": "This step will read a dataFrame from the given HDFS path",
            "type": "Pipeline",
            "category": "InputOutput",
            "nextStepId": "WRITE",
            "params": [
                {
                    "type": "text",
                    "name": "path",
                    "required": false,
                    "value": "/tmp/input_file.csv"
                },
                {
                    "type": "object",
                    "name": "options",
                    "required": false,
                    "className": "com.acxiom.pipeline.steps.DataFrameReaderOptions"
                }
            ],
            "engineMeta": {
                "spark": "HDFSSteps.readFromPath",
                "pkg": "com.acxiom.pipeline.steps"
            },
            "tags": [
              "metalus-common_2.11-spark_2.4-1.5.0-SNAPSHOT.jar"
            ],
            "stepId": "87db259d-606e-46eb-b723-82923349640f"
        },
        {
            "id": "WRITE",
            "displayName": "Write DataFrame to table using JDBCOptions",
            "description": "This step will write a DataFrame as a table using JDBCOptions",
            "type": "Pipeline",
            "category": "InputOutput",
            "params": [
                {
                    "type": "text",
                    "name": "dataFrame",
                    "required": false,
                    "value": "@LOADFROMPATH"
                },
                {
                    "type": "text",
                    "name": "jdbcOptions",
                    "required": false
                },
                {
                    "type": "text",
                    "name": "saveMode",
                    "required": false
                }
            ],
            "engineMeta": {
                "spark": "JDBCSteps.writeWithJDBCOptions",
                "pkg": "com.acxiom.pipeline.steps"
            },
            "stepId": "c9fddf52-34b1-4216-a049-10c33ccd24ab"
        }
    ]
}
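
In this example, the @LOADFROMPATH value maps the DataFrame returned by the LOADFROMPATH step into the dataFrame parameter of the WRITE step. Once parsed, the pipeline can be executed; the sketch below assumes the metalus-core PipelineExecutor API and a PipelineContext built by a DriverSetup implementation (names and signatures should be verified against the Metalus version in use):

import com.acxiom.pipeline.{PipelineContext, PipelineExecutor}
import com.acxiom.pipeline.drivers.DriverUtils

// Sketch only: parsePipelineJson is assumed to return Option[List[Pipeline]] and
// executePipelines to accept the pipelines, an optional starting pipeline id,
// and a PipelineContext built elsewhere (typically by a DriverSetup).
def runPipeline(pipelineJson: String, ctx: PipelineContext): Unit = {
  val pipelines = DriverUtils.parsePipelineJson(pipelineJson).getOrElse(List())
  PipelineExecutor.executePipelines(pipelines, None, ctx)
}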