metalus

This project aims to make writing Spark applications easier by abstracting the effort to assemble the driver into reusable steps and pipelines.

JSON Pipelines

A critical part of any application is the pipeline (or pipelines) that gets executed. Building pipelines in JSON gives developers a way to construct Spark applications from existing step libraries without the need to write and deploy code. Once a pipeline is designed, it may be delivered in several ways:

  • As part of an application JSON
  • An API (a custom PipelineManager or DriverSetup would be required)
  • Embedded in a jar
  • A disk location (local, HDFS, S3, etc.) (a custom PipelineManager or DriverSetup would be required)

The Metalus library will then convert the JSON into a Scala object to be executed.
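
As a minimal sketch of that conversion, assuming the metalus-core helper DriverUtils.parsePipelineJson (the exact method name and signature should be verified against the Metalus version in use), a pipeline JSON string loaded from disk could be parsed like this:

import scala.io.Source

import com.acxiom.pipeline.drivers.DriverUtils

// Sketch only: read the pipeline JSON from a local path and hand it to Metalus.
// parsePipelineJson is assumed to return the parsed pipeline(s) as Scala objects.
val pipelineJson = Source.fromFile("/tmp/my_pipeline.json").mkString
val pipelines = DriverUtils.parsePipelineJson(pipelineJson)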

There are three components needed to create a pipeline:

  • Step Template - This is the JSON that describes the step. Pipeline steps start with this template and modify it to satisfy pipeline requirements.
  • Pipeline Step - The step template is copied and modified with additional attributes and mappings before being added to the pipeline JSON.
  • Pipeline - Contains information about the pipeline including which steps to execute at runtime.

Pipeline Base

A pipeline has a base structure that is used to execute steps:

{
    "id": "b6eed286-9f7e-4ca2-a448-a5ccfbf34e6b",
    "name": "My Pipeline",
    "steps": []
}
  • id - A unique GUID that represents this pipeline
  • name - A name that may be used in logging during the execution of the pipeline
  • steps - The list of pipeline steps that will be executed

Optionally, a system may add additional information such as layout (for the steps) and metadata management (create/modified date and user).
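
As a purely illustrative example (the extra attribute names here are hypothetical and would be defined by the managing system, not by Metalus), a pipeline stored by such a system might look like:

{
    "id": "b6eed286-9f7e-4ca2-a448-a5ccfbf34e6b",
    "name": "My Pipeline",
    "creationDate": "2020-01-01T00:00:00Z",
    "modifiedDate": "2020-01-02T12:00:00Z",
    "modifiedUser": "some_user",
    "layout": {},
    "steps": []
}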

Pipeline Steps

A pipeline step begins as a copy of a step template, but several crucial attributes are changed or added. More information can be found here.
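
As a trimmed sketch based on the complete example in the next section, the step template's unique id moves to the stepId attribute, id is replaced with a pipeline-scoped identifier, nextStepId is added to chain the steps together, and parameter values or mappings are filled in:

{
    "id": "LOADFROMPATH",
    "stepId": "87db259d-606e-46eb-b723-82923349640f",
    "nextStepId": "WRITE",
    "displayName": "Load DataFrame from HDFS path",
    "params": [
        {
            "type": "text",
            "name": "path",
            "required": false,
            "value": "/tmp/input_file.csv"
        }
    ]
}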

Pipeline Final

Below is an example of how a basic two-step pipeline may look once complete:

{
    "id": "",
    "name": "",
    "category": "pipeline",
    "steps": [
        {
            "id": "LOADFROMPATH",
            "displayName": "Load DataFrame from HDFS path",
            "description": "This step will read a dataFrame from the given HDFS path",
            "type": "Pipeline",
            "category": "InputOutput",
            "nextStepId": "WRITE",
            "params": [
                {
                    "type": "text",
                    "name": "path",
                    "required": false,
                    "value": "/tmp/input_file.csv"
                },
                {
                    "type": "object",
                    "name": "options",
                    "required": false,
                    "className": "com.acxiom.pipeline.steps.DataFrameReaderOptions"
                }
            ],
            "engineMeta": {
                "spark": "HDFSSteps.readFromPath",
                "pkg": "com.acxiom.pipeline.steps"
            },
            "tags": [
              "metalus-common_2.11-spark_2.4-1.5.0-SNAPSHOT.jar"
            ],
            "stepId": "87db259d-606e-46eb-b723-82923349640f"
        },
        {
            "id": "WRITE",
            "displayName": "Write DataFrame to table using JDBCOptions",
            "description": "This step will write a DataFrame as a table using JDBCOptions",
            "type": "Pipeline",
            "category": "InputOutput",
            "params": [
                {
                    "type": "text",
                    "name": "dataFrame",
                    "required": false,
                    "value": "@LOADFROMPATH"
                },
                {
                    "type": "text",
                    "name": "jdbcOptions",
                    "required": false
                },
                {
                    "type": "text",
                    "name": "saveMode",
                    "required": false
                }
            ],
            "engineMeta": {
                "spark": "JDBCSteps.writeWithJDBCOptions",
                "pkg": "com.acxiom.pipeline.steps"
            },
            "stepId": "c9fddf52-34b1-4216-a049-10c33ccd24ab"
        }
    ]
}
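
In this example, the @LOADFROMPATH value maps the DataFrame returned by the LOADFROMPATH step into the dataFrame parameter of the WRITE step. Once parsed, the pipeline can be executed; the sketch below assumes the metalus-core PipelineExecutor API and a PipelineContext built by a DriverSetup implementation (names and signatures should be verified against the Metalus version in use):

import com.acxiom.pipeline.{PipelineContext, PipelineExecutor}
import com.acxiom.pipeline.drivers.DriverUtils

// Sketch only: parsePipelineJson is assumed to return Option[List[Pipeline]] and
// executePipelines to accept the pipelines, an optional starting pipeline id,
// and a PipelineContext built elsewhere (typically by a DriverSetup).
def runPipeline(pipelineJson: String, ctx: PipelineContext): Unit = {
  val pipelines = DriverUtils.parsePipelineJson(pipelineJson).getOrElse(List())
  PipelineExecutor.executePipelines(pipelines, None, ctx)
}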