
Data Options

Read and write options need to be easy to configure and reusable across steps and pipelines. The DataFrameReaderOptions and DataFrameWriterOptions are provided as a way to easily model reusable instructions.

DataFrame Reader Options

This object represents options that will be translated into Spark settings at read time.

  • format - The file format to use. Defaulted to parquet.
  • schema - An optional com.acxiom.pipeline.steps.Schema which will be applied to the DataFrame.
  • options - Optional map of settings to pass to the underlying Spark data source.
  • streaming - Boolean that indicates if this object should build a streaming DataFrame. Default is false.

Example:

    {
      "format": "csv",
      "options": {
        "header": "true",
        "delimiter": ","
      },
      "schema": {
        "attributes": [
          {
            "name": "CUSTOMER_ID",
            "dataType": {
              "baseType": "Integer"
            }
          },
          {
            "name": "FIRST_NAME",
            "dataType": {
              "baseType": "String"
            }
          }
        ]
      }
    }
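
Since these options are simply translated into Spark settings, the JSON above corresponds roughly to the Scala read below, shown as a minimal sketch. The readWithOptions helper and the path parameter are illustrative, not part of the library; only the Spark calls themselves are standard.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // Minimal sketch: roughly what the reader options above resolve to at read time.
    def readWithOptions(spark: SparkSession, path: String): DataFrame = {
      // The "schema" attributes from the JSON, expressed as a Spark StructType
      val schema = StructType(Seq(
        StructField("CUSTOMER_ID", IntegerType),
        StructField("FIRST_NAME", StringType)
      ))
      spark.read
        .format("csv")                                        // "format"
        .options(Map("header" -> "true", "delimiter" -> ",")) // "options"
        .schema(schema)                                       // "schema"
        .load(path)
    }

If streaming were set to true, the same settings would apply through spark.readStream rather than spark.read.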
    

DataFrame Writer Options

This object represents options that will be translated into Spark settings at write time.

  • format - The file format to use. Defaulted to parquet.
  • saveMode - The mode to use when writing a DataFrame. Defaulted to "Overwrite".
  • options - Optional map of settings to pass to the underlying Spark data source.
  • bucketingOptions - Optional BucketingOptions object for configuring bucketing.
  • partitionBy - Optional list of columns for partitioning.
  • sortBy - Optional list of columns for sorting.

Example:
    {
      "format": "parquet",
      "saveMode": "Overwrite",
      "bucketingOptions": {},
      "options": {
        "escapeQuotes": false
      },
      "schema": {
        "attributes": [
          {
            "name": "CUSTOMER_ID",
            "dataType": {
              "baseType": "Integer"
            }
          },
          {
            "name": "FIRST_NAME",
            "dataType": {
              "baseType": "String"
            }
          }
        ]
      }
    }
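
As with the reader, these settings map onto Spark's DataFrameWriter. A minimal sketch of what the example above resolves to follows; the writeWithOptions helper and the output path are illustrative. When present, partitionBy maps onto the writer method of the same name, and sortBy applies together with bucketing (see the next section).

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Minimal sketch: roughly what the writer options above resolve to at write time.
    def writeWithOptions(df: DataFrame, path: String): Unit = {
      df.write
        .format("parquet")               // "format"
        .mode(SaveMode.Overwrite)        // "saveMode"
        .option("escapeQuotes", "false") // "options"
        .save(path)
    }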
    

Bucketing Options

This object represents the options used to bucket data when it is being written by Spark.

  • numBuckets - The number of buckets (partitions) to build.
  • columns - A list of columns to bucket by.
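
To make the mapping concrete, here is a minimal sketch of a bucketed write in plain Spark, assuming a numBuckets of 4 and CUSTOMER_ID as the bucketing column; the table name is illustrative. Note that Spark only supports bucketed writes through saveAsTable, not a path-based save.

    import org.apache.spark.sql.DataFrame

    // Minimal sketch: numBuckets = 4, columns = ["CUSTOMER_ID"]
    def writeBucketed(df: DataFrame): Unit = {
      df.write
        .format("parquet")
        .bucketBy(4, "CUSTOMER_ID") // numBuckets, then the bucketing columns
        .sortBy("CUSTOMER_ID")      // optional sort columns within each bucket
        .saveAsTable("customers")   // bucketing requires a table target
    }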