# metalus

This project aims to make writing Spark applications easier by abstracting the effort to assemble the driver into reusable steps and pipelines.


# DataSteps

DataSteps provides the user with steps that help process data, including groupBy, join, union, and adding columns.

## join()

This step joins two DataFrames. An optional string join expression can be provided. Consult the Spark documentation for a complete list of supported join types.

Input Parameters

| Name | Type | Description | Default |
| --- |:--- |:--- |:---:|
| left | Dataset[_] | left side of the join | n/a |
| right | Dataset[_] | right side of the join | n/a |
| expression | String | optional join expression | n/a |
| leftAlias | String | alias for the left side of the join | left |
| rightAlias | String | alias for the right side of the join | right |
| joinType | String | type of join to perform | inner |
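A rough sketch of the equivalent Spark code (the sample data and join expression below are illustrative, not part of the metalus implementation):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val left = Seq((1, "a"), (2, "b")).toDF("id", "value")
val right = Seq((1, "x"), (3, "y")).toDF("id", "other")

// Alias both sides so the string expression can disambiguate columns,
// then join using the expression and the requested join type
val joined = left.alias("left")
  .join(right.alias("right"), expr("left.id = right.id"), "inner")
```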

## groupBy()

This step performs a group by operation on the provided DataFrame. The resulting DataFrame will contain a combination of the grouping expressions and aggregations provided.

Input Parameters

| Name | Type | Description | Default |
| --- |:--- |:--- |:---:|
| dataFrame | Dataset[_] | data frame to group | n/a |
| groupings | List[String] | list of string expressions to group on | n/a |
| aggregations | List[String] | list of string aggregate expressions | n/a |
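The idea can be sketched in plain Spark, where each grouping and aggregation is a string expression parsed by `expr` (sample data and expressions are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "amount")

// groupings = List("key"), aggregations = List("sum(amount)", "max(amount)")
val grouped = df
  .groupBy(expr("key"))
  .agg(expr("sum(amount)"), expr("max(amount)"))
```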

## union()

This step performs a union of two DataFrames. The underlying implementation calls Spark's `unionByName` function, so column order is irrelevant. UNION ALL and UNION DISTINCT behavior can be toggled via the distinct flag; a DISTINCT is performed by default.

Input Parameters

| Name | Type | Description | Default |
| --- |:--- |:--- |:---:|
| dataFrame | Dataset[_] | initial data frame | n/a |
| append | Dataset[_] | data frame to append | n/a |
| distinct | Boolean | flag to indicate whether a distinct should be performed | true |
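A minimal sketch of the behavior in plain Spark (sample data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val first = Seq((1, "a"), (2, "b")).toDF("id", "value")
val second = Seq((2, "b"), (3, "c")).toDF("id", "value")

// unionByName matches columns by name rather than position;
// distinct() gives the default UNION DISTINCT behavior
val unioned = first.unionByName(second).distinct()
```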

## applyFilter()

This step applies a filter to an existing DataFrame, returning only the rows that satisfy the expression criteria. The expression is passed as a String and acts much like a WHERE clause in a SQL statement. Any column on the input DataFrame can be referenced in the expression.

Input Parameters

| Name | Type | Description | Default |
| --- |:--- |:--- |:---:|
| dataFrame | Dataset[_] | a data frame containing data to be filtered | n/a |
| expression | String | the expression containing the filter criteria | n/a |
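In plain Spark, a string filter expression works like this (sample data and expression are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

// The expression behaves like a SQL WHERE clause over the DataFrame's columns
val filtered = df.filter("id > 1 AND value != 'c'")
```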

## addUniqueIdToDataFrame()

This step adds a unique id to an existing DataFrame (using the `monotonically_increasing_id` Spark SQL function).

Input Parameters

| Name | Type | Description | Default |
| --- |:--- |:--- |:---:|
| idColumnName | String | the name to use for the newly created attribute | n/a |
| dataFrame | Dataset[_] | the data frame to modify | n/a |
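A sketch of the underlying Spark call (column name and sample data are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")

// Ids are guaranteed unique and monotonically increasing,
// but not necessarily consecutive across partitions
val withId = df.withColumn("row_id", monotonically_increasing_id())
```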

## addStaticColumnToDataFrame()

This step adds a column with a static value to every row of an existing DataFrame.

Input Parameters

| Name | Type | Description | Default |
| --- |:--- |:--- |:---:|
| dataFrame | Dataset[_] | the data frame to modify | n/a |
| columnName | String | the name for the new attribute | n/a |
| columnValue | Any | the value to set for the new attribute on each row | n/a |
| standardizeColumnName | Boolean | flag to control whether the column name should be standardized | true |
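In plain Spark this amounts to a `lit` column (column name and value below are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("a", "b").toDF("value")

// lit wraps the static value as a Column applied to every row
val withStatic = df.withColumn("source", lit("my-static-value"))
```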

## getDataFrameCount()

This step returns a count of the records in a DataFrame.

Input Parameters

| Name | Type | Description | Default |
| --- |:--- |:--- |:---:|
| dataFrame | Dataset[_] | the DataFrame to count | n/a |
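The equivalent Spark call is a single action (sample data is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")

// count() is an action: it triggers execution and returns a Long
val recordCount: Long = df.count()
```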

## dropDuplicateRecords()

This step drops duplicate records from a DataFrame, using the provided column names to determine uniqueness.

Input Parameters

| Name | Type | Description | Default |
| --- |:--- |:--- |:---:|
| dataFrame | Dataset[_] | the DataFrame to drop duplicate records from | n/a |
| columnNames | List[String] | column names to use for determining distinct values to drop | n/a |
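A sketch in plain Spark (sample data and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "value")

// Only the listed columns are considered when deciding what counts as a duplicate
val deduped = df.dropDuplicates(Seq("id", "value"))
```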

## renameColumn()

This step renames a column in a DataFrame.

Input Parameters

| Name | Type | Description | Default |
| --- |:--- |:--- |:---:|
| dataFrame | Dataset[_] | the DataFrame to change | n/a |
| oldColumnName | String | the name of the column you want to change | n/a |
| newColumnName | String | the new name to give the column | n/a |
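The underlying operation maps to a single Spark call (column names and data below are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a")).toDF("id", "oldName")

// withColumnRenamed is a no-op if the old column name does not exist
val renamed = df.withColumnRenamed("oldName", "newName")
```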