metalus

This project aims to make writing Spark applications easier by abstracting the effort to assemble the driver into reusable steps and pipelines.

View project on GitHub
Documentation Home Common Home

Streaming Query Monitor

Streaming Query Monitors provide a method for interacting with a running Spark StreamingQuery object. Implementations can be created to perform different types of monitoring, control whether the query is stopped, whether to continue and provide a map of variables that will be placed on the globals for the next step. A step is provided which is designed to use the monitors and provide a decision to continue or stop.

BaseStreamingQueryMonitor

This implementation doesn’t do anything but allow continuous streaming. The query doesn’t actually get monitored or stopped.

BatchWriteStreamingQueryMonitor

This abstract class provides a base implementation for streaming queries that need to write to disk. Basic status checking is provided and abstract functions that can be overridden to control behavior. This class uses the following globals to determine behavior:

  • STREAMING_BATCH_MONITOR_TYPE - Either duration or count.
  • STREAMING_BATCH_MONITOR_DURATION - A number representing the time amount
  • STREAMING_BATCH_MONITOR_DURATION_TYPE - Either milliseconds, seconds or minutes.
  • STREAMING_BATCH_MONITOR_COUNT - A number representing the approximate number of records to process before stopping the query.

    BatchPartitionedStreamingQueryMonitor (com.acxiom.pipeline.streaming.BatchPartitionedStreamingQueryMonitor)

    This implementation will process until the limit is reached and then stop the query. The query should continue and a new partition value will be provided on the globals. The partition value will be pushed to the global specified using the STREAMING_BATCH_PARTITION_GLOBAL global. Note: When using this monitor, the checkpointLocation must be specified so that a single location is used throughout.

  • STREAMING_BATCH_PARTITION_COUNTER - Track the number of times this has been invoked.
  • STREAMING_BATCH_PARTITION_GLOBAL - The name of the global key that contains the name of the partition value.
  • STREAMING_BATCH_PARTITION_TEMPLATE - Indicates whether to use a counter or date.

The date sting will be formatted using this template: yyyy-dd-MM HH:mm:ssZ

(Experimental) BatchFileStreamingQueryMonitor (com.acxiom.pipeline.streaming.BatchFileStreamingQueryMonitor)

This implementation will process until the limit is reached and then stop the query. The query should continue and a new destination value will be provided on the globals. The destination will be modified by appending an underscore and the template value (counter or date). The destination global will then be updated with the new value.

The following parameters are required to use this monitor:

  • STREAMING_BATCH_OUTPUT_PATH_KEY - Item in the path that needs to be tagged.
  • STREAMING_BATCH_OUTPUT_GLOBAL - Provides the global key to get the path being used for output.
  • STREAMING_BATCH_OUTPUT_TEMPLATE - Indicates whether to use a counter or date.
  • STREAMING_BATCH_OUTPUT_COUNTER - Track the number of times this has been invoked.

The date sting will be formatted using this template: yyyy_dd_MM_HH_mm_ss_SSS