Chapter 8. Operations on Streaming Data

Spark Structured Streaming was first introduced in Apache Spark 2.0. The main goal of Structured Streaming is to enable building near-real-time streaming applications on Spark. Structured Streaming replaced an older, lower-level API called DStreams (Discretized Streams), which was based on Spark's older RDD model. Since then, Structured Streaming has gained many optimizations and connectors, including integration with Delta Lake.

Delta Lake is integrated with Spark Structured Streaming through its two major APIs: readStream and writeStream. Delta tables can be used as both streaming sources and streaming sinks. Delta Lake overcomes many limitations typically associated with streaming systems, including:

  • Coalescing small files produced by low-latency ingestion

  • Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs)

  • Leveraging the Delta transaction log for efficient discovery of which files are new when using files for a source stream
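The source-and-sink pattern described above can be sketched in a few lines of PySpark. This is a minimal, hedged sketch, not the book's own example: the table paths (`/data/source`, `/data/target`, `/data/_checkpoint`) and the function name `run_stream` are hypothetical, and it assumes the `pyspark` and `delta-spark` packages are installed.

```python
# Minimal sketch: stream changes from one Delta table into another.
# Paths and names are illustrative; requires pyspark + delta-spark.

def run_stream(source_path: str, target_path: str, checkpoint_path: str):
    from pyspark.sql import SparkSession

    # Standard Delta Lake session configuration.
    spark = (
        SparkSession.builder.appName("delta-stream-sketch")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Delta table as a streaming source: the transaction log tells
    # Spark exactly which files are new since the last micro-batch,
    # so no expensive directory listing is needed.
    source_df = spark.readStream.format("delta").load(source_path)

    # Delta table as a streaming sink: the checkpoint combined with
    # the transaction log provides exactly-once guarantees.
    query = (
        source_df.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .start(target_path)
    )
    return query


if __name__ == "__main__":
    query = run_stream("/data/source", "/data/target", "/data/_checkpoint")
    query.awaitTermination()
```

Note that the `checkpointLocation` option is what lets a restarted query resume from its last committed offset rather than reprocessing the whole source table.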

We will start this chapter with a quick review of Spark Structured Streaming, followed by an initial overview of Delta Lake streaming and its unique capabilities. Next, we will walk through a small “Hello Streaming World!” Delta Lake streaming example. While limited in scope, this example will provide an opportunity to understand the details of the Delta Lake streaming programming model in a very simple context.

Incremental processing of data has become a popular ETL model. The ...

Get Delta Lake: Up and Running now with the O’Reilly learning platform.
