Chapter 20. Apache Spark
This chapter demonstrates recipes for Apache Spark.
The Spark website describes Spark as a “unified analytics engine for large-scale data processing.” This means that it’s a big data framework that lets you analyze your data with different techniques—such as treating the data as a spreadsheet or as a database—and runs on distributed clusters. You can use Spark to analyze datasets that are so large that they span thousands of computers.
While Spark is designed to work with enormous datasets on clusters of computers, a great thing about it is that you can learn how to use Spark on your own computer with just a few example files.
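For example, here's a minimal way to try Spark locally, assuming you've downloaded a Spark distribution and that a plain-text file named README.md exists in the directory where you start the shell (both of those details are assumptions for this sketch, not something this chapter prescribes). The spark-shell command starts a Scala REPL with a SparkContext already available as sc:

    // start the shell with: $SPARK_HOME/bin/spark-shell
    // `sc` (a SparkContext) is created for you

    // read a local file as a distributed collection of lines
    val lines = sc.textFile("README.md")

    // familiar Scala collection style: keep the lines that mention Spark
    val sparkLines = lines.filter(_.contains("Spark"))

    // count() is an action, so this is where the work actually runs
    println(sparkLines.count())

Everything here runs on your own computer, but the same code works unchanged when the file lives in a distributed filesystem and the computation spans a cluster.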
Spark 3.1.1
The examples in this chapter use Spark 3.1.1, which was released in March 2021 and is the latest release at the time of this writing. Spark currently works only with Scala 2.12, so the examples in this chapter also use Scala 2.12. However, because working with Spark generally involves using collection methods like map and filter, or SQL queries, you'll barely notice the difference between Scala 2 and Scala 3 in these examples.
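To make that concrete, here's a small, self-contained sketch of both styles, collection methods and SQL, using a SparkSession that runs locally (the object name and the data are illustrative assumptions, not code from the recipes):

    import org.apache.spark.sql.SparkSession

    object CollectionAndSqlSketch extends App {
      // a SparkSession that runs locally, using all cores ("local[*]")
      val spark = SparkSession.builder()
        .appName("CollectionAndSqlSketch")
        .master("local[*]")
        .getOrCreate()
      import spark.implicits._

      // collection style: a Dataset supports map and filter, just like a List
      val nums = List(1, 2, 3, 4, 5, 6).toDS()
      val doubledEvens = nums.filter(_ % 2 == 0).map(_ * 2)
      doubledEvens.show()

      // SQL style: register the Dataset as a view and query it
      nums.createOrReplaceTempView("nums")
      spark.sql("SELECT value * 2 AS doubled FROM nums WHERE value % 2 = 0").show()

      spark.stop()
    }

Whichever style you choose, the code reads the same whether the data is a six-element list or a table with billions of rows.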
The recipes in this chapter show how to work with Spark on your own computer, while demonstrating the key concepts that work on datasets that span thousands of computers. Recipe 20.1 shows how to get started with Spark and digs into one of its fundamental concepts, the Resilient Distributed Dataset, or RDD. An RDD lets you treat a dataset that's distributed across a cluster as though it were a single Scala collection.
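As a quick preview of that idea (a sketch only; the names here are illustrative, not taken from Recipe 20.1), you can turn an ordinary local collection into an RDD in the Spark shell and work with it using the same collection methods:

    // in spark-shell: distribute a local collection across partitions
    val nums = sc.parallelize(1 to 100)

    // transformations like filter and map are lazy; nothing runs yet
    val squaresOfEvens = nums.filter(_ % 2 == 0).map(n => n * n)

    // take() is an action: it triggers the computation and returns results
    println(squaresOfEvens.take(5).mkString(", "))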