Chapter 3. ETL with Spark

So we have gone through the architecture of Spark, and have had some detailed level discussions around RDDs. By the end of Chapter 2Transformations and Actions with Spark RDDs, we had focused on PairRDDs and some of the transformations.

This chapter focuses on doing ETL with Apache Spark. We'll cover the following topics, which hopefully will help you with taking the next step on Apache Spark:

  • Understanding the ETL process
  • Commonly supported file formats
  • Commonly supported filesystems
  • Working with NoSQL databases

Let's get started!

What is ETL?

ELT stands for Extraction, Transformation,and Loading. The term has been around for decades and it represents an industry standard representing the data movement and transformation process ...

Get Learning Apache Spark 2 now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.