Section 1: Data Engineering

This section introduces the uninitiated to the Distributed Computing paradigm and shows how Spark became the de facto standard for big data processing.

Upon completion of this section, you will be able to ingest data from various data sources, cleanse it, integrate it, and write it out to persistent storage such as a data lake in a scalable and distributed manner. You will also be able to build real-time analytics pipelines and perform change data capture in a data lake. You will understand the key differences between the ETL and ELT ways of data processing, and how ELT evolved for the cloud-based data lake world. This section also introduces you to Delta Lake to make cloud-based data lakes more reliable and performant. ...

Get Essential PySpark for Scalable Data Analytics now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.