Databricks Certified Data Engineer Associate Study Guide

Book description

Data engineers proficient in Databricks are currently in high demand. As organizations gather more data than ever before, skilled data engineers on platforms like Databricks become critical to business success. The Databricks Data Engineer Associate certification is proof that you have a complete understanding of the Databricks platform and its capabilities, as well as the essential skills to effectively execute various data engineering tasks on the platform.

In this comprehensive study guide, you will build a strong foundation in all topics covered on the certification exam, including the Databricks Lakehouse and its tools and benefits. You'll also learn to develop ETL pipelines in both batch and streaming modes. Moreover, you'll discover how to orchestrate data workflows and design dashboards while maintaining data governance. Finally, you'll dive into the finer points of exactly what's on the exam and learn to prepare for it with mock tests.

Author Derar Alhussein not only teaches you the fundamental concepts but also provides hands-on exercises to reinforce your understanding. From setting up your Databricks workspace to deploying production pipelines, each chapter is carefully crafted to equip you with the skills needed to master the Databricks Platform. By the end of this book, you'll know everything you need to pass the Databricks Data Engineer Associate certification exam with flying colors and start your career as a Databricks-certified data engineer!

You'll learn how to:

  • Use the Databricks Platform and Delta Lake effectively
  • Perform advanced ETL tasks using Apache Spark SQL
  • Design multi-hop architecture to process data incrementally
  • Build production pipelines using Delta Live Tables and Databricks Jobs
  • Implement data governance using Databricks SQL and Unity Catalog

Derar Alhussein is a senior data engineer with a master's degree in data mining. He has over a decade of hands-on experience in software and data projects, including large-scale projects on Databricks. He currently holds eight certifications from Databricks, showcasing his proficiency in the field. Derar is also an experienced instructor, with a proven track record of success in training thousands of data engineers, helping them to develop their skills and obtain professional certifications.

Table of contents

  1. Managing Data with Delta Lake
    1. Introducing Delta Lake
      1. What is Delta Lake?
      2. Delta Lake Transaction Log
      3. Understanding Delta Lake Functionality
      4. Delta Lake Advantages
    2. Working with Delta Lake Tables
      1. Creating Tables
      2. Catalog Explorer
      3. Inserting Data
      4. Exploring Table Directory
      5. Exploring Table History
    3. Exploring Delta Time Travel
      1. Querying Older Versions
      2. Rolling Back to Previous Versions
    4. Optimizing Delta Lake Tables
      1. Z-Order Indexing
    5. Vacuuming
      1. Vacuuming in Action
    6. Dropping Delta Lake Tables
  2. Mastering Relational Entities in Databricks
    1. Understanding Relational Entities
      1. Databases in Databricks
      2. Tables in Databricks
    2. Putting Relational Entities Into Practice
      1. Working in the default Schema
      2. Working in a New Schema
      3. Working in a Custom-Location Schema
    3. Setting Up Delta Tables
      1. CTAS Statements
      2. Comparing CREATE TABLE vs. CTAS
      3. Table Constraints
      4. Cloning Delta Lake Tables
    4. Exploring Views
      1. View Types
      2. Comparison of View Types
  3. Transforming Data with Apache Spark
    1. Querying Data Files
      1. Querying JSON Format
      2. Querying text Format
      3. Querying binaryFile Format
      4. Querying Non-Self-Describing Formats
      5. Registering Tables on Foreign Data Sources
    2. Writing to Tables
      1. Replacing Data
      2. Appending Data
      3. Merging Data
    3. Performing Advanced ETL Transformations
      1. Dealing with Nested JSON Data
      2. Parsing JSON into Struct Type
      3. Interacting with Struct Type
      4. Flattening Struct Types
      5. Leveraging the Explode Function
      6. Aggregating Unique Values
      7. Mastering Join Operations in Spark SQL
      8. Exploring Set Operations in Spark SQL
      9. Changing Data Perspectives
    4. Working with Higher Order Functions
      1. Filter Function
      2. Transform Function
    5. Developing SQL UDFs
      1. Creating UDFs
      2. Applying UDFs
      3. Analyzing UDFs
      4. Complex Logic UDFs
      5. Dropping UDFs
  4. Processing Incremental Data
    1. Streaming Data with Apache Spark
      1. What is a Data Stream?
      2. Spark Structured Streaming
      3. Delta Lake as Streaming Source
      4. Streaming Query Configurations
      5. Structured Streaming Guarantees
      6. Unsupported Operations
    2. Implementing Structured Streaming
      1. Streaming Data Manipulations in SQL
      2. Streaming Data Manipulations in Python
    3. Incremental Data Ingestion
      1. Introducing Data Ingestion
      2. COPY INTO Command
      3. Auto Loader
      4. Comparison of Ingestion Mechanisms
    4. Auto Loader in Action
      1. Setting up Auto Loader
      2. Observing Auto Loader
      3. Exploring Table History
      4. Cleaning Up
    5. Multi-Hop Architecture
      1. Introducing Multi-Hop Architecture
      2. Benefits of Multi-Hop Architecture
      3. Building Multi-Hop Architectures
      4. Establishing the Bronze Layer
      5. Transitioning to the Silver Layer
      6. Advancing to the Gold Layer
      7. Stopping Active Streams

Product information

  • Title: Databricks Certified Data Engineer Associate Study Guide
  • Author(s): Derar Alhussein
  • Release date: February 2025
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098166830