Data Engineering with dbt

Use easy-to-apply patterns in SQL and Python to adopt modern analytics engineering and build agile dbt platforms that are well tested and simple to extend and run.

Purchase of the print or Kindle book includes a free PDF eBook.

Key Features

  • Build a solid dbt base and learn data modeling and the modern data stack to become an analytics engineer
  • Build automated and reliable pipelines to deploy, test, run, and monitor ELTs with dbt Cloud
  • Guided dbt + Snowflake project to build a pattern-based architecture that delivers reliable datasets

Book Description

dbt Cloud helps professional analytics engineers automate the application of powerful and proven patterns to transform data from ingestion to delivery, enabling real DataOps.

This book begins by introducing you to dbt and its role in the data stack, along with how it uses simple SQL to build your data platform, helping you and your team work better together. You’ll find out how to leverage data modeling, data quality, master data management, and more to build a simple-to-understand and future-proof solution. As you advance, you’ll explore the modern data stack, understand how data-related careers are changing, and see how dbt enables this transition into the emerging role of an analytics engineer. The chapters help you build a sample project using the free versions of dbt Cloud, Snowflake, and GitHub to create a professional DevOps setup with continuous integration, automated deployments, ELT runs, scheduling, and monitoring, solving practical cases you’ll encounter in your daily work.

By the end of this dbt book, you’ll be able to build an end-to-end pragmatic data platform by ingesting data exported from your source systems, coding the needed transformations, including master data and the desired business rules, and building well-formed dimensional models or wide tables that’ll enable you to build reports with the BI tool of your choice.

What you will learn

  • Create a dbt Cloud account and understand the ELT workflow
  • Combine Snowflake and dbt for building modern data engineering pipelines
  • Use SQL to transform raw data into usable data, and test its accuracy (see the sketch after this list)
  • Write dbt macros and use Jinja to apply software engineering principles
  • Test data and transformations to ensure reliability and data quality
  • Build a lightweight pragmatic data platform using proven patterns
  • Write easy-to-maintain, idempotent code using dbt materializations
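
To make these outcomes concrete, here is a minimal sketch of the kind of model and test you will learn to write. It assumes a declared source named abc_bank with a position table; all names here are illustrative, not code from the book:

    -- models/staging/stg_abc_bank_position.sql
    -- A staging model: read one declared source table and rename raw columns.
    with src_data as (
        select * from {{ source('abc_bank', 'position') }}
    )
    select
        accountid as account_code,
        symbol    as security_code,
        quantity,
        cost_base
    from src_data

    # models/staging/stg_abc_bank_position.yml
    # Declare expectations that dbt verifies when you execute dbt test.
    version: 2
    models:
      - name: stg_abc_bank_position
        columns:
          - name: account_code
            tests:
              - not_null
          - name: security_code
            tests:
              - not_null

Running dbt run builds the model in your target schema, and dbt test checks the declared expectations; the hands-on chapters show how to automate both with dbt Cloud jobs.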

Who this book is for

This book is for data engineers, analytics engineers, BI professionals, and data analysts who want to learn how to build simple, future-proof, and maintainable data platforms in an agile way. Project managers, data team managers, and decision makers looking to understand the importance of building a data platform and to foster a culture of high-performing data teams will also find this book useful. Basic knowledge of SQL and data modeling will help you get the most out of the many layers of this book. The book also includes primers on many data-related subjects to help juniors get started.

Table of contents

  1. Data Engineering with dbt
  2. Contributors
  3. About the author
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Conventions used
    6. Get in touch
    7. Share Your Thoughts
    8. Download a free PDF copy of this book
  6. Part 1: The Foundations of Data Engineering
  7. Chapter 1: The Basics of SQL to Transform Data
    1. Technical requirements
    2. Introducing SQL
    3. SQL basics – core concepts and commands
      1. SQL core concepts
      2. Understanding the categories of SQL commands
    4. Setting up a Snowflake database with users and roles
      1. Creating your Snowflake account
      2. Setting up initial users, roles, and a database in Snowflake
      3. Creating and granting your first role
    5. Querying data in SQL – syntax and operators
      1. Snowflake query syntax
      2. SQL operators
    6. Combining data in SQL – the JOIN clause
      1. Combining orders and customers
      2. JOIN types
      3. Visual representation of join types
    7. Advanced – introducing window functions
      1. Window definition
      2. Window frame definition
    8. Summary
    9. Further reading
  8. Chapter 2: Setting Up Your dbt Cloud Development Environment
    1. Technical requirements
    2. Setting up your GitHub account
      1. Introducing Version Control
      2. Creating your GitHub account
      3. Setting up your first repository for dbt
    3. Setting up your dbt Cloud account
      1. Signing up for a dbt Cloud account
      2. Setting up your first dbt Cloud project
      3. Adding the default project to an empty repository
    4. Comparing dbt Core and dbt Cloud workflows
      1. dbt Core workflows
      2. dbt Cloud workflows
    5. Experimenting with SQL in dbt Cloud
      1. Exploring the dbt Cloud IDE
      2. Executing SQL from the dbt IDE
    6. Introducing the source and ref dbt functions
      1. Exploring the dbt default model
      2. Using ref and source to connect models
      3. Running your first models
      4. Testing your first models
      5. Editing your first model
    7. Summary
    8. Further reading
  9. Chapter 3: Data Modeling for Data Engineering
    1. Technical requirements
    2. What is and why do we need data modeling?
      1. Understanding data
      2. What is data modeling?
      3. Why we need data modeling
      4. Complementing a visual data model
    3. Conceptual, logical, and physical data models
      1. Conceptual data model
      2. Logical data model
      3. Physical data model
      4. Tools to draw data models
    4. Entity-Relationship modeling
      1. Main notation
      2. Cardinality
      3. Time perspective
      4. An example of an E-R model at different levels of detail
      5. Generalization and specialization
    5. Modeling use cases and patterns
      1. Header-detail use case
      2. Hierarchical relationships
      3. Forecasts and actuals
      4. Libraries of standard data models
    6. Common problems in data models
      1. Fan trap
      2. Chasm trap
    7. Modeling styles and architectures
      1. Kimball method or dimensional modeling or star schema
      2. Unified Star Schema
      3. Inmon design style
      4. Data Vault
      5. Data mesh
      6. Our approach, the Pragmatic Data Platform – PDP
    8. Summary
    9. Further reading
  10. Chapter 4: Analytics Engineering as the New Core of Data Engineering
    1. Technical requirements
    2. The data life cycle and its evolution
      1. Understanding the data flow
      2. Data creation
      3. Data movement and storage
      4. Data transformation
      5. Business reporting
      6. Feeding back to the source systems
    3. Understanding the modern data stack
      1. The traditional data stack
      2. The modern data stack
    4. Defining analytics engineering
      1. The roles in the modern data stack
      2. The analytics engineer
    5. DataOps – software engineering best practices for data
      1. Version control
      2. Quality assurance
      3. The modularity of the code base
      4. Development environments
      5. Designing for maintainability
    6. Summary
    7. Further reading
  11. Chapter 5: Transforming Data with dbt
    1. Technical requirements
    2. The dbt Core workflow for ingesting and transforming data
    3. Introducing our stock tracking project
      1. The initial data model and glossary
      2. Setting up the project in dbt, Snowflake, and GitHub
    4. Defining data sources and providing reference data
      1. Defining data sources in dbt
      2. Loading the first data for the portfolio project
    5. How to write and test transformations
      1. Writing the first dbt model
      2. Real-time lineage and project navigation
      3. Deploying the first dbt model
      4. Committing the first dbt model
      5. Configuring our project and where we store data
      6. Re-deploying our environment to the desired schema
      7. Configuring the layers for our architecture
      8. Ensuring data quality with tests
      9. Generating the documentation
    6. Summary
  12. Part 2: Agile Data Engineering with dbt
  13. Chapter 6: Writing Maintainable Code
    1. Technical requirements
    2. Writing code for humans
      1. Refactoring our initial model to be human-readable
    3. Creating the architectural layers
      1. Creating the Staging layer
      2. Goals and contents of the staging models
      3. Connecting the REF model to the STG
      4. Goals and contents of the refined layer
      5. Creating the first data mart
    4. Saving history is crucial
      1. Saving history with dbt
      2. Saving history using snapshots
      3. Connecting the REF layer with the snapshot
    5. Summary
  14. Chapter 7: Working with Dimensional Data
    1. Adding dimensional data
      1. Creating clear data models for the refined and data mart layers
    2. Loading the data of the first dimension
      1. Creating and loading a CSV as a seed
      2. Configuring the seeds and loading them
      3. Adding data types and a load timestamp to your seed
    3. Building the STG model for the first dimension
      1. Defining the external data source for seeds
      2. Creating an STG model for the security dimension
      3. Adding the default record to the STG
    4. Saving history for the dimensional data
      1. Saving the history with a snapshot
    5. Building the REF layer with the dimensional data
    6. Adding the dimensional data to the data mart
    7. Exercise – adding a few more hand-maintained dimensions
    8. Summary
  15. Chapter 8: Delivering Consistency in Your Data
    1. Technical requirements
    2. Keeping consistency by reusing code – macros
      1. Repetition is inherent in data projects
      2. Why copy and paste kills your future self
      3. How to write a macro
      4. Refactoring the “current” CTE into a macro
      5. Fixing data loaded from our CSV file
      6. The basics of macro writing
    3. Building on the shoulders of giants – dbt packages
      1. Creating dbt packages
      2. How to import a package in dbt
      3. Browsing through noteworthy packages for dbt
      4. Adding the dbt-utils package to our project
    4. Summary
  16. Chapter 9: Delivering Reliability in Your Data
    1. Testing to provide reliability
      1. Types of tests
      2. Singular tests
      3. Generic tests
      4. Defining a generic test
    2. Testing the right things in the right places
      1. What do we test?
      2. Where to test what?
      3. Testing our models to ensure good quality
    3. Summary
  17. Chapter 10: Agile Development
    1. Technical requirements
    2. Agile development and collaboration
      1. Defining agile development
    3. Applying agile to data engineering
      1. Starting a project in an agile way
      2. Organizing work the agile way
      3. Managing the backlog in an agile way
    4. Building reports in an agile way
      1. S1 – designing a light data model for the data mart
      2. S2 – designing a light data model for the REF layer
      3. S3.x – developing with dbt models the pipeline for the XYZ table
      4. S4 – an acceptance test of the data produced in the data mart
      5. S5 – development and verification of the report in the BI application
    5. Summary
  18. Chapter 11: Team Collaboration
    1. Enabling collaboration
      1. Core collaboration practices
      2. Collaboration with dbt Cloud
    2. Working with branches and PRs
    3. Working with Git in dbt Cloud
      1. The dbt Cloud Git process
      2. Keeping your development environment healthy
      3. Suggested Git branch naming
      4. Adopting frequent releases
    4. Making your first PR
    5. Summary
    6. Further reading
  19. Part 3: Hands-On Best Practices for Simple, Future-Proof Data Platforms
  20. Chapter 12: Deployment, Execution, and Documentation Automation
    1. Technical requirements
    2. Designing your deployment automation
      1. Working with dbt environments
      2. Creating our QA and PROD environments
      3. Deciding where to deploy
      4. Creating jobs
      5. Designing the architecture of your data platform
      6. Notifications
    3. Advanced automation – hooks and run-operations
      1. Hooks
      2. Run-operations
      3. Table migrations
    4. Documentation
      1. Lineage graph
      2. dbt-generated documentation
      3. Source freshness report
      4. Exposures
      5. Markdown documentation
    5. Summary
  21. Chapter 13: Moving Beyond the Basics
    1. Technical requirements
    2. Building for modularity
      1. Modularity in the storage layer
      2. Modularity in the refined layer
      3. Modularity in the delivery layer
    3. Managing identity
      1. Identity and semantics – defining your concepts
      2. Different types of keys
      3. Main uses of keys
    4. Master Data management
      1. Data for Master Data management
      2. A light MDM approach with dbt
    5. Saving history at scale
      1. Understanding the save_history macro
      2. Understanding the current_from_history macro
    6. Summary
  22. Chapter 14: Enhancing Software Quality
    1. Technical requirements
    2. Refactoring and evolving models
      1. Dealing with technical debt
    3. Implementing real-world code and business rules
      1. Replacing snapshots with HIST tables
      2. Renaming the REF_ABC_BANK_SECURITY_INFO model
      3. Handling orphans in facts
      4. Calculating closed positions
      5. Calculating transactions
    4. Publishing dependable datasets
      1. Managing data marts like APIs
      2. What shape should you use for your data mart?
      3. Self-completing dimensions
      4. History in reports – that is, slowly changing dimensions type two
    5. Summary
    6. Further reading
  23. Chapter 15: Patterns for Frequent Use Cases
    1. Technical requirements
    2. Ingestion patterns
      1. Basic setup for ingestion
      2. Loading data from files
      3. External tables
      4. Landing tables
    3. History patterns
      1. Storing history with deletions – full load
      2. Storing history with deletions – deletion list
      3. Storing history with multiple versions in the input
      4. Storing history with PII and GDPR compliance
      5. History and schema evolution
    4. Summary
    5. Further reading
  24. Index
    1. Why subscribe?
  25. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts
    3. Download a free PDF copy of this book

Product information

  • Title: Data Engineering with dbt
  • Author(s): Roberto Zagni
  • Release date: June 2023
  • Publisher(s): Packt Publishing
  • ISBN: 9781803246284